Class WordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
All Implemented Interfaces:
Tokenizer

public class WordTokenizer extends Object implements Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, but it recognizes URLs and keeps each one in a single token, provided the URL is fully specified, including a protocol (like http://foobar.org).
Author:
Daniel Naber
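
The character-based splitting described in the class summary can be illustrated with a small self-contained sketch. It does not call LanguageTool: the class SimpleWordSplitter and its DELIMITERS string are hypothetical stand-ins, and the characters the real tokenizer splits on are whatever getTokenizingCharacters() returns. The sketch only shows the general idea that each punctuation or whitespace character ends up as its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration; not part of org.languagetool.
public class SimpleWordSplitter {

    // Assumed delimiter set, chosen for this example only.
    static final String DELIMITERS = " \t\n.,;:!?()\"";

    // Splits text into words and returns every delimiter character as its own token.
    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true keeps punctuation and whitespace instead of dropping them
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Yields the five tokens "Hello", ",", " ", "world", "!"
        System.out.println(split("Hello, world!"));
    }
}
```

With the real class, the equivalent call would be new WordTokenizer().tokenize(text), which returns a List<String> as documented under Method Details.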
  • Constructor Details

    • WordTokenizer

      public WordTokenizer()
  • Method Details

    • getProtocols

      public static List<String> getProtocols()
      Get the protocols that the tokenizer knows about.
      Returns:
      currently http, https, and ftp
      Since:
      2.1
    • isUrl

      public static boolean isUrl(String token)
      Since:
      3.0
    • isEMail

      public static boolean isEMail(String token)
      Since:
      3.5
    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface Tokenizer
    • getTokenizingCharacters

      public String getTokenizingCharacters()
      Returns:
      A string containing the characters at which the tokenizer splits text into words.
      Since:
      2.5
    • joinEMailsAndUrls

      protected List<String> joinEMailsAndUrls(List<String> list)
    • joinEMails

      protected List<String> joinEMails(List<String> list)
      Since:
      3.5
    • joinUrls

      protected List<String> joinUrls(List<String> l)
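
The join methods are undocumented here, so the following is only a sketch of what re-joining a URL could look like after character-based splitting has broken it into pieces. The class UrlJoinSketch, the helper startsUrl, and the rule that a URL ends at the next whitespace token are all assumptions made for illustration; they are not taken from the LanguageTool source. The protocol list is the one given for getProtocols().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the LanguageTool implementation.
public class UrlJoinSketch {

    // Protocols named in the getProtocols() documentation.
    static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

    // True if the four tokens starting at index i spell out "<protocol>://".
    static boolean startsUrl(List<String> tokens, int i) {
        return i + 3 < tokens.size()
            && PROTOCOLS.contains(tokens.get(i))
            && tokens.get(i + 1).equals(":")
            && tokens.get(i + 2).equals("/")
            && tokens.get(i + 3).equals("/");
    }

    // Merges the pieces of a fully specified URL back into a single token.
    // Assumption: the URL runs until the next whitespace token or the end of the list.
    public static List<String> joinUrls(List<String> tokens) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (startsUrl(tokens, i)) {
                StringBuilder url = new StringBuilder();
                while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
                    url.append(tokens.get(i));
                    i++;
                }
                result.add(url.toString());
            } else {
                result.add(tokens.get(i));
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pieces = Arrays.asList(
            "See", " ", "http", ":", "/", "/", "foobar", ".", "org", " ", "now");
        System.out.println(joinUrls(pieces));
    }
}
```

Because the sketch stops only at whitespace, punctuation directly after a URL would be absorbed into the URL token; a fuller version would need a rule for trailing punctuation. A host name without a protocol, such as foobar.org on its own, is left split, which matches the class summary's condition that the URL must include a protocol.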