Class WordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
All Implemented Interfaces:
Tokenizer
Direct Known Subclasses:
ArabicWordTokenizer, BelarusianWordTokenizer, BretonWordTokenizer, CatalanWordTokenizer, CrimeanTatarWordTokenizer, DutchWordTokenizer, EnglishWordTokenizer, EsperantoWordTokenizer, FrenchWordTokenizer, GalicianWordTokenizer, GermanWordTokenizer, GoogleStyleWordTokenizer, GreekWordTokenizer, KhmerWordTokenizer, PersianWordTokenizer, PolishWordTokenizer, PortugueseWordTokenizer, RomanianWordTokenizer, RussianWordTokenizer, SpanishWordTokenizer, TagalogWordTokenizer

public class WordTokenizer extends Object implements Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, but it recognizes URLs and keeps each one in a single token, provided the URL is fully specified, including a protocol (like http://foobar.org).
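The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the LanguageTool implementation: the class name SimpleWordTokenizerSketch, the delimiter set, and the trailing-punctuation handling are assumptions made for illustration, and only the protocol list (http, https, ftp) comes from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class SimpleWordTokenizerSketch {

    // Assumed delimiter set for illustration; the real TOKENIZING_CHARACTERS value is not shown on this page.
    private static final String DELIMITERS = " \t\n.,;:!?()/";
    // Assumed set of characters that are split off again when they end a URL.
    private static final String TRAILING_PUNCTUATION = ".,;:!?)";
    // Protocol names as listed under getProtocols().
    private static final List<String> PROTOCOLS = List.of("http", "https", "ftp");

    /** Splits text on the delimiters, keeps each delimiter as its own token, then re-joins URLs. */
    public static List<String> tokenize(String text) {
        List<String> raw = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            raw.add(st.nextToken());
        }
        return joinUrls(raw);
    }

    /** Merges the pieces of a protocol-prefixed URL back into one token. */
    private static List<String> joinUrls(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!urlStartsAt(i, tokens)) {
                out.add(tokens.get(i++));
                continue;
            }
            // Collect everything up to the next whitespace token.
            StringBuilder url = new StringBuilder();
            while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
                url.append(tokens.get(i++));
            }
            // Give sentence punctuation at the end of the URL its own tokens again.
            String joined = url.toString();
            int end = joined.length();
            while (end > 0 && TRAILING_PUNCTUATION.indexOf(joined.charAt(end - 1)) >= 0) {
                end--;
            }
            out.add(joined.substring(0, end));
            for (int k = end; k < joined.length(); k++) {
                out.add(String.valueOf(joined.charAt(k)));
            }
        }
        return out;
    }

    /** True if the tokens at position i spell a known protocol followed by "://" and at least one more token. */
    private static boolean urlStartsAt(int i, List<String> tokens) {
        return i + 4 < tokens.size()
            && PROTOCOLS.contains(tokens.get(i))
            && tokens.get(i + 1).equals(":")
            && tokens.get(i + 2).equals("/")
            && tokens.get(i + 3).equals("/");
    }
}
```

In this sketch, an address without a protocol, such as www.foobar.org, stays split at every dot, which mirrors the "fully specified including a protocol" condition in the description.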
  • Field Details

    • PROTOCOLS

      private static final List<String> PROTOCOLS
    • URL_CHARS

      private static final Pattern URL_CHARS
    • DOMAIN_CHARS

      private static final Pattern DOMAIN_CHARS
    • NO_PROTOCOL_URL

      private static final Pattern NO_PROTOCOL_URL
    • E_MAIL

      private static final Pattern E_MAIL
    • CURRENCY_SYMBOLS

      private static final Pattern CURRENCY_SYMBOLS
    • CURRENCY_VALUE

      private static final Pattern CURRENCY_VALUE
    • CURRENCY_EXPRESSION

      private static final Pattern CURRENCY_EXPRESSION
    • TOKENIZING_CHARACTERS

      private static final String TOKENIZING_CHARACTERS
    • REMOVED_EMOJI

      protected final String REMOVED_EMOJI
  • Constructor Details

    • WordTokenizer

      public WordTokenizer()
  • Method Details

    • getProtocols

      public static List<String> getProtocols()
      Get the protocols that the tokenizer knows about.
      Returns:
      currently http, https, and ftp
      Since:
      2.1
    • isUrl

      public static boolean isUrl(String token)
      Since:
      3.0
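This page gives no description for isUrl, so the following is only a sketch of one plausible check: a token counts as a URL if it starts with a known protocol followed by "://". The class UrlCheckSketch and the method looksLikeUrl are hypothetical names, and the actual method may accept other forms as well.

```java
import java.util.List;

public class UrlCheckSketch {

    // Protocol names as listed under getProtocols().
    private static final List<String> PROTOCOLS = List.of("http", "https", "ftp");

    /** Assumed rule: the token must begin with "<protocol>://". */
    public static boolean looksLikeUrl(String token) {
        for (String protocol : PROTOCOLS) {
            if (token.startsWith(protocol + "://")) {
                return true;
            }
        }
        return false;
    }
}
```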
    • isEMail

      public static boolean isEMail(String token)
      Since:
      3.5
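The E_MAIL pattern itself is not shown on this page. As a rough illustration of a pattern-based check of this kind, the sketch below uses a deliberately simple regular expression; the class EMailCheckSketch, the method looksLikeEMail, and the pattern are all assumptions and will not match every valid address.

```java
import java.util.regex.Pattern;

public class EMailCheckSketch {

    // Assumed, simplified pattern: local part, "@", then a domain with at least one dot.
    private static final Pattern SIMPLE_E_MAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    /** True only if the whole token matches the simplified pattern. */
    public static boolean looksLikeEMail(String token) {
        return SIMPLE_E_MAIL.matcher(token).matches();
    }
}
```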
    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface Tokenizer
    • getTokenizingCharacters

      public String getTokenizingCharacters()
      Returns:
      The string containing the characters used by the tokenizer to tokenize words.
      Since:
      2.5
    • joinEMailsAndUrls

      protected List<String> joinEMailsAndUrls(List<String> list)
    • joinEMails

      protected List<String> joinEMails(List<String> list)
      Since:
      3.5
    • joinUrls

      protected List<String> joinUrls(List<String> l)
    • urlStartsAt

      private boolean urlStartsAt(int i, List<String> l)
    • isProtocol

      private boolean isProtocol(String token)
    • urlEndsAt

      private boolean urlEndsAt(int i, List<String> l, String urlQuote)
    • isCurrencyExpression

      public boolean isCurrencyExpression(String token)
    • splitCurrencyExpression

      public List<String> splitCurrencyExpression(String token)
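Neither isCurrencyExpression nor splitCurrencyExpression is described on this page. The sketch below shows one way such a pair could work, assuming a currency expression is a symbol directly attached to a number, in either order. The class CurrencySplitSketch, the symbol set (dollar, euro, pound, yen), and the number format are assumptions, not the values of CURRENCY_SYMBOLS or CURRENCY_VALUE.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CurrencySplitSketch {

    // Assumed symbol set: dollar, euro, pound, yen.
    private static final String SYMBOLS = "[$\u20AC\u00A3\u00A5]";
    // Assumed number format: digits with optional "." or "," separated groups.
    private static final String VALUE = "\\d+(?:[.,]\\d+)*";
    // Symbol followed by value, or value followed by symbol.
    private static final Pattern EXPRESSION =
        Pattern.compile("(" + SYMBOLS + ")(" + VALUE + ")|(" + VALUE + ")(" + SYMBOLS + ")");

    /** True if the whole token is a symbol attached to a number. */
    public static boolean isCurrencyExpression(String token) {
        return EXPRESSION.matcher(token).matches();
    }

    /** Splits a currency expression into its two parts; other tokens are returned unchanged. */
    public static List<String> splitCurrencyExpression(String token) {
        Matcher m = EXPRESSION.matcher(token);
        if (!m.matches()) {
            return List.of(token);
        }
        if (m.group(1) != null) {
            return List.of(m.group(1), m.group(2)); // symbol first
        }
        return List.of(m.group(3), m.group(4));     // value first
    }
}
```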
    • replaceEmojis

      public List<String> replaceEmojis(String s)
    • restoreEmojis

      public List<String> restoreEmojis(List<String> tokens, List<String> removedEmojis)