Class EnglishWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.en.EnglishWordTokenizer
All Implemented Interfaces:
Tokenizer

public class EnglishWordTokenizer extends WordTokenizer
Since:
2.5
  • Field Details

    • wordCharacters

      private static final String wordCharacters
    • tokenizerPattern

      private static final Pattern tokenizerPattern
    • SINGLE_QUOTE

      private static final Pattern SINGLE_QUOTE
    • CURLY_QUOTE

      private static final Pattern CURLY_QUOTE
    • APOSTYPEW

      private static final Pattern APOSTYPEW
    • APOSTYPOG

      private static final Pattern APOSTYPOG
    • SOFT_HYPHEN

      private static final Pattern SOFT_HYPHEN
    • patternList

      private static final List<Pattern> patternList
    • enTokenizingChars

      private final String enTokenizingChars
  • Constructor Details

    • EnglishWordTokenizer

      public EnglishWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Tokenizes text. The English tokenizer differs from the standard one in two respects:
      1. it does not treat a hyphen as part of a word when the hyphen is at the end of that word;
      2. it treats the en dash as a tokenizing character, because English uses it without surrounding whitespace.
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class WordTokenizer
      Parameters:
      text - String of words to tokenize.
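      The two differences described above can be illustrated with a minimal, self-contained sketch. It is not the LanguageTool implementation: the class name TokenizerSketch, the DELIMS constant, and the simplified delimiter set are all assumptions made for illustration. It only shows the idea of splitting on the en dash (U+2013) and detaching a hyphen that ends a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; names and delimiter set are hypothetical,
// not taken from org.languagetool.tokenizers.en.EnglishWordTokenizer.
public class TokenizerSketch {

    // Assumed delimiters: whitespace, some punctuation, and the en dash (U+2013).
    private static final String DELIMS = " \t\n.,;:!?()\"\u2013";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true keeps delimiters (spaces, en dash, ...) as tokens.
        StringTokenizer st = new StringTokenizer(text, DELIMS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // A hyphen at the end of a word is split off as its own token.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                // Hyphens inside a word (e.g. "post-war") stay part of the word.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("pre- and post-war, 1990\u20131995")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

      In this sketch, "pre-" yields the tokens "pre" and "-", "post-war" stays a single token, and "1990–1995" is split into "1990", "–", and "1995". With the real class, the equivalent call would be new EnglishWordTokenizer().tokenize(text), whose exact token boundaries depend on the library's own character set.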
    • wordsToAdd

      private List<String> wordsToAdd(String s)