Package org.languagetool.tokenizers.en
Class EnglishWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.en.EnglishWordTokenizer
- All Implemented Interfaces:
Tokenizer
- Since:
- 2.5
-
Field Summary
Fields
Modifier and Type | Field | Description
private static final Pattern
private static final Pattern
private static final Pattern
private final String
private static final Pattern
private static final Pattern
private static final Pattern
private static final String
Fields inherited from class org.languagetool.tokenizers.WordTokenizer
REMOVED_EMOJI
Constructor Summary
Constructors -
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isCurrencyExpression, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls, replaceEmojis, restoreEmojis, splitCurrencyExpression
-
Field Details
-
wordCharacters
- See Also:
-
tokenizerPattern
-
SINGLE_QUOTE
-
CURLY_QUOTE
-
APOSTYPEW
-
APOSTYPOG
-
SOFT_HYPHEN
-
patternList
-
enTokenizingChars
-
-
Constructor Details
-
EnglishWordTokenizer
public EnglishWordTokenizer()
-
-
Method Details
-
tokenize
Tokenizes text. The English tokenizer differs from the standard one in two respects:
- it does not treat a hyphen as part of the word if the hyphen is at the end of the word;
- it includes the en dash as a tokenizing character, as it is used without surrounding whitespace in English.
- Specified by:
tokenize in interface Tokenizer
- Overrides:
tokenize in class WordTokenizer
- Parameters:
text - String of words to tokenize.
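The two differences described for tokenize can be illustrated with a small, self-contained sketch. This is not the LanguageTool implementation: the class name EnTokenizerSketch, the delimiter set DELIMS, and the trailing-hyphen handling are all assumptions made for illustration, built only on java.util.StringTokenizer. It treats the en dash (U+2013) as a delimiter and splits a hyphen off a word only when the hyphen is at the end of the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; NOT the actual EnglishWordTokenizer code.
class EnTokenizerSketch {

  // Hypothetical delimiter set: whitespace, some punctuation,
  // plus the en dash (U+2013) as an extra tokenizing character.
  private static final String DELIMS = " \t\n\r.,;:!?()\"\u2013";

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true: delimiters are returned as tokens of their own
    StringTokenizer st = new StringTokenizer(text, DELIMS, true);
    while (st.hasMoreTokens()) {
      String token = st.nextToken();
      // Find where the run of trailing hyphens starts (keep at least one char).
      int end = token.length();
      while (end > 1 && token.charAt(end - 1) == '-') {
        end--;
      }
      // The word itself; hyphens inside the word are left untouched.
      tokens.add(token.substring(0, end));
      // Each trailing hyphen becomes a separate token.
      for (int i = end; i < token.length(); i++) {
        tokens.add("-");
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // "pre-" loses its trailing hyphen, "post-war" stays whole,
    // and the en dash separates the two years.
    System.out.println(tokenize("pre- and post-war, 1990\u20131995"));
  }
}
```

In this sketch, whitespace and punctuation are returned as tokens too, because the delimiters are kept; whether the real class behaves the same way for every input is not something this page states.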
-
wordsToAdd
-