Package org.languagetool.tokenizers.fr
Class FrenchWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.fr.FrenchWordTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace each get their own
token. Hyphens and apostrophes receive special treatment for French.
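The behaviour described above, where each punctuation or whitespace character becomes a token of its own, can be illustrated with a minimal stdlib-only sketch. It does not call this class. The class name DelimiterTokenizerSketch and the DELIMITERS set are hypothetical and chosen only for illustration; the character set actually used comes from getTokenizingCharacters() and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; not part of LanguageTool.
class DelimiterTokenizerSketch {

  // Hypothetical delimiter set, assumed for this example.
  private static final String DELIMITERS = " .,;:!?()\n\t";

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true: every delimiter character is returned as its own token
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("Bonjour, le monde !"));
  }
}
```

With this sketch, "Bonjour, le monde !" yields eight tokens: the three words, the comma, the exclamation mark, and each of the three spaces.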
-
Field Summary
Fields
Modifier and Type    Field    Description
private static final Pattern
private static final Pattern
private final String
private static final Pattern
(package private) static final int
private static final Pattern
(package private) static final Pattern[]
private static final Pattern
private static final Pattern
private static final Pattern
private static final Pattern
private static final Pattern
private static final Pattern
private static final String
Fields inherited from class org.languagetool.tokenizers.WordTokenizer
REMOVED_EMOJI
-
Constructor Summary
Constructors
-
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isCurrencyExpression, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls, replaceEmojis, restoreEmojis, splitCurrencyExpression
-
Field Details
-
wordCharacters
-
tokenizerPattern
-
TYPEWRITER_APOSTROPHE
-
TYPOGRAPHIC_APOSTROPHE
-
NEARBY_HYPHENS
-
HYPHENS
-
DECIMAL_POINT
-
DECIMAL_COMMA
-
SPACE_DIGITS0
-
SPACE_DIGITS
-
SPACE_DIGITS2
-
doNotSplit
-
frTokenizingChars
-
maxPatterns
static final int maxPatterns
-
patterns
-
-
Constructor Details
-
FrenchWordTokenizer
public FrenchWordTokenizer()
-
-
Method Details
-
tokenize
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class WordTokenizer
Parameters:
text - Text to tokenize
Returns:
List of tokens. Note: a special string xxFR_APOSxx is used to replace apostrophes, and xxFR_HYPHENxx to replace hyphens.
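The placeholder idea mentioned in the note can be sketched as follows. This is a generic stdlib-only illustration of the technique, not this class's implementation. Only the two placeholder strings come from the note. The class name PlaceholderTokenizerSketch, the DELIMITERS set, and the rule "protect an apostrophe or hyphen only when it sits between two letters" are assumptions made for the example. The actual French rules for where apostrophes and hyphens split are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Illustrative sketch only; not part of LanguageTool.
class PlaceholderTokenizerSketch {

  // Placeholder strings as named in the note above.
  private static final String APOS = "xxFR_APOSxx";
  private static final String HYPHEN = "xxFR_HYPHENxx";

  // Assumed rule: protect ' and - only when surrounded by letters.
  private static final Pattern INNER_APOS = Pattern.compile("(?<=\\p{L})'(?=\\p{L})");
  private static final Pattern INNER_HYPHEN = Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

  // Hypothetical delimiter set; it includes ' and - so unprotected ones split.
  private static final String DELIMITERS = " .,;:!?'-";

  static List<String> tokenize(String text) {
    // Step 1: hide word-internal apostrophes and hyphens behind placeholders.
    String protectedText = INNER_APOS.matcher(text).replaceAll(APOS);
    protectedText = INNER_HYPHEN.matcher(protectedText).replaceAll(HYPHEN);

    // Step 2: split on delimiters, returning each delimiter as its own token.
    List<String> tokens = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(protectedText, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      // Step 3: restore the original characters inside each token.
      tokens.add(st.nextToken().replace(APOS, "'").replace(HYPHEN, "-"));
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("aujourd'hui, arc-en-ciel - oui"));
  }
}
```

In this sketch "aujourd'hui" and "arc-en-ciel" each stay whole because their apostrophe and hyphens are hidden during the split, while the free-standing hyphen between spaces is returned as a separate token.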
-
wordsToAdd
-