Package org.languagetool.tokenizers.ca
Class CatalanWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.ca.CatalanWordTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizes a sentence into words. Punctuation marks and whitespace characters each get their own token.
Special treatment for hyphens and apostrophes in Catalan.
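The basic behavior described above, where every punctuation or whitespace character becomes a separate token, can be illustrated with a small standalone sketch. It uses only the JDK and is not the LanguageTool implementation: the class name DelimiterTokenizerSketch and its short delimiter list are assumptions made for this example, and the Catalan-specific handling of hyphens and apostrophes is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; not the LanguageTool implementation.
class DelimiterTokenizerSketch {

    // Hypothetical, deliberately small delimiter set chosen for this example.
    private static final String DELIMITERS = " \t\n.,;:!?()\"";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true makes each delimiter character its own token
        // instead of silently discarding it.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [Bon, " ", dia, ",", " ", Joan, "!"] (shown here with quotes for clarity)
        System.out.println(tokenize("Bon dia, Joan!"));
    }
}
```

Because the delimiters are returned as tokens, joining the resulting list without a separator reproduces the input text exactly.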
-
Field Summary
Fields
Modifier and Type, Field
private static final Pattern APOSTROF_RECTE
private static final Pattern APOSTROF_RECTE_1
private static final Pattern APOSTROF_RODO
private static final Pattern APOSTROF_RODO_1
private static final Pattern DECIMAL_COMMA
private static final Pattern DECIMAL_POINT
private static final Pattern ELA_GEMINADA
private static final Pattern ELA_GEMINADA_UPPERCASE
private static final Pattern HYPHEN_L
static final CatalanWordTokenizer INSTANCE
private static final int maxPatterns
private static final Pattern PATTERN_1
private static final Pattern PATTERN_2
private static final Pattern PATTERN_3
private static final Pattern PATTERN_4
private static final Pattern PATTERN_5
private static final Pattern PATTERN_6
private static final Pattern PATTERN_7
private static final Pattern PATTERN_8
private final Pattern[] patterns
private static final String PF
private static final Pattern SPACE_DIGITS
private static final Pattern SPACE_DIGITS0
private static final Pattern SPACE_DIGITS2
private static final Pattern tokenizerPattern
private static final String wordCharacters
Fields inherited from class org.languagetool.tokenizers.WordTokenizer
REMOVED_EMOJI
-
Constructor Summary
Constructors -
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isCurrencyExpression, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls, replaceEmojis, restoreEmojis, splitCurrencyExpression
-
Field Details
-
INSTANCE
-
wordCharacters
-
tokenizerPattern
-
PF
-
PATTERN_1
-
PATTERN_2
-
PATTERN_3
-
PATTERN_4
-
PATTERN_5
-
PATTERN_6
-
PATTERN_7
-
PATTERN_8
-
maxPatterns
private static final int maxPatterns
-
patterns
-
ELA_GEMINADA
-
ELA_GEMINADA_UPPERCASE
-
APOSTROF_RECTE
-
APOSTROF_RODO
-
APOSTROF_RECTE_1
-
APOSTROF_RODO_1
-
DECIMAL_POINT
-
DECIMAL_COMMA
-
SPACE_DIGITS0
-
SPACE_DIGITS
-
SPACE_DIGITS2
-
HYPHEN_L
-
-
Constructor Details
-
CatalanWordTokenizer
public CatalanWordTokenizer()
-
-
Method Details
-
tokenize
- Specified by:
tokenize in interface Tokenizer
- Overrides:
tokenize in class WordTokenizer
- Parameters:
text - Text to tokenize
- Returns:
List of tokens. Note: the special string xxCA_APOSxx is used to replace apostrophes, and xxCA_HYPHENxx to replace hyphens.
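The placeholder strings mentioned in the note suggest a protect, split, restore approach. The following is a minimal sketch of that general technique, under stated assumptions: the class PlaceholderTokenizerSketch, its delimiter list, and the two regular expressions are hypothetical and do not reproduce the rules in PATTERN_1 through PATTERN_8. In particular, this sketch keeps a protected word such as "L'home" as one token, which may differ from how the real tokenizer divides it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Illustrative sketch only; not the LanguageTool implementation.
class PlaceholderTokenizerSketch {

    // Placeholder strings as named in the documentation above.
    private static final String APOS = "xxCA_APOSxx";
    private static final String HYPHEN = "xxCA_HYPHENxx";

    // Apostrophe and hyphen are delimiters, so unprotected ones are split off.
    private static final String DELIMITERS = " \t\n.,;:!?'-";

    // Hypothetical rule: an elided one-letter form (l', d', ...) before a letter.
    private static final Pattern ELISION =
        Pattern.compile("\\b([dlmnstDLMNST])'(?=\\p{L})");
    // Hypothetical rule: a hyphen directly between two letters.
    private static final Pattern INNER_HYPHEN =
        Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    static List<String> tokenize(String text) {
        // 1. Protect word-internal marks by swapping them for placeholders.
        String guarded = ELISION.matcher(text).replaceAll("$1" + APOS);
        guarded = INNER_HYPHEN.matcher(guarded).replaceAll(HYPHEN);

        // 2. Split on delimiters, keeping each delimiter as its own token.
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(guarded, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            // 3. Restore the original characters inside each token.
            tokens.add(st.nextToken().replace(APOS, "'").replace(HYPHEN, "-"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L'home diu: anem-hi!"));
    }
}
```

The placeholders contain no delimiter characters, so a protected apostrophe or hyphen survives the split step, while a free-standing quote or dash is still emitted as its own token.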
-
wordsToAdd
-