Package org.languagetool.tokenizers.pt
Class PortugueseWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.pt.PortugueseWordTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace each get their own token.
- Since:
- 3.6
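As an illustration of the behavior described above, the following minimal sketch splits a sentence so that every whitespace and punctuation character becomes its own token. It uses only java.util.StringTokenizer and is not the implementation of this class. The class name DelimiterTokenizerSketch and the delimiter set are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterTokenizerSketch {

    // Assumed delimiter set for the example; the real tokenizer's set is larger.
    static final String DELIMITERS = " \t\n\r.,;:!?()\"";

    // Splits text into words and keeps each delimiter character as its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true makes StringTokenizer emit delimiters as one-character tokens.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tokens: "Bom", " ", "dia", ",", " ", "mundo", "!"
        for (String token : tokenize("Bom dia, mundo!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because the delimiters are returned rather than dropped, concatenating the tokens reproduces the original sentence.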
-
Field Summary
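Fields
Modifier and Type  Field
private static final Pattern  COLON_NUMBERS_PATTERN
private static final String  COLON_NUMBERS_REPL
private static final Pattern  DATE_PATTERN
private static final String  DATE_PATTERN_REPL
private static final Pattern  DECIMAL_COMMA_PATTERN
private static final String  DECIMAL_COMMA_REPL
private static final char  DECIMAL_COMMA_SUBST
private static final Pattern  DECIMAL_SPACE_PATTERN
private static final Pattern  DOTTED_NUMBERS_PATTERN
private static final String  DOTTED_NUMBERS_REPL
private static final Pattern  DOTTED_ORDINALS_PATTERN
private static final String  DOTTED_ORDINALS_REPL
private static final Pattern  HYPHEN_PATTERN
private static final String  HYPHEN_REPL
private static final Pattern  HYPHEN_SUBST
private static final String  HYPHEN_SUBST_TEXT
private static final Pattern  NEARBY_HYPHENS_PATTERN
private static final String  NEARBY_HYPHENS_REPL
private static final char  NON_BREAKING_COLON_SUBST
private static final char  NON_BREAKING_DOT_SUBST
private static final char  NON_BREAKING_SPACE_SUBST
private final PortugueseTagger  tagger
private final String  wordChars
private final String  wordCharsLeftEdge
private final String  wordCharsRightEdge
private final Pattern  wordPattern
Fields inherited from class org.languagetool.tokenizers.WordTokenizer
REMOVED_EMOJI
-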
Constructor Summary
Constructors
PortugueseWordTokenizer()
-
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isCurrencyExpression, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls, replaceEmojis, restoreEmojis, splitCurrencyExpression
-
Field Details
-
tagger
-
DECIMAL_COMMA_SUBST
private static final char DECIMAL_COMMA_SUBST
- See Also:
-
NON_BREAKING_SPACE_SUBST
private static final char NON_BREAKING_SPACE_SUBST
- See Also:
-
NON_BREAKING_DOT_SUBST
private static final char NON_BREAKING_DOT_SUBST
- See Also:
-
NON_BREAKING_COLON_SUBST
private static final char NON_BREAKING_COLON_SUBST
- See Also:
-
HYPHEN_SUBST_TEXT
- See Also:
-
HYPHEN_SUBST
-
DECIMAL_COMMA_PATTERN
-
DECIMAL_COMMA_REPL
- See Also:
-
DECIMAL_SPACE_PATTERN
-
DOTTED_NUMBERS_PATTERN
-
DOTTED_NUMBERS_REPL
- See Also:
-
COLON_NUMBERS_PATTERN
-
COLON_NUMBERS_REPL
- See Also:
-
DATE_PATTERN
-
DATE_PATTERN_REPL
- See Also:
-
DOTTED_ORDINALS_PATTERN
-
DOTTED_ORDINALS_REPL
- See Also:
-
HYPHEN_PATTERN
-
HYPHEN_REPL
-
NEARBY_HYPHENS_PATTERN
-
NEARBY_HYPHENS_REPL
-
wordChars
-
wordCharsLeftEdge
- See Also:
-
wordCharsRightEdge
- See Also:
-
wordPattern
-
-
Constructor Details
-
PortugueseWordTokenizer
public PortugueseWordTokenizer()
-
-
Method Details
-
tokenize
- Specified by:
tokenize in interface Tokenizer
- Overrides:
tokenize in class WordTokenizer
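The *_PATTERN, *_REPL, and *_SUBST fields listed under Field Details suggest a protect-then-restore approach: a character that should not split a token (such as the comma in a decimal number) is swapped for a placeholder before splitting and swapped back afterwards. This page does not show the field values or the method body, so the sketch below is only an assumption about how such a step could work. The class name DecimalCommaSketch, the placeholder code point, the regular expression, and the delimiter set are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class DecimalCommaSketch {

    // Hypothetical placeholder: a private-use code point unlikely to occur in real text.
    static final char DECIMAL_COMMA_SUBST = '\uE001';
    // Hypothetical pattern: a comma directly between two digits.
    static final Pattern DECIMAL_COMMA_PATTERN = Pattern.compile("(\\d),(\\d)");
    // Replacement keeps both digits and puts the placeholder where the comma was.
    static final String DECIMAL_COMMA_REPL = "$1" + DECIMAL_COMMA_SUBST + "$2";

    static List<String> tokenize(String text) {
        // Step 1: protect decimal commas so they are not treated as delimiters.
        String protectedText = DECIMAL_COMMA_PATTERN.matcher(text).replaceAll(DECIMAL_COMMA_REPL);

        // Step 2: split on an assumed delimiter set, keeping delimiters as tokens.
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(protectedText, " ,.", true);
        while (st.hasMoreTokens()) {
            // Step 3: restore the original comma inside each token.
            tokens.add(st.nextToken().replace(DECIMAL_COMMA_SUBST, ','));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "3,50" stays one token, while the comma after "euros" is split off.
        for (String token : tokenize("Custa 3,50 euros, certo.")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

The same three steps would apply to the other placeholder fields (dots, colons, non-breaking spaces), each with its own pattern and substitute character.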
-
wordsToAdd
-