Package org.languagetool.tokenizers
Class WordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
- All Implemented Interfaces:
Tokenizer
- Direct Known Subclasses:
ArabicWordTokenizer, BelarusianWordTokenizer, BretonWordTokenizer, CatalanWordTokenizer, CrimeanTatarWordTokenizer, DutchWordTokenizer, EnglishWordTokenizer, EsperantoWordTokenizer, FrenchWordTokenizer, GalicianWordTokenizer, GermanWordTokenizer, GoogleStyleWordTokenizer, GreekWordTokenizer, KhmerWordTokenizer, PersianWordTokenizer, PolishWordTokenizer, PortugueseWordTokenizer, RomanianWordTokenizer, RussianWordTokenizer, SpanishWordTokenizer, TagalogWordTokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens.
The tokenizer is a fairly simple character-based one, though it recognizes
URLs and keeps each one in a single token if it is fully specified, including
a protocol (like http://foobar.org).
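The character-based splitting described above can be illustrated with a minimal sketch. It is not this class's implementation: the class name SplitSketch, the method split, and the DELIMITERS string are hypothetical and chosen only for illustration (the real character set is the one exposed through TOKENIZING_CHARACTERS). The sketch relies on java.util.StringTokenizer with returnDelims set to true, so every delimiter character comes back as its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class SplitSketch {
  // Hypothetical delimiter set for this sketch only; the class keeps its
  // actual set in TOKENIZING_CHARACTERS.
  static final String DELIMITERS = " \t\n.,;:!?()\"/";

  // Splits text so that words, punctuation and whitespace are separate tokens.
  static List<String> split(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true returns each delimiter character as a one-char token.
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(split("Hello, world!"));
  }
}
```

With a split like this alone, a URL such as http://foobar.org is broken into several tokens, so keeping it in one token requires an additional joining step after the split.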
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and Type, Method, Description
getProtocols: Get the protocols that the tokenizer knows about.
boolean isCurrencyExpression(String token)
static boolean isEMail
private boolean isProtocol(String token)
static boolean isUrl
joinEMails(List<String> list)
joinEMailsAndUrls(List<String> list)
restoreEmojis(List<String> tokens, List<String> removedEmojis)
splitCurrencyExpression(String token)
private boolean urlEndsAt
private boolean urlStartsAt(int i, List<String> l)
-
Field Details
-
PROTOCOLS
-
URL_CHARS
-
DOMAIN_CHARS
-
NO_PROTOCOL_URL
-
E_MAIL
-
CURRENCY_SYMBOLS
-
CURRENCY_VALUE
-
CURRENCY_EXPRESSION
-
TOKENIZING_CHARACTERS
-
REMOVED_EMOJI
-
Constructor Details
-
WordTokenizer
public WordTokenizer()
-
-
Method Details
-
getProtocols
Get the protocols that the tokenizer knows about.
- Returns:
- currently http, https, and ftp
- Since:
- 2.1
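As a rough illustration of how a protocol list like this can be used to put a fully specified URL back into one token, here is a minimal sketch. It is an assumption-laden example, not the library's code: UrlJoinSketch, KNOWN_PROTOCOLS, startsUrl, and joinUrlTokens are hypothetical names, and the sketch simply ends a URL at the next whitespace token, which is cruder than what a real tokenizer would need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UrlJoinSketch {
  // The three protocols named in the documentation above.
  static final List<String> KNOWN_PROTOCOLS = Arrays.asList("http", "https", "ftp");

  // True if tokens i..i+3 look like "<protocol>", ":", "/", "/".
  static boolean startsUrl(int i, List<String> tokens) {
    return i + 3 < tokens.size()
        && KNOWN_PROTOCOLS.contains(tokens.get(i))
        && tokens.get(i + 1).equals(":")
        && tokens.get(i + 2).equals("/")
        && tokens.get(i + 3).equals("/");
  }

  // Merges the tokens of each detected URL into a single token.
  // Assumption for this sketch: a URL runs until the next whitespace token.
  static List<String> joinUrlTokens(List<String> tokens) {
    List<String> result = new ArrayList<>();
    int i = 0;
    while (i < tokens.size()) {
      if (startsUrl(i, tokens)) {
        StringBuilder url = new StringBuilder();
        while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
          url.append(tokens.get(i));
          i++;
        }
        result.add(url.toString());
      } else {
        result.add(tokens.get(i));
        i++;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "See", " ", "http", ":", "/", "/", "foobar", ".", "org", " ", "now");
    System.out.println(joinUrlTokens(tokens));
  }
}
```

A token sequence without a leading protocol, such as "foobar", ".", "org", is left untouched by this sketch, which matches the documented requirement that a URL be fully specified, including a protocol.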
-
isUrl
- Since:
- 3.0
-
isEMail
- Since:
- 3.5
-
tokenize
-
getTokenizingCharacters
- Returns:
- The string containing the characters used by the tokenizer to tokenize words.
- Since:
- 2.5
-
joinEMailsAndUrls
-
joinEMails
- Since:
- 3.5
-
joinUrls
-
urlStartsAt
-
isProtocol
-
urlEndsAt
-
isCurrencyExpression
-
splitCurrencyExpression
-
replaceEmojis
-
restoreEmojis
-