Package org.languagetool.tokenizers
Class WordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens.
The tokenizer is a fairly simple character-based one, but it recognizes
URLs and keeps each one in a single token if it is fully specified, including
a protocol (like
http://foobar.org).
- Author:
- Daniel Naber
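As a rough illustration of the character-based approach described above, the following self-contained sketch splits text on a set of delimiter characters and keeps each delimiter as its own token. It is not the library's implementation: the class name `SimpleWordTokenizerSketch` and the `DELIMITERS` set are assumptions made for this example, and URL handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class SimpleWordTokenizerSketch {

  // Hypothetical delimiter set chosen for illustration only; the characters
  // the real tokenizer uses are exposed by getTokenizingCharacters().
  static final String DELIMITERS = " \t\n\r.,;:!?()[]\"'";

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true makes every delimiter character a token of its own,
    // so punctuation and whitespace are not dropped.
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Print one token per line, in brackets so whitespace tokens stay visible.
    for (String token : tokenize("Hello, world!")) {
      System.out.println("[" + token + "]");
    }
  }
}
```

With the library itself, the equivalent call would be `new WordTokenizer().tokenize(text)`, which additionally keeps fully specified URLs in one token.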
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type / Method / Description
- getProtocols: Get the protocols that the tokenizer knows about.
- static boolean isEMail
- static boolean isUrl
- joinEMails(List<String> list)
- joinEMailsAndUrls(List<String> list)
-
Constructor Details
-
WordTokenizer
public WordTokenizer()
-
-
Method Details
-
getProtocols
Get the protocols that the tokenizer knows about.
- Returns:
- currently http, https, and ftp
- Since:
- 2.1
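To show how a protocol list like this can be used to spot fully specified URLs, here is a minimal sketch. The class `UrlCheckSketch` and the helper `looksLikeUrl` are hypothetical names introduced for this example; the check is a simplification and is not claimed to match what `isUrl` does internally.

```java
import java.util.Arrays;
import java.util.List;

class UrlCheckSketch {

  // The three protocols named in the documentation above.
  static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

  // Hypothetical helper: true if the token starts with "<protocol>://"
  // and has at least one character after that prefix.
  static boolean looksLikeUrl(String token) {
    for (String protocol : PROTOCOLS) {
      String prefix = protocol + "://";
      if (token.startsWith(prefix) && token.length() > prefix.length()) {
        return true;
      }
    }
    return false;
  }
}
```

A token such as `http://foobar.org` passes this check, while `foobar.org` does not because it lacks a protocol, which mirrors the "fully specified" requirement in the class description.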
-
isUrl
- Since:
- 3.0
-
isEMail
- Since:
- 3.5
-
tokenize
-
getTokenizingCharacters
- Returns:
- The string containing the characters used by the tokenizer to tokenize words.
- Since:
- 2.5
-
joinEMailsAndUrls
-
joinEMails
- Since:
- 3.5
-
joinUrls
-