Package org.languagetool.tokenizers.crh
Class CrimeanTatarWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.crh.CrimeanTatarWordTokenizer
All Implemented Interfaces:
Tokenizer
Field Summary
Fields inherited from class org.languagetool.tokenizers.WordTokenizer
REMOVED_EMOJI
Constructor Summary
Constructors
CrimeanTatarWordTokenizer()
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, isCurrencyExpression, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls, replaceEmojis, restoreEmojis, splitCurrencyExpression
Constructor Details
CrimeanTatarWordTokenizer
public CrimeanTatarWordTokenizer()
Method Details
getTokenizingCharacters
Overrides:
getTokenizingCharacters in class WordTokenizer
Returns:
The string containing the characters used by the tokenizer to tokenize words.
tokenize
Tokenizes text. The Crimean Tatar tokenizer differs from the standard one in two respects:
- it does not treat a hyphen as part of the word when the hyphen is at the end of the word;
- it includes the en dash as a tokenizing character, because the en dash is used without surrounding whitespace in Crimean Tatar.
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class WordTokenizer
Parameters:
text - String of words to tokenize.
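The two differences described for tokenize can be illustrated with a small standalone sketch. This is not the LanguageTool implementation: the class name CrhTokenizeSketch, the delimiter set, and the splitting logic are assumptions chosen only to show the documented behavior (the en dash acts as a delimiter, and a word-final hyphen is split off as its own token).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration class; it does not depend on LanguageTool.
class CrhTokenizeSketch {

    // Assumed delimiter set for this sketch: whitespace, some basic
    // punctuation, and the en dash (U+2013). The real tokenizer's
    // character set is whatever getTokenizingCharacters() returns.
    private static final String DELIMITERS = " \t\n\r.,;:!?()\"\u2013";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true keeps every delimiter as a one-character token.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // A hyphen at the end of a word is not part of the word:
                // emit the word and the hyphen as separate tokens.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                // A hyphen inside a word stays attached to it.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Aqmescit\u2013Kefe")); // en dash without whitespace
        System.out.println(tokenize("ana-baba"));           // word-internal hyphen
        System.out.println(tokenize("ana- baba"));          // word-final hyphen
    }
}
```

In this sketch an en dash between two words yields three tokens, a word-internal hyphen leaves the word as a single token, and a word-final hyphen becomes a token of its own. Whitespace is returned as a token as well, since delimiters are kept.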