Class CrimeanTatarWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.crh.CrimeanTatarWordTokenizer
All Implemented Interfaces:
Tokenizer

public class CrimeanTatarWordTokenizer extends WordTokenizer
  • Constructor Details

    • CrimeanTatarWordTokenizer

      public CrimeanTatarWordTokenizer()
  • Method Details

    • getTokenizingCharacters

      public String getTokenizingCharacters()
      Overrides:
      getTokenizingCharacters in class WordTokenizer
      Returns:
      The string of characters at which the tokenizer splits text into words.
    • tokenize

      public List<String> tokenize(String text)
      Tokenizes text. The Crimean Tatar tokenizer differs from the standard one in two respects:
      1. it does not treat a hyphen as part of the word when the hyphen is at the end of the word;
      2. it includes the en dash as a tokenizing character, because Crimean Tatar writes it without surrounding whitespace.
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class WordTokenizer
      Parameters:
      text - String of words to tokenize.
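
      The two differences described for tokenize can be illustrated with a minimal, self-contained sketch. This is not the LanguageTool implementation: the class name DashTokenizingSketch, the DELIMITERS string, and the use of java.util.StringTokenizer are assumptions made only for illustration, and the real tokenizer may handle many more characters and edge cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration class; not part of LanguageTool.
final class DashTokenizingSketch {

    // Assumed delimiter set: whitespace, some punctuation, and the en dash (U+2013).
    // The plain hyphen is deliberately NOT a delimiter, so word-internal hyphens survive.
    private static final String DELIMITERS = " \t\n\r.,;:!?()\u2013";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: every delimiter is returned as its own one-character token.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // A hyphen at the end of a word is split off into a separate token.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

      In this sketch, DashTokenizingSketch.tokenize("bir\u2013eki") yields three tokens (the two words and the en dash between them), "qara-deniz" stays a single token because its hyphen is word-internal, and "soz- bar" has its trailing hyphen split off from "soz".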