Class CatalanWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.ca.CatalanWordTokenizer
All Implemented Interfaces:
Tokenizer

public class CatalanWordTokenizer extends WordTokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. Hyphens and apostrophes receive special treatment for Catalan.
  • Field Details

    • INSTANCE

      public static final CatalanWordTokenizer INSTANCE
    • wordCharacters

      private static final String wordCharacters
    • tokenizerPattern

      private static final Pattern tokenizerPattern
    • PF

      private static final String PF
    • PATTERN_1

      private static final Pattern PATTERN_1
    • PATTERN_2

      private static final Pattern PATTERN_2
    • PATTERN_3

      private static final Pattern PATTERN_3
    • PATTERN_4

      private static final Pattern PATTERN_4
    • PATTERN_5

      private static final Pattern PATTERN_5
    • PATTERN_6

      private static final Pattern PATTERN_6
    • PATTERN_7

      private static final Pattern PATTERN_7
    • PATTERN_8

      private static final Pattern PATTERN_8
    • maxPatterns

      private static final int maxPatterns
    • patterns

      private final Pattern[] patterns
    • ELA_GEMINADA

      private static final Pattern ELA_GEMINADA
    • ELA_GEMINADA_UPPERCASE

      private static final Pattern ELA_GEMINADA_UPPERCASE
    • APOSTROF_RECTE

      private static final Pattern APOSTROF_RECTE
    • APOSTROF_RODO

      private static final Pattern APOSTROF_RODO
    • APOSTROF_RECTE_1

      private static final Pattern APOSTROF_RECTE_1
    • APOSTROF_RODO_1

      private static final Pattern APOSTROF_RODO_1
    • DECIMAL_POINT

      private static final Pattern DECIMAL_POINT
    • DECIMAL_COMMA

      private static final Pattern DECIMAL_COMMA
    • SPACE_DIGITS0

      private static final Pattern SPACE_DIGITS0
    • SPACE_DIGITS

      private static final Pattern SPACE_DIGITS
    • SPACE_DIGITS2

      private static final Pattern SPACE_DIGITS2
    • HYPHEN_L

      private static final Pattern HYPHEN_L
  • Constructor Details

    • CatalanWordTokenizer

      public CatalanWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class WordTokenizer
      Parameters:
      text - Text to tokenize
      Returns:
      List of tokens. Note: a special string xxCA_APOSxx is used to replace apostrophes, and xxCA_HYPHENxx to replace hyphens.
    • wordsToAdd

      private List<String> wordsToAdd(String s)
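
The note on tokenize mentions that the placeholder strings xxCA_APOSxx and xxCA_HYPHENxx stand in for apostrophes and hyphens. The following is a minimal, self-contained sketch of that general placeholder technique, not the actual CatalanWordTokenizer implementation. The class name PlaceholderTokenizerSketch and all four regular expressions are assumptions made for illustration; the real patterns (PATTERN_1 to PATTERN_8, APOSTROF_RECTE and so on) are not shown on this page and are likely to differ. Only the straight apostrophe is handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: a hypothetical class, not part of LanguageTool.
public class PlaceholderTokenizerSketch {

    // Placeholder strings as named in the tokenize() documentation.
    static final String APOS = "xxCA_APOSxx";
    static final String HYPHEN = "xxCA_HYPHENxx";

    // Assumed rule: a straight apostrophe or a hyphen between two letters is word-internal.
    private static final Pattern INNER_APOS = Pattern.compile("(?<=\\p{L})'(?=\\p{L})");
    private static final Pattern INNER_HYPHEN = Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // Generic splitter: a run of word characters, or any single other character.
    // The underscore is a word character so the placeholders stay inside one token.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+|[^\\p{L}\\p{N}_]");

    // Assumed Catalan-specific rule: split an elided one-letter form (l', d', s', ...) from its word.
    private static final Pattern ELISION = Pattern.compile("([dlmnstDLMNST]')(\\p{L}.*)");

    public static List<String> tokenize(String text) {
        // Step 1: hide word-internal apostrophes and hyphens behind placeholders.
        String hidden = INNER_APOS.matcher(text).replaceAll(APOS);
        hidden = INNER_HYPHEN.matcher(hidden).replaceAll(HYPHEN);

        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(hidden);
        while (m.find()) {
            // Step 2: restore the original characters in each raw token.
            String token = m.group().replace(APOS, "'").replace(HYPHEN, "-");
            // Step 3: apply the language-specific split, if it matches the whole token.
            Matcher elision = ELISION.matcher(token);
            if (elision.matches()) {
                tokens.add(elision.group(1));
                tokens.add(elision.group(2));
            } else {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L'home d'aigua."));
    }
}
```

With these assumed rules, "L'home d'aigua." yields the tokens L', home, a single space, d', aigua and the final period, while a hyphenated form such as "Dona-li" stays in one token. The sketch makes no claim about how the real class splits hyphenated forms.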