Class FrenchWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.fr.FrenchWordTokenizer
All Implemented Interfaces:
Tokenizer

public class FrenchWordTokenizer extends WordTokenizer
Tokenizes a sentence into words. Punctuation and whitespace characters each get their own token. Hyphens and apostrophes receive special treatment for French.
  • Field Details

    • wordCharacters

      private static final String wordCharacters
    • tokenizerPattern

      private static final Pattern tokenizerPattern
    • TYPEWRITER_APOSTROPHE

      private static final Pattern TYPEWRITER_APOSTROPHE
    • TYPOGRAPHIC_APOSTROPHE

      private static final Pattern TYPOGRAPHIC_APOSTROPHE
    • NEARBY_HYPHENS

      private static final Pattern NEARBY_HYPHENS
    • HYPHENS

      private static final Pattern HYPHENS
    • DECIMAL_POINT

      private static final Pattern DECIMAL_POINT
    • DECIMAL_COMMA

      private static final Pattern DECIMAL_COMMA
    • SPACE_DIGITS0

      private static final Pattern SPACE_DIGITS0
    • SPACE_DIGITS

      private static final Pattern SPACE_DIGITS
    • SPACE_DIGITS2

      private static final Pattern SPACE_DIGITS2
    • doNotSplit

      private static final List<String> doNotSplit
    • frTokenizingChars

      private final String frTokenizingChars
    • maxPatterns

      static final int maxPatterns
    • patterns

      static final Pattern[] patterns
  • Constructor Details

    • FrenchWordTokenizer

      public FrenchWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class WordTokenizer
      Parameters:
      text - Text to tokenize
      Returns:
      List of tokens. Note: a special string xxFR_APOSxx is used to replace apostrophes, and xxFR_HYPHENxx to replace hyphens.
    • wordsToAdd

      private List<String> wordsToAdd(String s)
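
The note on tokenize mentions the placeholder strings xxFR_APOSxx and xxFR_HYPHENxx. The following is a minimal, self-contained sketch of how such a placeholder scheme can work in general; it is not the source of FrenchWordTokenizer. The class name PlaceholderTokenizerSketch and the patterns INNER_APOSTROPHE, INNER_HYPHEN and TOKEN are hypothetical, only the typewriter apostrophe is handled, and the real class very likely applies further French-specific rules (elisions, numbers, the doNotSplit list) that are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderTokenizerSketch {

    // Placeholder strings named in the tokenize documentation.
    private static final String APOS = "xxFR_APOSxx";
    private static final String HYPHEN = "xxFR_HYPHENxx";

    // Hypothetical patterns (assumptions, not taken from the library):
    // a typewriter apostrophe or a hyphen directly between two letters.
    private static final Pattern INNER_APOSTROPHE =
            Pattern.compile("(?<=\\p{L})'(?=\\p{L})");
    private static final Pattern INNER_HYPHEN =
            Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // A token is a run of letters, digits or underscores,
    // or else any single other character (punctuation, whitespace).
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}_]+|.", Pattern.DOTALL);

    public static List<String> tokenize(String text) {
        // 1. Protect word-internal apostrophes and hyphens with placeholders,
        //    so the generic splitting step does not cut the word apart.
        String guarded = INNER_APOSTROPHE.matcher(text).replaceAll(APOS);
        guarded = INNER_HYPHEN.matcher(guarded).replaceAll(HYPHEN);

        // 2. Split into tokens, then 3. restore the original characters.
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(guarded);
        while (m.find()) {
            tokens.add(m.group().replace(APOS, "'").replace(HYPHEN, "-"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C'est un arc-en-ciel."));
    }
}
```

In this sketch a word such as "arc-en-ciel" survives as a single token while the final period and each space become tokens of their own. Whether FrenchWordTokenizer keeps or splits a given apostrophe or hyphen depends on its own patterns, so the sketch only illustrates the protect, split, restore sequence.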