Class DefaultLanguageIdentifier
java.lang.Object
org.languagetool.language.identifier.LanguageIdentifier
org.languagetool.language.identifier.DefaultLanguageIdentifier
Identify the language of a text. Note that some languages might never be
detected because they are close to another language. Language variants like
en-US or en-GB are not detected; the result will be
en for those.
By default, only the first 1000 characters of a text are considered.
Email signatures that use \n-- \n as a delimiter are ignored.
Since:
2.9
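The preprocessing described above can be pictured with a minimal sketch. It is not the library's implementation: the class and method names below are hypothetical, and the order of the two steps (signature removal first, then truncation) is an assumption. It only illustrates cutting a text at the \n-- \n delimiter and keeping the first 1000 characters.

```java
final class TextPreprocessingSketch {
    // Delimiter named in the class description.
    static final String SIGNATURE_DELIMITER = "\n-- \n";
    // Default limit named in the class description.
    static final int DEFAULT_MAX_LENGTH = 1000;

    /** Drops everything from the first signature delimiter onward. */
    static String removeEmailSignature(String text) {
        int pos = text.indexOf(SIGNATURE_DELIMITER);
        return pos == -1 ? text : text.substring(0, pos);
    }

    /** Keeps at most maxLength characters (UTF-16 code units). */
    static String shorten(String text, int maxLength) {
        return text.length() <= maxLength ? text : text.substring(0, maxLength);
    }

    /** Assumed order: strip the signature, then truncate. */
    static String prepare(String text) {
        return shorten(removeEmailSignature(text), DEFAULT_MAX_LENGTH);
    }
}
```

With this sketch, prepare("Hello world\n-- \nJohn") yields "Hello world", and a 1500-character text without a signature is cut to 1000 characters.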
Nested Class Summary
Nested classes/interfaces inherited from class org.languagetool.language.identifier.LanguageIdentifier
LanguageIdentifier.ParsedLanguageLists
Field Summary
Fields
private static final int CONSIDER_ONLY_PREFERRED_THRESHOLD
private static final float FASTTEXT_CONFIDENCE_THRESHOLD
private FastTextDetector fastTextDetector
private final AtomicInteger fasttextInitCounter
private final com.optimaize.langdetect.LanguageDetector languageDetector
private static final org.slf4j.Logger logger
private static final double MINIMAL_CONFIDENCE
private NGramDetector ngram
private static final int SHORT_ALGO_THRESHOLD
private final com.optimaize.langdetect.text.TextObjectFactory textObjectFactory
Fields inherited from class org.languagetool.language.identifier.LanguageIdentifier
COMMON_WORDS_LANG_IDENTIFIER, maxLength, NON_LATIN_CHARS_LANGUAGES, REMOVE_EMAIL_SIGNATURE_FILTER, REMOVE_MENTION_FILTER, REMOVE_NON_BREAKING_SPACES_FILTER, REMOVE_URL_FILTER, SCORE_THRESHOLD, UNICODE_BASED_LANG_IDENTIFIER
Constructor Summary
Constructors
DefaultLanguageIdentifier()
DefaultLanguageIdentifier(int maxLength)
Method Summary
Methods
detectLanguage(String cleanText)
DetectedLanguage detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp, boolean limitOnPreferredLangs)
detectLanguageCode(String text, List<String> preferredLangs, boolean limitOnPreferredLangs)
(package private) void enableFasttext(File fasttextBinary, File fasttextModel)
(package private) void enableNgrams(File ngramDir)
List<DetectedLanguage> getDetectedLanguageScores(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp, boolean limitOnPreferredLangs, int count)
getFasttextInitCounter - For test only
boolean isFastTextEnabled()
private List<com.optimaize.langdetect.profiles.LanguageProfile> loadProfiles(List<String> langCodes)
private void reinitFasttextAfterFailure
void setFastTextDetector(FastTextDetector fastTextDetector) - For test only
Methods inherited from class org.languagetool.language.identifier.LanguageIdentifier
cleanAndShortenText, getHighestScoringResult, getOrderedScores, prepareDetectLanguage
Field Details
logger
private static final org.slf4j.Logger logger

MINIMAL_CONFIDENCE
private static final double MINIMAL_CONFIDENCE

SHORT_ALGO_THRESHOLD
private static final int SHORT_ALGO_THRESHOLD

CONSIDER_ONLY_PREFERRED_THRESHOLD
private static final int CONSIDER_ONLY_PREFERRED_THRESHOLD

ignoreLangCodes

externalLangCodes

FASTTEXT_CONFIDENCE_THRESHOLD
private static final float FASTTEXT_CONFIDENCE_THRESHOLD

languageDetector
private final com.optimaize.langdetect.LanguageDetector languageDetector

textObjectFactory
private final com.optimaize.langdetect.text.TextObjectFactory textObjectFactory

fasttextInitCounter

fastTextDetector

ngram
Constructor Details
DefaultLanguageIdentifier
DefaultLanguageIdentifier()

DefaultLanguageIdentifier
DefaultLanguageIdentifier(int maxLength)
Parameters:
maxLength - the maximum number of characters that will be considered; a lower limit can help with performance. Don't use values below 100, as this would decrease accuracy.
Throws:
IllegalArgumentException - if maxLength is less than 10
Since:
4.2
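The argument check documented for maxLength can be sketched as follows. This is an illustration only: the class name, method names, and message text are hypothetical, and only the two limits (10 as the hard minimum, 100 as the recommended minimum) come from the description above.

```java
final class MaxLengthCheckSketch {
    // Values below this are rejected, per the documented Throws clause.
    static final int MIN_ALLOWED = 10;
    // Values below this are accepted but discouraged, per the parameter description.
    static final int RECOMMENDED_MIN = 100;

    /** Returns maxLength unchanged, or throws if it is below the hard minimum. */
    static int validateMaxLength(int maxLength) {
        if (maxLength < MIN_ALLOWED) {
            throw new IllegalArgumentException(
                "maxLength must be >= " + MIN_ALLOWED + ": " + maxLength);
        }
        return maxLength;
    }

    /** True if the value is legal but likely to reduce detection accuracy. */
    static boolean isBelowRecommended(int maxLength) {
        return maxLength < RECOMMENDED_MIN;
    }
}
```

A caller could use isBelowRecommended to log a warning for values between 10 and 99, which are accepted but may decrease accuracy.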
Method Details
enableFasttext

setFastTextDetector
For test only

getFasttextInitCounter
For test only
Returns:
a counter of how often fasttext has been recreated after a failure

isFastTextEnabled
public boolean isFastTextEnabled()
Since:
5.2

enableNgrams

getLanguageCodes

loadProfiles
private List<com.optimaize.langdetect.profiles.LanguageProfile> loadProfiles(List<String> langCodes) throws IOException
Throws:
IOException

detectLanguage
Specified by:
detectLanguage in class LanguageIdentifier
Parameters:
cleanText - the cleaned text, as returned by LanguageIdentifier.cleanAndShortenText(String)
Returns:
the language, or null if the language could not be identified

detectLanguage
public DetectedLanguage detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp)
Specified by:
detectLanguage in class LanguageIdentifier
Parameters:
cleanText - the cleaned text, as returned by LanguageIdentifier.cleanAndShortenText(String)
noopLangsTmp - list of language codes that are detected but map to the NoopLanguage, which has no rules
Returns:
the language, or null if the language could not be identified
Since:
4.4 (new parameter noopLangs, changed return type to DetectedLanguage)

detectLanguage
@Nullable public DetectedLanguage detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp, boolean limitOnPreferredLangs)
Specified by:
detectLanguage in class LanguageIdentifier
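Because the detectLanguage overloads may return null when no language can be identified, callers need a fallback. The following is a minimal sketch of that pattern, not code from the library: the class name is hypothetical, and the detector argument is a stand-in function returning a language code string in place of a real detectLanguage call.

```java
import java.util.Optional;
import java.util.function.Function;

final class NullableResultSketch {
    /**
     * Returns the detected code, or fallbackCode when the detector yields null.
     * 'detector' is a hypothetical stand-in for a detectLanguage call.
     */
    static String detectOrFallback(Function<String, String> detector,
                                   String cleanText,
                                   String fallbackCode) {
        return Optional.ofNullable(detector.apply(cleanText)).orElse(fallbackCode);
    }
}
```

With a real identifier, the same null check would apply to the returned DetectedLanguage (or language) object before reading anything from it.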
getDetectedLanguageScores
@NotNull public List<DetectedLanguage> getDetectedLanguageScores(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp, boolean limitOnPreferredLangs, int count)
Specified by:
getDetectedLanguageScores in class LanguageIdentifier

reinitFasttextAfterFailure

detectLanguageCode