Class StringTools

java.lang.Object
org.languagetool.tools.StringTools

public final class StringTools extends Object
Tools for working with strings.
  • Field Details

    • NONCHAR

      private static final Pattern NONCHAR
    • WORD_FOR_SPELLER

      private static final Pattern WORD_FOR_SPELLER
    • IS_NUMERIC

      private static final Pattern IS_NUMERIC
    • UPPERCASE_GREEK_LETTERS

      public static final Set<String> UPPERCASE_GREEK_LETTERS
    • LOWERCASE_GREEK_LETTERS

      public static final Set<String> LOWERCASE_GREEK_LETTERS
    • WHITESPACE_ARRAY

      private static final String[] WHITESPACE_ARRAY
    • CHARS_NOT_FOR_SPELLING

      public static final Pattern CHARS_NOT_FOR_SPELLING
    • XML_COMMENT_PATTERN

      private static final Pattern XML_COMMENT_PATTERN
    • XML_PATTERN

      private static final Pattern XML_PATTERN
    • PUNCTUATION_PATTERN

      private static final Pattern PUNCTUATION_PATTERN
    • NOT_WORD_CHARACTER

      private static final Pattern NOT_WORD_CHARACTER
    • NOT_WORD_STR

      private static final Pattern NOT_WORD_STR
    • PATTERN

      private static final Pattern PATTERN
    • DIACRIT_MARKS

      private static final Pattern DIACRIT_MARKS
    • ENGLISH_TITLECASE_EXCEPTIONS

      private static final Set<String> ENGLISH_TITLECASE_EXCEPTIONS
    • PORTUGUESE_TITLECASE_EXCEPTIONS

      private static final Set<String> PORTUGUESE_TITLECASE_EXCEPTIONS
    • FRENCH_TITLECASE_EXCEPTIONS

      private static final Set<String> FRENCH_TITLECASE_EXCEPTIONS
    • SPANISH_TITLECASE_EXCEPTIONS

      private static final Set<String> SPANISH_TITLECASE_EXCEPTIONS
    • GERMAN_TITLECASE_EXCEPTIONS

      private static final Set<String> GERMAN_TITLECASE_EXCEPTIONS
    • DUTCH_TITLECASE_EXCEPTIONS

      private static final Set<String> DUTCH_TITLECASE_EXCEPTIONS
    • ALL_TITLECASE_EXCEPTIONS

      private static final Set<String> ALL_TITLECASE_EXCEPTIONS
  • Constructor Details

    • StringTools

      private StringTools()
  • Method Details

    • assureSet

      public static void assureSet(String s, String varName)
      Throw exception if the given string is null or empty or only whitespace.
    • readStream

      public static String readStream(InputStream stream, String encoding) throws IOException
      Read the text stream using the given encoding.
      Parameters:
      stream - InputStream the stream to be read
      encoding - the stream's character encoding, e.g. utf-8, or null to use the system encoding
      Returns:
      a string with the stream's content, lines separated by \n (note that \n will be added to the last line even if it is not in the stream)
      Throws:
      IOException
      Since:
      2.3
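      The trailing-newline behaviour described above can be illustrated with a minimal sketch that uses only the JDK. This is not the LanguageTool source: the class name ReadStreamSketch is hypothetical, and closing the stream inside the method is an assumption of the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class ReadStreamSketch {

  static String readStream(InputStream stream, String encoding) throws IOException {
    // A null encoding falls back to the platform default charset.
    Charset charset = encoding == null ? Charset.defaultCharset() : Charset.forName(encoding);
    StringBuilder sb = new StringBuilder();
    // Assumption: the sketch closes the stream when it is done.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream, charset))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // readLine() strips the line terminator, so "\n" is re-added to every
        // line, including a last line that had no terminator in the stream.
        sb.append(line).append('\n');
      }
    }
    return sb.toString();
  }
}
```

      With this sketch, a stream containing "a\nb" yields "a\nb\n", and Windows line endings ("\r\n") are normalised to "\n".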
    • isAllUppercase

      public static boolean isAllUppercase(String str)
      Returns true if the given string is made up of all-uppercase characters (ignoring characters for which no upper-/lowercase distinction exists).
    • isAllUppercase

      public static boolean isAllUppercase(List<String> strList)
      Returns true if the given list of strings is made up of all-uppercase words. If the list contains only numbers or punctuation marks, it is not considered all-uppercase.
    • isMixedCase

      public static boolean isMixedCase(String str)
      Returns true if the given string is mixed case, like MixedCase or mixedCase (but not Mixedcase).
      Parameters:
      str - input str
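      One way to read the MixedCase / mixedCase / Mixedcase distinction is as a combination of the other case checks on this page. The sketch below is a JDK-only approximation under that assumption; the class name CaseChecksSketch is hypothetical and the real methods may treat edge cases (for example titlecase letters) differently.

```java
// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class CaseChecksSketch {

  // True if no character is lowercase; caseless characters such as digits
  // and punctuation are ignored.
  static boolean isAllUppercase(String s) {
    return s.chars().noneMatch(Character::isLowerCase);
  }

  // True if at least one character is uppercase.
  static boolean isNotAllLowercase(String s) {
    return s.chars().anyMatch(Character::isUpperCase);
  }

  // True if the first character is uppercase and no later character is.
  static boolean isCapitalizedWord(String s) {
    return s != null && !s.isEmpty()
        && Character.isUpperCase(s.charAt(0))
        && s.chars().skip(1).noneMatch(Character::isUpperCase);
  }

  // Mixed case: neither all-uppercase, nor capitalized, nor all-lowercase.
  static boolean isMixedCase(String s) {
    return !isAllUppercase(s) && !isCapitalizedWord(s) && isNotAllLowercase(s);
  }
}
```

      Under these assumptions, "MixedCase" and "mixedCase" are mixed case, while "Mixedcase", "UPPER" and "lower" are not.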
    • isNotAllLowercase

      public static boolean isNotAllLowercase(String str)
      Returns true if str is not made up of all-lowercase characters (ignoring characters for which no upper-/lowercase distinction exists).
      Since:
      2.5
    • isCapitalizedWord

      @Contract("null -> false") public static boolean isCapitalizedWord(@Nullable String str)
      Parameters:
      str - input string
      Returns:
      true if word starts with an uppercase letter and all other letters are lowercase
    • startsWithUppercase

      public static boolean startsWithUppercase(String str)
      Whether the first character of str is an uppercase character.
    • startsWithLowercase

      public static boolean startsWithLowercase(String str)
      Whether the first character of str is a lowercase character.
      Since:
      4.9
    • allStartWithLowercase

      public static boolean allStartWithLowercase(String str)
    • uppercaseFirstChar

      @Contract("!null -> !null") @Nullable public static String uppercaseFirstChar(@Nullable String str)
      Return str modified so that its first character is now an uppercase character. If str starts with non-alphabetic characters, such as quotes or parentheses, the first character is determined as the first alphabetic character.
    • uppercaseFirstChar

      @Contract("!null, _ -> !null") @Nullable public static String uppercaseFirstChar(@Nullable String str, Language language)
      Like uppercaseFirstChar(String), but handles a special case for Dutch (IJ in e.g. "ijsselmeer" -> "IJsselmeer").
      Parameters:
      language - the language, will be ignored if it's null
      Since:
      2.7
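      Both uppercaseFirstChar overloads skip leading non-alphabetic characters before changing case. The following JDK-only sketch shows that idea together with the Dutch "ij" case. It is not the LanguageTool source: the class name FirstCharSketch is hypothetical, and a boolean dutch flag stands in for the Language parameter.

```java
// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class FirstCharSketch {

  // 'dutch' is an assumed replacement for the Language argument.
  static String uppercaseFirstChar(String str, boolean dutch) {
    if (str == null) {
      return null;
    }
    // Skip leading quotes, parentheses and other non-letters.
    int i = 0;
    while (i < str.length() && !Character.isLetter(str.charAt(i))) {
      i++;
    }
    if (i == str.length()) {
      return str; // no letter found, nothing to change
    }
    // Dutch treats the digraph "ij" as one letter: "ijsselmeer" -> "IJsselmeer".
    if (dutch && str.startsWith("ij", i)) {
      return str.substring(0, i) + "IJ" + str.substring(i + 2);
    }
    return str.substring(0, i) + Character.toUpperCase(str.charAt(i)) + str.substring(i + 1);
  }
}
```

      In this sketch, "(test)" becomes "(Test)", and "ijsselmeer" becomes "IJsselmeer" only when the Dutch flag is set.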
    • collectAllTitleCaseExceptions

      private static Set<String> collectAllTitleCaseExceptions()
    • titlecaseGlobal

      @Contract("!null -> !null") @Nullable public static String titlecaseGlobal(@Nullable String str)
      Title case a string ignoring a list of words. These words are ignored due to titlecasing conventions in the most frequent languages. Differs from convertToTitleCaseIteratingChars(String) in that it is less aggressive, i.e., we do not force titlecase in all caps words (e.g. IDEA does not become Idea). This method behaves the same regardless of the language, and is rather aggressive in its ignoring of words. We can, possibly, in the future, have language-specific titlecasing conventions.
    • lowercaseFirstChar

      @Contract("!null -> !null") @Nullable public static String lowercaseFirstChar(@Nullable String str)
      Return str modified so that its first character is now a lowercase character. If str starts with non-alphabetic characters, such as quotes or parentheses, the first character is determined as the first alphabetic character.
    • lowercaseFirstCharIfCapitalized

      @Contract("!null, -> !null") @Nullable public static String lowercaseFirstCharIfCapitalized(@Nullable String str)
      Return str unchanged if it is not capitalized (see isCapitalizedWord(String)); otherwise return str modified so that its first character is now a lowercase character.
    • changeFirstCharCase

      @Contract("!null, _ -> !null") @Nullable private static String changeFirstCharCase(@Nullable String str, boolean toUpperCase)
      Return str modified so that its first character is now a lowercase or uppercase character, depending on toUpperCase. If str starts with non-alphabetic characters, such as quotes or parentheses, the first character is determined as the first alphabetic character.
    • readerToString

      public static String readerToString(Reader reader) throws IOException
      Throws:
      IOException
    • streamToString

      public static String streamToString(InputStream is, String charsetName) throws IOException
      Throws:
      IOException
    • escapeXML

      public static String escapeXML(String s)
    • escapeForXmlAttribute

      public static String escapeForXmlAttribute(String s)
      Since:
      2.9
    • escapeForXmlContent

      public static String escapeForXmlContent(String s)
      Since:
      2.9
    • escapeHTML

      public static String escapeHTML(String s)
      Escapes these characters: less than, greater than, quote, ampersand.
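      A minimal sketch of escaping the four characters listed above, using only the JDK. The class name EscapeHtmlSketch is hypothetical, and reading "quote" as the double quote (") is an assumption; the real method may use different entities or cover more characters.

```java
// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class EscapeHtmlSketch {

  static String escapeHTML(String s) {
    StringBuilder sb = new StringBuilder(s.length() + 16);
    // Walking the input once avoids double-escaping the "&" of an entity
    // that was just written.
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '<':
          sb.append("&lt;");
          break;
        case '>':
          sb.append("&gt;");
          break;
        case '"': // assumption: "quote" means the double quote
          sb.append("&quot;");
          break;
        case '&':
          sb.append("&amp;");
          break;
        default:
          sb.append(c);
      }
    }
    return sb.toString();
  }
}
```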
    • trimWhitespace

      public static String trimWhitespace(String s)
      Filters any whitespace characters. Useful for trimming the contents of token elements that cannot possibly contain any spaces, with the exception for a single space in a word (for example, if the language supports numbers formatted with spaces as single tokens, as Catalan in LanguageTool).
      Parameters:
      s - String to be filtered.
      Returns:
      Filtered s.
    • trimSpecialCharacters

      public static String trimSpecialCharacters(String s)
      Eliminates special (Unicode) characters, e.g. soft hyphens.
      Parameters:
      s - String to filter
      Returns:
      s, with non-(alphanumeric, punctuation, space) characters deleted
      Since:
      4.3
    • addSpace

      public static String addSpace(String word, Language language)
      Adds spaces before words that are not punctuation.
      Parameters:
      word - Word to add the preceding space.
      language - Language of the word (to check typography conventions). Currently, the only language-specific convention implemented is the French one, which omits the space only before '.' and ','; for all other languages, no space is added before any of ,.;:!?
      Returns:
      String containing a space or an empty string.
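      The rule described for the language parameter can be sketched as a small lookup. This is a JDK-only illustration, not the LanguageTool source: the class name AddSpaceSketch is hypothetical, and a boolean french flag stands in for the Language parameter.

```java
// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class AddSpaceSketch {

  // 'french' is an assumed replacement for the Language argument.
  static String addSpace(String word, boolean french) {
    // French typography keeps a space before ; : ! ? so only . and , are
    // listed for it; other languages suppress the space before all six marks.
    String noSpaceBefore = french ? ".," : ",.;:!?";
    if (word.length() == 1 && noSpaceBefore.indexOf(word.charAt(0)) >= 0) {
      return "";
    }
    return " ";
  }
}
```

      In this sketch, "!" gets no preceding space by default but does get one when the French flag is set.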
    • isWhitespace

      public static boolean isWhitespace(String str)
      Checks if a string is whitespace, including:
      • all Unicode whitespace
      • the non-breaking space (U+00A0)
      • the narrow non-breaking space (U+202F)
      • the zero width space (U+200B), used in Khmer
      Parameters:
      str - String to check
      Returns:
      true if the string is a whitespace character
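      The extra code points are listed because java.lang.Character.isWhitespace does not report non-breaking spaces such as U+00A0 and U+202F as whitespace. A JDK-only sketch of the combined check follows; the class name WhitespaceSketch is hypothetical, and accepting empty or multi-character whitespace strings is an assumption of the sketch.

```java
// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class WhitespaceSketch {

  static boolean isWhitespaceChar(int codePoint) {
    return Character.isWhitespace(codePoint)
        || codePoint == 0x00A0   // non-breaking space
        || codePoint == 0x202F   // narrow non-breaking space
        || codePoint == 0x200B;  // zero width space, used in Khmer
  }

  // Assumption: a string counts as whitespace when every code point does.
  static boolean isWhitespace(String str) {
    return str.codePoints().allMatch(WhitespaceSketch::isWhitespaceChar);
  }
}
```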
    • isNonBreakingWhitespace

      public static boolean isNonBreakingWhitespace(String str)
      Checks if a string is the non-breaking whitespace (U+00A0).
      Since:
      2.1
    • isPositiveNumber

      public static boolean isPositiveNumber(char ch)
      Parameters:
      ch - Character to check
      Returns:
      True if the character is a positive number (decimal digit from 1 to 9).
    • isEmpty

      public static boolean isEmpty(@Nullable String str)
      Helper method to replace calls to "".equals().
      Parameters:
      str - String to check
      Returns:
      true if string is empty or null
    • filterXML

      public static String filterXML(String str)
      Simple XML filtering for XML tags.
      Parameters:
      str - XML string to be filtered.
      Returns:
      Filtered string without XML tags.
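      Simple tag filtering of this kind is usually done with regular expressions. The sketch below is a JDK-only illustration with assumed patterns, not the patterns LanguageTool actually uses; the class name FilterXmlSketch is hypothetical, and stripping comments first is an assumption suggested by the XML_COMMENT_PATTERN field.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class FilterXmlSketch {

  // Assumed patterns: comments may span lines, tags contain no angle brackets.
  private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
  private static final Pattern TAG = Pattern.compile("<[^<>]+>");

  static String filterXML(String str) {
    // Remove comments first so that "<" or ">" inside them cannot confuse tag matching.
    String withoutComments = COMMENT.matcher(str).replaceAll("");
    return TAG.matcher(withoutComments).replaceAll("");
  }
}
```

      A lone "<" with no closing ">" is left untouched by this sketch, so text such as "1 < 2" survives filtering.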
    • hasDiacritics

      public static boolean hasDiacritics(String str)
    • removeDiacritics

      public static String removeDiacritics(String str)
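      hasDiacritics and removeDiacritics are undocumented here. A common JDK technique for this kind of operation is to decompose the text (NFD) and then look for or drop combining marks. The sketch below shows that technique as an assumption about the general approach, not as the LanguageTool implementation; the class name DiacriticsSketch is hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class DiacriticsSketch {

  // \p{M} matches Unicode combining marks.
  private static final Pattern MARKS = Pattern.compile("\\p{M}+");

  static String removeDiacritics(String str) {
    // NFD splits a precomposed letter into base letter plus combining mark(s).
    String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
    return MARKS.matcher(decomposed).replaceAll("");
  }

  static boolean hasDiacritics(String str) {
    String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
    return MARKS.matcher(decomposed).find();
  }
}
```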
    • normalizeNFKC

      public static String normalizeNFKC(String str)
    • normalizeNFC

      public static String normalizeNFC(String str)
    • preserveCase

      public static String preserveCase(String inputString, String modelString)
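      Apply to inputString the casing of modelString.
      Parameters:
      inputString - the string whose casing is changed
      modelString - the string whose casing is applied
      Returns:
      inputString with the casing of modelString applied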
    • asString

      @Nullable public static String asString(CharSequence s)
    • isParagraphEnd

      public static boolean isParagraphEnd(String sentence, boolean singleLineBreaksMarksPara)
      Since:
      4.3
    • loadLines

      public static List<String> loadLines(String path)
      Deprecated.
      use DataBroker#getFromResourceDirAsLines(java.lang.String) instead (NOTE: it won't handle comments)
      Loads file, ignoring comments (lines starting with #).
      Parameters:
      path - path in resource dir
      Since:
      4.6
    • toId

      public static String toId(String input, Language language)
      Will turn a string into a typical rule ID, i.e. uppercase and "_" instead of spaces. All non-ASCII characters are replaced with "_", EXCEPT for Latin-1 ranges U+00C0-U+00D6 and U+00D8-U+00DE. "de" locales have a special implementation (ä => ae, etc.).
      Parameters:
      language - LT language object, used to apply language-specific normalisation rules.
      Since:
      5.1
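      The ID rules above can be approximated with a few JDK string operations. This sketch is not the LanguageTool source: the class name ToIdSketch is hypothetical, a boolean german flag stands in for the Language parameter, and the order of the steps and the handling of individual whitespace characters are assumptions.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class ToIdSketch {

  // 'german' is an assumed replacement for the Language argument ("de" locales).
  static String toId(String input, boolean german) {
    // Uppercase first; Java maps the German sharp s (U+00DF) to "SS" here.
    String s = input.toUpperCase(Locale.ROOT);
    if (german) {
      s = s.replace("\u00C4", "AE")   // A with diaeresis
           .replace("\u00D6", "OE")   // O with diaeresis
           .replace("\u00DC", "UE");  // U with diaeresis
    }
    // Spaces become underscores.
    s = s.replaceAll("\\s", "_");
    // Non-ASCII characters become "_", except U+00C0-U+00D6 and U+00D8-U+00DE.
    return s.replaceAll("[^\\x00-\\x7F\\u00C0-\\u00D6\\u00D8-\\u00DE]", "_");
  }
}
```

      In this sketch "my rule" becomes "MY_RULE", and with the German flag set, "größe" becomes "GROESSE".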
    • isCamelCase

      public static boolean isCamelCase(String token)
      Whether the string is camelCase. Works only with ASCII input and with single words.
      Since:
      5.3
    • isPunctuationMark

      public static boolean isPunctuationMark(String input)
      Whether the string is a punctuation mark
      Since:
      5.5
    • isNotWordCharacter

      public static boolean isNotWordCharacter(String input)
      Whether the string is not a word character.
      Since:
      6.1
    • getDifference

      public static List<String> getDifference(String s1, String s2)
      Difference between two strings (only one difference)
      Returns:
      List of strings: 0: common string at the start; 1: diff in string1; 2: diff in string2; 3: common string at the end
      Since:
      6.2
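      The four-element result can be computed by trimming the longest common prefix and then the longest common suffix. The sketch below is a JDK-only illustration of that idea, not the LanguageTool source; the class name DifferenceSketch is hypothetical, and how the real method splits ambiguous inputs may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; not the LanguageTool implementation.
final class DifferenceSketch {

  static List<String> getDifference(String s1, String s2) {
    int min = Math.min(s1.length(), s2.length());
    // Length of the common prefix.
    int prefix = 0;
    while (prefix < min && s1.charAt(prefix) == s2.charAt(prefix)) {
      prefix++;
    }
    // Length of the common suffix, never overlapping the prefix.
    int suffix = 0;
    while (suffix < min - prefix
        && s1.charAt(s1.length() - 1 - suffix) == s2.charAt(s2.length() - 1 - suffix)) {
      suffix++;
    }
    return Arrays.asList(
        s1.substring(0, prefix),                     // 0: common start
        s1.substring(prefix, s1.length() - suffix),  // 1: diff in s1
        s2.substring(prefix, s2.length() - suffix),  // 2: diff in s2
        s1.substring(s1.length() - suffix));         // 3: common end
  }
}
```

      For "abcXdef" and "abcYYdef" this sketch returns the list "abc", "X", "YY", "def".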
    • makeWrong

      public static String makeWrong(String s)
    • removeTashkeel

      public static String removeTashkeel(String str)
      Return str without tashkeel characters (Arabic diacritical marks).
      Parameters:
      str - input str
    • isNotWordString

      public static boolean isNotWordString(String input)
    • numberOf

      public static int numberOf(String s, String t)
    • convertToTitleCaseIteratingChars

      public static String convertToTitleCaseIteratingChars(String text)
    • isEmoji

      public static boolean isEmoji(String word)
      Checks whether a given String is an emoji with a string length larger than 1.
      Parameters:
      word - to be checked
      Since:
      6.4
    • stringForSpeller

      public static String stringForSpeller(String s)
    • splitCamelCase

      public static String[] splitCamelCase(String input)
    • splitDigitsAtEnd

      public static String[] splitDigitsAtEnd(String input)
    • isAnagram

      public static boolean isAnagram(String string1, String string2)
    • isNumeric

      public static boolean isNumeric(String string)