APOTHEM

Apache Project(s) of the month

Apache Commons Text


Today we are back to the Apache Commons project with a library for string manipulation and comparison, Apache Commons Text. The library can be a lightweight alternative to larger NLP libraries or a good complement to the powerful StringUtils class from the Apache Commons Lang library. What I really like of Commons Text, besides the "base" operations on strings, is the collection of string similarity measures that allow to quantify how much two strings are close to each other, as we will see in detail in the text.similarity section.

The Commons Text library contains a few packages which we will explore one by one. We will just skip the text.lookup and text.matcher packages because they only contain one very specialized class each, which I do not consider very interesting.

Setting up

The latest version of Commons Text is 1.8, which is based on Java 8; therefore, we will use Java 8 in the code examples. As we have done in the Apache Commons Collections article, we will use the library by means of a pom.xml like the following:

<groupId>blog.apothem</groupId>
<artifactId>apache-commons-text-example</artifactId>
<version>0.0.1-SNAPSHOT</version>

<dependencies>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>1.8</version>
    </dependency>
</dependencies>

Again, we will create a class with a main method and write all the code there; the complete code is available as always in the Apothem resources repository.

The text package

In the main package we can find a few classes dedicated to alphabet conversion (AlphabetConverter), case change (CaseUtils), escaping (StringEscapeUtils), substitution (StringSubstitutor), tokenization (StringTokenizer), string building (TextStringBuilder) and word utilities (WordUtils). The StringTokenizer just adds some functionalities to Java's StringTokenizer, and I couldn't find a useful use case for the AlphabetConverter; therefore, we won't see any examples of these classes.

CaseUtils

The CaseUtils class contains only one static method toCamelCase, which converts a string to camel case based on a delimiter and optionally capitalizes the first letter:

String string = "java-programming-language";

// Capitalizes the first letter
System.out.println(CaseUtils.toCamelCase(string, true, '-'));
// Does not capitalize the first letter
System.out.println(CaseUtils.toCamelCase(string, false, '-'));
JavaProgrammingLanguage
javaProgrammingLanguage

StringEscapeUtils

The StringEscapeUtils class contains several methods to escape a string given a language such as HTML or XML, or a format like CSV:

String string = "<a>Department, R&D</a>";

System.out.println(StringEscapeUtils.escapeHtml4(string));
System.out.println(StringEscapeUtils.escapeXml11(string));
System.out.println(StringEscapeUtils.escapeCsv(string));
<a>Department, R&amp;D</a>
<a>Department, R&amp;D</a>
"<a>Department, R&D</a>"

Escaping can also be included in a string builder to interactively escape only some parts of a string:

System.out.println(
        StringEscapeUtils
                .builder(StringEscapeUtils.ESCAPE_HTML4)
                .append("R&D dept: ")
                .escape(string)
                .toString());
R&D dept: <a>Department, R&amp;D</a>

StringSubstitutor

The StringSubstitutor is a different take on string interpolation than String.format(), closer to the approach that Scala and Kotlin use; it requires the creation of a map containing the variables to substitute in a template string, which can be used both with a static method and with a StringSubstitutor instance:

Map<String, String> substitutions = new HashMap<>();
substitutions.put("city", "London");
substitutions.put("country", "England");

// With static method
System.out.println(
        StringSubstitutor.replace("${city} is the capital of ${country}", substitutions));

// With StringSubstitutor object
StringSubstitutor sub = new StringSubstitutor(substitutions);
System.out.println(sub.replace("${city} is the capital of ${country}"));
London is the capital of England

The StringSubstitutor can also create an interpolator, a lookup-based substitutor that can calculate values such as the Base64 encoding of a string:

StringSubstitutor interpolator = StringSubstitutor.createInterpolator();
System.out.println(interpolator.replace("Base64 encoder: ${base64Encoder:Secret password}"));
Base64 encoder: U2VjcmV0IHBhc3N3b3Jk

and many more (check out the docs for a complete list).

TextStringBuilder

This class can be considered as an "enriched" StringBuilder that adds more methods, for instance to append some fixed-length padding to the string that is being created:

String header = "Header";

TextStringBuilder builder = new TextStringBuilder();
System.out.println(builder.appendPadding(header.length(), '-').toString());
Header
------

and much more (e.g. to delete and replace characters, add separators, etc.).

WordUtils

The WordUtils class contains methods for:

  • word/sentence abbreviation:

    String longString = "This is a very long string, from https://www.example.org";
    
    // Take at least 9 characters, cutting to 12 characters if no space is found before
    System.out.println(WordUtils.abbreviate(longString, 9, 12, " ..."));
    // Take at least 10 characters, cutting to 12 characters if no space is found before
    System.out.println(WordUtils.abbreviate(longString, 10, 12, " ..."));
    // Take at least 10 characters, then cut on the first space wherever it is
    System.out.println(WordUtils.abbreviate(longString, 10, -1, " ..."));
    
    This is a ...
    This is a ve ...
    This is a very ...
    
  • initials extraction:

    String allLower = "all lower but ONE";
    
    System.out.println(WordUtils.initials(allLower));
    
    albO
    
  • case change (strange enough, with the exception of camel case conversion that, as we have seen, is in its own CaseUtils class):

    String allLower = "all lower but ONE";
    String allCapitalized = "All Capitalized But ONE";
    
    // Doesn't lowercase the uppercase characters
    System.out.println(WordUtils.capitalize(allLower));
    // Lowercases everything, then capitalizes the first letter of each word
    System.out.println(WordUtils.capitalizeFully(allLower));
    // Lowercases the first letter of each word
    System.out.println(WordUtils.uncapitalize(allCapitalized));
    // Swaps the case of each character
    System.out.println(WordUtils.swapCase(allLower));
    
    All Lower But ONE
    All Lower But One
    all capitalized but oNE
    ALL LOWER BUT one
    
  • paragraph wrapping:

    String longString = "This is a very long string, from https://www.example.org";
    
    // Line length is 10, uses '\n' as a line break, does not break words longer than the line
    System.out.println(
            WordUtils.wrap(longString, 10, "\n", false) + "\n");
    
    // Line length is 10, uses '\n' as a line break, breaks words longer than the line
    System.out.println(
            WordUtils.wrap(longString, 10, "\n", true) + "\n");
    
    // Line length is 10, uses '\n' as a line break, breaks words longer than the line, also breaks on commas
    System.out.println(
            WordUtils.wrap(longString, 10, "\n", true, ",") + "\n");
    
    This is a
    very long
    string,
    from
    https://www.example.org
    
    This is a
    very long
    string,
    from
    https://ww
    w.example.
    org
    
    This is a 
    very long 
    string
     from http
    s://www.ex
    ample.org
    

The text.diff package

This package features some classes to show how a diff between two strings is calculated. In particular, the StringComparator class can be used to create a script that transforms one string into another:

String s1 = "hyperspace";
String s2 = "cyberscape";

StringsComparator comparator = new StringsComparator(s1, s2);
EditScript<Character> script = comparator.getScript();

System.out.println("Longest Common Subsequence length (number of \"keep\" commands): " +
        script.getLCSLength());
System.out.println("Effective modifications (number of \"insert\" and \"delete\" commands): " +
        script.getModifications());
Longest Common Subsequence length (number of "keep" commands): 6
Effective modifications (number of "insert" and "delete" commands): 8

and by implementing a visitor class we can see each insert, delete and keep operation (see the code in the repository for an example implementation).

The text.translate package

In this package we can find classes for translating the characters of a string into different characters, for instance via lookup to transform a string in its leetspeak equivalent:

Map<CharSequence, CharSequence> translation = new HashMap<>();
translation.put("e", "3");
translation.put("l", "1");
translation.put("t", "7");

String s1 = "Let it be!";

LookupTranslator lookupTranslator = new LookupTranslator(translation);
L37 i7 b3!

or to convert a string into Unicode characters and viceversa:

String s1 = "Let it be!";

UnicodeEscaper unicodeEscaper = new UnicodeEscaper();
UnicodeUnescaper unicodeUnescaper = new UnicodeUnescaper();

String unicodeString = unicodeEscaper.translate(s1);

System.out.println(unicodeString);
System.out.println(unicodeUnescaper.translate(unicodeString));
\u004C\u0065\u0074\u0020\u0069\u0074\u0020\u0062\u0065\u0021
Let it be!

The text.similarity package

The classes contained in this package let us perform some basic data integration tasks by giving us a measure of how close two strings are; this is done by introducing the concepts of string distance and string similarity.

A string distance is a metric that obeys the usual metric axioms:

  • d(a, b) >= 0 (non-negativity axiom: a distance cannot be negative);
  • d(a, b) = 0 if and only if a = b (identity axiom);
  • d(a, b) = d(b, a) (symmetry axiom);
  • triangle inequality: d(a, c) ≤ d(a, b) + d(b, c).

A string similarity measure is more "relaxed" than a metric distance since it does neither enforce the triangle inequality nor the identity axiom. Keeping this in mind, the Jaro-Winkler distance, for instance, is not strictly a metric distance but rather a similarity measure; nevertheless, it is common to refer to it as a distance in terms of "complement of similarity", i.e. distance = 1 - similarity. Another thing worth noting is that all the distances and similarities in Commons Text are based on characters except for the cosine distance and similarity, which is based on words (or tokens).

There are many things to consider when choosing what the most appropriate similarity measure could be for a certain use case, although some general suggestions can be made; furthermore, for even more string similarity goodness, other Java libraries such as this one should be checked out.

String similarity

In the following examples we will use the two strings s1 = "hyperspace" and s2 = "cyberscape". The available similarity measures are:

  • the Jaccard similarity, which captures the number of common characters (here a, c, e, p, r, s, y) divided by the number of all characters (here a, b, c, e, h, p, r, s, y), hence 7 / 9 = 0.777777...:

    JaccardSimilarity jaccard = new JaccardSimilarity();
    System.out.println("Jaccard similarity: " + jaccard.apply(s1, s2));
    
    Jaccard similarity: 0.7777777777777778
    
  • the Jaro-Winkler similarity, which takes into account both matching characters and transpositions correcting them with a scaling factor:

    JaroWinklerSimilarity jaroWinkler = new JaroWinklerSimilarity();
    System.out.println("Jaro-Winkler similarity: " + jaroWinkler.apply(s1, s2));
    
    Jaro-Winkler similarity: 0.8250000000000001
    
  • the longest common subsequence (LCS), which finds the longest sequence of characters (not necessarily adjacent), which in this case would be y, e, r, s, a, e of length 6:

    LongestCommonSubsequence lcs = new LongestCommonSubsequence();
    System.out.println("Longest Common Subsequence similarity: " + lcs.apply(s1, s2));
    
    Longest Common Subsequence similarity: 6
    
  • the fuzzy score, a custom matching algorithm based on a specific locale (for lowercasing), where one point is given for every matched character and subsequent matches yield two bonus points; the second string is treated as a query on the first string, so long common character sequences get a higher score:

    FuzzyScore fuzzyScore = new FuzzyScore(Locale.ENGLISH);
    System.out.println("Fuzzy score similarity: " + fuzzyScore.fuzzyScore(s1, s2));
    System.out.println("Fuzzy score similarity: " + fuzzyScore.fuzzyScore(s1, "space"));
    
    Fuzzy score similarity: 1
    Fuzzy score similarity: 13
    

Token similarity

The similarity measures we have seen so far can be used on any string because they are based on single characters; what if instead we want to measure the similarity between sentences, i.e. combinations of words (or tokens) separated by a delimiter such as whitespace? Although a string itself can be seen as a sequence of one-character tokens, tokens usually consist of few characters; therefore, a token distance will look at tokens as the base elements to compare.

The only token similarity measure that Commons Text offers out of the box is the cosine similarity, which works on vectors of tokens (along with their counts) by multiplying them using the dot product and dividing the result by the product of their norms. In order to understand how it works, we need to tokenize and transform a sentence into vectors:

String s1 = "string similarity";
String s2 = "string distance";

Map<CharSequence, Integer> vector1 = new HashMap<>();
Map<CharSequence, Integer> vector2 = new HashMap<>();

for (String token : s1.split(" ")) {
    vector1.put(token, vector1.getOrDefault(token, 0) + 1);
}

for (String token : s2.split(" ")) {
    vector2.put(token, vector2.getOrDefault(token, 0) + 1);
}

Once we have the two vectors, we can just call the cosineSimilarity method from a CosineSimilarity instance:

CosineSimilarity cosine = new CosineSimilarity();
System.out.println("Cosine similarity: " + cosine.cosineSimilarity(vector1, vector2));
Cosine similarity: 0.4999999999999999

The result (besides rounding errors) is 0.5, because we have:

  • two vectors of length 3 (because there are 3 different tokens);
  • each vector has two elements whose value is 1 (as each token appears once in each sentence) and one element whose value is 0 (e.g. "similarity" is not included in the second sentence, and "distance" is not included in the first);
  • the dot product of the vectors [1, 1, 0] and [1, 0, 1] is equal to 1;
  • the norm of each vector (square root of the sum of each element squared) is equal to the square root of 2 (about 1.414), so their product is equal to 2;
  • the cosine similarity, which is the dot product divided by the product of the norms, is 1 / 2 = 0.5.

If we add one more "string" token to the second vector (thus representing a sentence like "string distance string") the cosine similarity will change, because the second vector will become [2, 0, 1] with norm equal to the square root of 5 (about 2.236), and the dot product will become 2, hence the cosine similarity 2 / (1.414 * 2.236) = 0.632...:

vector2.put("string", vector2.getOrDefault("string", 0) + 1);
System.out.println("Cosine similarity: " + cosine.cosineSimilarity(vector1, vector2));
Cosine similarity: 0.6324555320336759

String distance

Given again the two strings s1 = "hyperspace" and s2 = "cyberscape", we have the following string distances:

  • the Hamming distance, i.e. the number of different characters (which requires the strings to be compared to be of equal length):

    HammingDistance hamming = new HammingDistance();
    System.out.println("Hamming distance: " + hamming.apply(s1, s2));
    
    Hamming distance: 4
    
  • the Jaccard distance, defined as 1 - Jaccard similarity, hence 1 - 0.777777... = 0.222222...:

    JaccardDistance jaccard = new JaccardDistance();
    System.out.println("Jaccard distance: " + jaccard.apply(s1, s2));
    
    Jaccard distance: 0.2222222222222222
    
  • the Jaro-Winkler distance, defined as 1 - Jaro-Winkler similarity, hence it should be 1 - 0.825 = 0.175 but it is not at the moment (see the Jira ticket):

    JaroWinklerDistance jaroWinkler = new JaroWinklerDistance();
    System.out.println("Jaro-Winkler distance: " + jaroWinkler.apply(s1, s2));
    
    Jaro-Winkler distance: 0.8250000000000001
    
  • the longest common subsequence (LCS) distance, defined as the difference between the sum of the lengths of the compared strings minus twice their LCS similarity, hence 10 + 10 - 2 * 6 = 8:

    LongestCommonSubsequenceDistance lcs = new LongestCommonSubsequenceDistance();
    System.out.println("Longest Common Subsequence distance: " + lcs.apply(s1, s2));
    
    Longest Common Subsequence distance: 8
    
  • the well-known Levenshtein distance, which takes into account character additions, deletions and substitutions:

    LevenshteinDistance levenshtein = new LevenshteinDistance();
    System.out.println("Levenshtein distance: " + levenshtein.apply(s1, s2));
    
    Levenshtein distance: 4
    

    and can be parameterized with a threshold above which the distance will be returned as -1:

    LevenshteinDistance levenshteinWithThreshold = new LevenshteinDistance(3);
    // Returns -1 since the actual distance, 4, is higher than the threshold
    System.out.println("Levenshtein distance: " + levenshteinWithThreshold.apply(s1, s2));
    
    Levenshtein distance: -1
    

    or used with its Detailed version to retrieve the details of the calculation:

    LevenshteinDetailedDistance levenshteinDetailed = new LevenshteinDetailedDistance();
    System.out.println("Levenshtein detailed distance: " + levenshteinDetailed.apply(s1, s2));
    
    Levenshtein detailed distance: Distance: 4, Insert: 0, Delete: 0, Substitute: 4
    

Token distance

The only available token distance is the cosine distance, simply defined as 1 - cosine similarity. Building from the previous examples with s1 = "string similarity" and s2 = "string distance", where the cosine similarity were 0.5 and about 0.632 respectively, we will have a cosine distance of 0.5 and about 0.368 respectively:

CosineDistance cosine = new CosineDistance();
System.out.println("Cosine distance: " + cosine.apply(s1, s2));
System.out.println("Cosine distance: " + cosine.apply(s1, s2 + " string"));
Cosine distance: 0.5000000000000001
Cosine distance: 0.3675444679663241

The CosineDistance class tokenizes the sentences before calculating the similarity; the tokenization process is similar to what we did in our token similarity example (although it uses a different regex).

Conclusions

Commons Text is a very feature-rich library for string processing, and should be part of everybody's toolkit since it can save from reinventing the wheel in many occasions. It may sometimes be difficult to understand the design choices behind the grouping of some classes and methods, but the library is actively developed and different solutions may be added in future releases.