- Tue 31 March 2020
- projects
- #library, #Apache Commons
With this article I'll take a break from large Apache projects to focus again on a single, easy-to-integrate library. Apache Commons Collections is part of the larger Apache Commons project, a fantastic collection of libraries that cover very specific domains or use cases such as CSV files processing, command line arguments parsing, string processing, and many more.
The Collections component was created to add more collections and to expand on the existing Java collections' capabilities at a time when functional programming (especially in Java) was not yet widespread. Some solutions have actually been included in the most recent versions of Java itself (the FluentIterable, for instance, has been "replaced" by Java 8's Stream), while many other still provide useful additions and good alternatives to other widespread libraries such as Guava. The collection types that first brought me to Apache Commons Collections are the MultiSet
and the MultiMap
, also because they have a native counterpart in Python, and the LRUMap
as a simple cache; I also found the SetUtils
extremely useful, since they added some mathematical (and functional) foundation to operations between sets. Let's go through some examples and see!
Setting up
The latest version of Commons Collections is 4.4, which is based on Java 8; therefore, I will use Java 8 in the following. Although there are newer Java versions, the 8 is the oldest long-term support (LTS) version and is still prevalent in the Big Data world; it makes then for a good balance between "legacy" versions and the newest versions. I will not make use of lambdas in most examples, in order to make the code more readable and all the involved types explicit.
I find that the easiest way to get started with a Java library is by creating a Maven project. All the major IDEs support the creation of Maven projects, so that all you need to do is to give a name to your project and edit the pom.xml
file to include lines like the following:
<groupId>blog.apothem</groupId>
<artifactId>apache-commons-collection-example</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-collections4</artifactId>
<version>4.4</version>
</dependency>
</dependencies>
You are then ready to build the project and start tinkering with the examples. For a very easy start, I would just create a class with a main
method and write all the code there; otherwise, another very good approach is to make the examples as test classes by using a testing framework such as JUnit, but since this would require a bit more setup and some knowledge of testing, I would leave this as an "exercise for the reader". If you just want to grab a ready project with all the examples and run it, check out the Apothem resources repository!
Collection types
Let's now see what types of collection are available. First of all, a general remark that the project makes on its user guide: most collections are not thread-safe unless said otherwise, so they should be synchronized
accordingly when thread safety is needed.
Bag
Bag
s are collections where each element can appear many times. We will skip them because the MultiSet
type is set to replace them (see the release notes of version 4.1).
BidiMap
A BidiMap
is a bidirectional map, namely a map where the key-to-value relation can be inverted, and the values effectively be used as keys in the inverted map. This collection type enables, among the other things, the creation of mappings between IDs as in the following example:
BidiMap<String, Integer> bidiMap = new DualHashBidiMap<>();
bidiMap.put("string-id-1", 1);
bidiMap.put("string-id-3", 3);
bidiMap.put("string-id-2", 2);
System.out.println(String.format(
"Using key '%s' to retrieve value %d", "string-id-1", bidiMap.get("string-id-1")));
System.out.println(String.format(
"Using value %d to retrieve key '%s'", 1, bidiMap.getKey(1)));
Using key 'string-id-1' to retrieve value 1
Using value 1 to retrieve key 'string-id-1'
Here we have used a DualHashBidiMap
, namely a BidiMap
backed by two HashMap
s, but like for the standard Map
we have three different implementations:
BidiMap<String, Integer> bidiMap = new DualHashBidiMap<>();
BidiMap<String, Integer> bidiMap = new DualLinkedHashBidiMap<>();
BidiMap<String, Integer> bidiMap = new DualTreeBidiMap<>();
If we need to use TreeMap
s, there is actually a fourth implementation which is especially optimized for this use case:
BidiMap<String, Integer> bidiMap = new TreeBidiMap<>();
All the implementations offer the same methods, notably:
-
get
to get a value by its key (as in the standardMap
);System.out.println(String.format( "Using key '%s' to retrieve value %d", "string-id-1", bidiMap.get("string-id-1")));
Using key 'string-id-1' to retrieve value 1
-
getKey
to get a key by its value, in "reverse" with respect to the common usage;System.out.println(String.format( "Using value %d to retrieve key '%s'", 1, bidiMap.getKey(1)));
Using value 1 to retrieve key 'string-id-1'
-
entrySet
to get all theMap.Entry
key-value pairs (note: heregetKey
is the usual method ofMap.Entry
):for (Map.Entry<String, Integer> entry : bidiMap.entrySet()) System.out.println(String.format( "Key: '%s', value: %d", entry.getKey(), entry.getValue()));
Key: 'string-id-2', value: 2 Key: 'string-id-3', value: 3 Key: 'string-id-1', value: 1
-
inverseBidiMap
to get the inverted map, as in the following example:for (Map.Entry<Integer, String> entry : bidiMap.inverseBidiMap().entrySet()) System.out.println(String.format( "Key: %d, value: '%s'", entry.getKey(), entry.getValue()));
Key: 1, value: 'string-id-1' Key: 2, value: 'string-id-2' Key: 3, value: 'string-id-3'
One thing to be careful about is that BidiMap
s do not strictly check for value uniqueness, so inverting a BidiMap
with repeated values can lead to unexpected behaviour (see a "bad BidiMap" example on the repo).
MultiMap
A MultiMap
is a Map
whose values are collections (lists or sets) of values, similar to Python's defaultdict. When instantiated with the default constructor, the map is initialized with a capacity of 16 and each list or set with a capacity of 3:
ArrayListValuedHashMap<String, String> multiMap = new ArrayListValuedHashMap<>();
Other constructors are available to manually set both capacities:
int initialMapCapacity = 32;
int initialListCapacity = 10;
ArrayListValuedHashMap<String, String> multiMap = new ArrayListValuedHashMap<>(initialListCapacity);
ArrayListValuedHashMap<String, String> multiMap = new ArrayListValuedHashMap<>(initialMapCapacity, initialListCapacity);
The elements can be inserted in the usual way, but inserting more than one element with the same key does not result in overwriting:
multiMap.put("language", "Java");
multiMap.put("language", "Python");
multiMap.put("language", "C++");
multiMap.put("skill", "Good");
multiMap.put("skill", "Excellent");
for (String key : multiMap.keySet())
System.out.println(String.format(
"Key '%s' contains values: %s", key, String.join(", ", multiMap.get(key))));
Key 'skill' contains values: Good, Excellent
Key 'language' contains values: Java, Python, C++
The keySet
method here is used to obtain the unique keys, whereas the keys
method is used to obtain all the keys as a MultiSet
(with counts):
System.out.println(String.format(
"Keys in the map (with counts): %s", multiMap.keys()));
System.out.println(String.format(
"Unique keys in the map: %s", multiMap.keySet()));
Keys in the map (with counts): [skill:2, language:3]
Unique keys in the map: [skill, language]
Since there are actually five "pairs", the map size is 5
:
System.out.println(String.format(
"The map has %s elements", multiMap.size()));
The map has 5 elements
This is also reflected by listing all the values in the map using the values
method:
System.out.println(String.format(
"Values in the map: %s", multiMap.values()));
Values in the map: [Good, Excellent, Java, Python, C++]
The examples we have seen so far make use of the ArrayListValuedHashMap
implementation, which arranges the values corresponding to each key in an ArrayList
. The HashSetValuedHashMap
can be used to arrange the values in HashSet
s instead.
MultiSet
A MultiSet
is a Set
that allows for multiple occurrences of each element, keeping track of their count. It is similar to Python's Counter and can be very useful when a count of the occurrences of an object in a set is needed, for instance in text processing applications.
The initialization and the insertion of elements are straightforward:
MultiSet<String> multiSet = new HashMultiSet<>();
multiSet.add("Java");
multiSet.add("programming");
multiSet.add("Python");
multiSet.add("programming");
Let's start with counting the objects in the map:
System.out.println(String.format(
"There are %d elements in the multiset", multiSet.size()));
System.out.println(String.format(
"There are %d unique elements in the multiset", multiSet.uniqueSet().size()));
There are 4 elements in the multiset
There are 3 unique elements in the multiset
Since we have inserted four elements, the total count will be 4
; the unique elements instead are three, since "Python"
is repeated twice. We can iterate through the elements of the set in two different ways:
-
by iterating through the "pairs", getting the element itself and its count from each pair:
for (MultiSet.Entry s : multiSet.entrySet()) System.out.println(String.format("\t'%s' appears %d time(s)", s.getElement(), s.getCount()));
-
by iterating through the unique elements, getting the count using each element as a key:
for (String s : multiSet.uniqueSet()) System.out.println(String.format("\t'%s' appears %d time(s)", s, multiSet.getCount(s)));
In both cases we obtain the following result:
'Java' appears 1 time(s)
'programming' appears 2 time(s)
'Python' appears 1 time(s)
CircularFifoQueue
The Queue
type is already present in Java, so the main addition from Commons Collection is a circular FIFO queue, a fixed size queue where the oldest elements are replaced by the newer elements when it reaches capacity. A CircularFifoQueue
is created with the size as a parameter for the constructor:
int size = 2;
CircularFifoQueue<String> queue = new CircularFifoQueue<>(size);
Let's add one element and check 1) how many elements are there in the queue and 2) whether the queue is at full capacity:
queue.add("A");
System.out.println(String.format(
"Elements: %d out of %d, queue at full capacity?: %b",
queue.size(), queue.maxSize(), queue.isAtFullCapacity()));
Elements: 1 out of 2, queue at full capacity?: false
Now let's insert one more element and check again:
queue.add("B");
System.out.println(String.format(
"Elements: %d out of %d, queue at full capacity?: %b",
queue.size(), queue.maxSize(), queue.isAtFullCapacity()));
Elements: 2 out of 2, queue at full capacity?: true
Since the capacity is set to 2
, with two elements the queue is at full capacity. Let's check what the elements are:
System.out.println(new ArrayList<>(queue));
[A, B]
What if we now add one more element?
queue.add("C");
System.out.println(new ArrayList<>(queue));
[B, C]
Obviously the queue is still at full capacity, but the first element "A"
has been dropped to make room for the new element "C"
.
Besides the methods just shown, CircularFifoQueue
has the same methods of a standard Queue
(including poll
and remove
to remove and return the head of the queue, and element
and peek
to return the head of the queue without removing it); there is also an isFull
method to determine whether the queue is full, which always returns false
since a circular queue can never be full.
Trie
A Trie
implements a rather interesting data structure called trie (from retrieval), which is mostly used for search using prefixes. A trie is similar to a tree with an important difference: a node of the trie does not contain the value that is being searched, but the path to the node does; as such, tries are very good at prefix search. The trie implementation that Commons Collections offers is called PatriciaTree
(from the acronym PATRICIA, "Practical Algorithm to Retrieve Information Coded in Alphanumeric"), a type of compressed trie. Let's see an example.
Trie<String, Integer> trie = new PatriciaTrie<>();
trie.put("car", 1);
trie.put("cart", 2);
trie.put("carton", 3);
trie.put("cartridge", 4);
trie.put("cartography", 5);
trie.put("caravan", 6);
trie.put("carbon", 7);
trie.put("castle", 8);
trie.put("coal", 9);
Here we just created a PatriciaTree
and put several keys that share common prefixes; it is important to note that the type of the keys of a PatriciaTree
should extend String
. Let's suppose we want to find the values associated to all the keys starting with "car"
; we can use the prefixMap
method:
System.out.println(String.format(
"Values for prefix '%s': %s", prefix, trie.prefixMap("car")));
Values for prefix 'car': {car=1, caravan=6, carbon=7, cart=2, cartography=5, carton=3, cartridge=4}
If we want to see the keys preceding or following a specific key, we can use the headMap
and tailMap
methods:
System.out.println(String.format(
"Keys before 'cart': %s", trie.headMap("cart")));
System.out.println(String.format(
"Keys after 'cart': %s", trie.tailMap("cart")));
Keys before for 'cart': {car=1, caravan=6, carbon=7}
Keys after 'cart': {cart=2, cartography=5, carton=3, cartridge=4, castle=8, coal=9}
Furthermore, if we want to see the keys between any two entries, we can use the subMap
method:
System.out.println(String.format(
"Keys between 'cart' and 'carton': %s", trie.subMap("cart", "carton")));
Keys between 'cart' and 'carton': {cart=2, cartography=5}
LRUMap
A LRUMap
is basically a limited-size cache where the least recently used (LRU) elements are evicted to leave room for new elements when the cache is full. The LRU policy is quite popular in caches; the concept of "least recently used" may vary, and for instance in Commons Collections it refers to get
and put
operations only: if two elements are inserted one after the other and the former is retrieved once, the latter will become "least recently used". The LRUMap
is initialized with a maxSize
parameter:
int maxSize = 2;
LRUMap<String, Integer> lruMap = new LRUMap<>(maxSize);
Elements are inserted as in a standard Map
:
lruMap.put("value1", 1);
lruMap.put("value2", 2);
and can be iterated through with the usual entrySet
method:
System.out.println(String.format(
"Elements in the map: %s", lruMap.entrySet()));
Elements in the map: [value1=1, value2=2]
If we insert a new element, the "oldest" one (in this case the first inserted, "value1"
) will be removed:
lruMap.put("value3", 3);
System.out.println(String.format(
"Elements in the map: %s", lruMap.entrySet()));
Elements in the map: [value2=2, value3=3]
If we insert one more element but we read "value2"
first, the "oldest" one to be removed will be now "value3"
:
lruMap.get("value2");
lruMap.put("value4", 4);
System.out.println(String.format(
"Elements in the map: %s", lruMap.entrySet()));
Elements in the map: [value2=2, value4=4]
It's important to note that, in order to use LRUMap
as a cache in a real multithreaded environment, we should make it thread-safe; an easy way is by wrapping it with the Collections.synchronizedMap
method.
MultiKeyMap
A MultiKeyMap
is a Map
where the keys are composite, such as firstName
and lastName
in a map that associates persons to document IDs. It can be instantiated like a standard Map
:
MultiKeyMap<String, String> multiKeyMap = new MultiKeyMap<>();
with the important difference that the key can be actually made of up to 5 fields of the same type (String
in the example). New elements can be added with put
by just inserting all the key components one after the other and before the value:
multiKeyMap.put("John", "Doe", "id1");
multiKeyMap.put("Jane", "Doe", "id2");
and then they can be read with get
in a similar fashion:
System.out.println(String.format(
"Using key '%s': %s", "(\"John\", \"Doe\")", multiKeyMap.get("John", "Doe")));
Using key '("John", "Doe")': id1
This class can be very useful when implementing a structure such as a database table, as it makes it simple to use complex keys. It cannot be used, though, if a key must have components of different types.
Utils
Besides new or refined collection types, Commons Collections offers several utility classes. A few examples:
-
FluentIterator
: aStream
precursor to enable some functional programming capabilities; -
IteratorUtils
andIterableUtils
: similar to Python's itertools module, they add many methods to useIterator
s andIterable
s more efficiently; for instance, given two lists:List<String> list1 = Arrays.asList("one", "two", "three"); List<String> list2 = Arrays.asList("A", "B", "C");
we can concatenate them with the
chainedIterable
method (or its iterator-based counterpartchainedIterator
) or interleave them with thezippingIterable
method (or its iterator-based counterpartzippingIterator
):for (String s: IterableUtils.chainedIterable(list1, list2)) System.out.print(s + " "); System.out.println(); for (String s: IterableUtils.zippingIterable(list1, list2)) System.out.print(s + " ");
one two three A B C one A two B three C
-
ListUtils
: list-specific utilities that enable some set-like operations (e.g. union, intersection, sum, subtraction) on lists:System.out.println(String.format( "Union: %s", ListUtils.union(list1, list2))); System.out.println(String.format( "Intersection: %s", ListUtils.intersection(list1, list2))); System.out.println(String.format( "Sum: %s", ListUtils.sum(list1, list2))); System.out.println(String.format( "Subtract: %s", ListUtils.subtract(list1, list2)));
Union: [one, two, three, one, two, four] Intersection: [one, two] Sum: [three, one, two, four] Subtract: [three]
allow to calculate the longest common subsequence between lists:
System.out.println(String.format( "Longest common subsequence: %s", ListUtils.longestCommonSubsequence(list1, list2)));
Longest common subsequence: [one, two]
and add the capability to create lazy lists:
Factory<Integer> factory = () -> new Random().nextInt(); List<Integer> lazyList = ListUtils.lazyList(new ArrayList<>(), factory); System.out.println(String.format( "Lazy list length: %d", lazyList.size())); lazyList.get(2); System.out.println(String.format( "Lazy list length: %d", lazyList.size()));
Lazy list length: 0 Lazy list length: 3
-
SetUtils
: set-specific utilities that enable some set operations (e.g. union, intersection, disjunction, difference) which, differently from the ones that Java already offers, return newSet
s instead of modifying theSet
s they are applied to:Set<String> set1 = new HashSet<>(Arrays.asList("one", "two", "three")); Set<String> set2 = new HashSet<>(Arrays.asList("one", "four")); System.out.println(String.format( "Union: %s", SetUtils.union(set1, set2))); System.out.println(String.format( "Intersection: %s", SetUtils.intersection(set1, set2))); System.out.println(String.format( "Disjunction: %s", SetUtils.disjunction(set1, set2))); System.out.println(String.format( "Difference: %s", SetUtils.difference(set1, set2)));
Union: [one, two, three, four] Intersection: [one] Disjunction: [two, three, four] Difference: [two, three]
-
CollectionUtils
: generic methods onCollection
s, including decorators to make the underlyingCollection
synchronized, unmodifiable, filtered or validated by a predicate, etc.
Conclusions
We have seen that the Commons Collections project provides a really vast amount of collection types, utilities, and decorators to use with existing collections. The project is very easy to integrate and the documentation, although a bit lacking in examples, is covered by an extensive and detailed Javadoc. If you find yourself making use of many collections, Commons Collections can help you in many ways.