APOTHEM

Apache Project(s) of the month

Articles


Apache Commons Collections

With this article I'll take a break from large Apache projects to focus again on a single, easy-to-integrate library. Apache Commons Collections is part of the larger Apache Commons project, a fantastic collection of libraries that cover very specific domains or use cases such as CSV file processing, command line …

Apache Hivemall

With this article we will move a little bit out of the data engineering space and delve into another subject I love: we will explore the world of distributed machine learning with Apache Hivemall. From the project's homepage: Hivemall is a scalable machine learning library that runs on Apache Hive …

Apache CarbonData (part 2)

In the previous article we saw many exciting features that CarbonData offers, but we didn't explore them all; in this article we will try out the streaming capabilities and delve a bit deeper into the data layout, looking at concepts like compaction and partitioning, and the way …

Apache CarbonData

In the last few years I have been working quite extensively with Apache Spark, and I have come to realize that a good storage format goes a long way toward efficiency and speed. For instance, when dealing with large CSV or JSON files, adding an Apache Parquet writing step would …