projects articles

Sun 02 February 2020
projects

Apache Hivemall

With this article we will step a little outside the data engineering space and delve into another subject I love: distributed machine learning with Apache Hivemall. From the project's homepage: Hivemall is a scalable machine learning library that runs on Apache Hive …

Thu 31 October 2019
projects

Apache CarbonData (part 2)

In the previous article we saw many of the exciting features that CarbonData offers, but we didn't explore them all. In this article we will try out the streaming capabilities and delve a bit deeper into the data layout, looking at concepts like compaction and partitioning, and the way …

Mon 30 September 2019
projects

Apache CarbonData

In the last few years I have been working quite extensively with Apache Spark, and I have come to realize that a good storage format goes a long way toward efficiency and speed. For instance, when dealing with large CSV or JSON files, adding an Apache Parquet writing step would …

Sat 31 August 2019
projects

Apache Rya

Since I have been working with Semantic Web technologies for quite some time, I was looking forward to exploring new Apache projects in the area. Apache Rya fit the purpose perfectly, as it is a SPARQL-enabled triplestore for Big Data, promising to scale to billions of triples across multiple nodes …
