The task of transferring and aggregating large data sets for Big Data purposes clearly requires some heavy duty tools. Efficiently indexing a massive data set, for instance, calls for software that can digest tens of gigabytes per hour as if they were a light snack. And transferring data requires software that can move it quickly between today's standard platforms, like Hadoop and the major RDBMSes. As you'll see on the following pages, many of the current leading Big Data tools for transferring and aggregating data sets are open source, a fact that is testament to the growing dominance of open source in the enterprise.
5 Open Source Big Data Tools: Transfer and Aggregate
A survey of the heavy duty open source tools being used in the enterprise for Big Data transfer and aggregation.
1 / 6
The self-proclaimed "de facto standard for search libraries," Lucene offers fast indexing and full-text search over very large datasets; on modern hardware it can index over 95GB/hour. Operating System: OS Independent.
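Lucene is a Java library rather than a standalone server, but the jars it ships with include a small demo that shows the index-then-search cycle from the command line. A minimal sketch, assuming the lucene-core and lucene-demo jars are available locally (the jar names, `./docs` directory and `./index` path are placeholders to adjust for your version and layout):

```shell
# Classpath placeholder; point at your actual Lucene jars.
CP=lucene-core.jar:lucene-demo.jar

# Build an index under ./index from the text files in ./docs.
java -cp "$CP" org.apache.lucene.demo.IndexFiles -index ./index -docs ./docs

# Search that index for the term "hadoop".
java -cp "$CP" org.apache.lucene.demo.SearchFiles -index ./index -query hadoop
```

In a real application you would call the same `IndexWriter` and `IndexSearcher` machinery from your own Java code; the demo classes are just thin wrappers around them.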
2 / 6
Solr is an enterprise search platform built on the Lucene library. It powers the search capabilities of many large sites, including Netflix, AOL, CNET and Zappos. Operating System: OS Independent.
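Because Solr exposes indexing and querying over HTTP, any HTTP client can drive it. A quick sketch assuming a Solr instance running locally on its default port 8983; the document fields and the query term are illustrative:

```shell
# Base URL placeholder for a local default-port Solr install.
SOLR=http://localhost:8983/solr

# Index one document using the XML update format, committing immediately.
curl "$SOLR/update?commit=true" -H 'Content-Type: text/xml' \
  --data-binary '<add><doc>
    <field name="id">1</field>
    <field name="title">Hadoop log aggregation</field>
  </doc></add>'

# Query for documents whose title mentions "hadoop", returned as JSON.
curl "$SOLR/select?q=title:hadoop&wt=json"
```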
3 / 6
Sqoop transfers bulk data between Hadoop and RDBMSes and data warehouses. In March 2012 it became a top-level Apache project. Operating System: OS Independent.
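Sqoop is driven from the command line and runs its transfers as MapReduce jobs. A hypothetical import of a single MySQL table into HDFS, where the host, database, table name and credentials are all placeholders:

```shell
# Import the "orders" table from MySQL into HDFS using four parallel map tasks.
# -P prompts for the database password instead of putting it on the command line.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username loader -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```

The reverse direction works the same way: `sqoop export` pushes files from HDFS back into an RDBMS table.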
4 / 6
Another Apache project, Flume collects, aggregates and transfers log data from applications to HDFS. It's Java-based, robust and fault-tolerant. Operating System: Windows, Linux, OS X.
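In recent (NG) releases, a Flume agent is wired together in a properties file: a source reads events, a channel buffers them, and a sink writes them out. A minimal sketch that tails an application log into HDFS; the agent name `a1`, the log path and the HDFS URL are placeholders:

```shell
# Write a one-source, one-memory-channel, one-HDFS-sink agent config.
cat > agent.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/applogs
a1.sinks.k1.channel = c1
EOF

# Start the agent (requires a Flume NG install and a reachable HDFS).
flume-ng agent --name a1 --conf-file agent.conf
```

The memory channel is the simplest choice; for stronger delivery guarantees Flume also offers durable channels that survive an agent restart.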
5 / 6
Built on top of HDFS and MapReduce, Chukwa collects data from large distributed systems. It also includes tools for displaying and analyzing the data it collects. Operating System: Linux, OS X.