Big Data tools are proliferating quickly in response to major demand. In the decade since Big Data emerged as a concept and business strategy, thousands of tools have emerged to perform various tasks and processes, all of them promising to save you time and money and to uncover business insights that will make you money. Clearly, Big Data analytics tools are enjoying a growing market.
Many of them started out as open source projects, like Hadoop, the initial Big Data software framework, but commercial entities have sprung up rapidly to provide either new tools or commercial support and development for the open source products.
Weeding through them all can be a challenge, especially since many Big Data tools have a single purpose and you can do many different things with Big Data, so your analytics toolbox can fill up quickly. We’ll run down a list of major Big Data analytics tools and then three major categories to keep in mind, as recommended by an expert consultant in this field.
Major Big Data Tools
As noted earlier, most Big Data tools serve a single purpose, and there are multiple ways to use Big Data. So we will break things down by category, then cover the analytics tools in each.
Data Storage and Management
Big Data all starts with the data store. That means starting with Hadoop, the Big Data framework. It’s an open-source software framework maintained by the Apache Software Foundation for distributed storage and processing of very large datasets on clusters of commodity computers.
Storage, obviously, is critical because of the massive volume of information needed for Big Data. But more than that, there needs to be some way to corral all that data into an organized, governed structure that will yield insight. So Big Data storage and management is truly foundational – an analytics platform goes nowhere without it. In some cases, these solutions include staff training.
Major players in this field are:
Cloudera
Essentially Hadoop with some extra services added on, which you will need because Big Data is not a trivial exercise. Cloudera’s services team can not only help you build your Big Data cluster but also train your people to better access the data.
Talend
Talend offers a broad array of solutions built around its Integration Platform, which combines big data, cloud, application, and real-time data integration with data preparation and master data management.
Talend Big Data integration includes data quality and governance features.
Data Cleaning
Before you can really process the data for insights, you need to clean it up, transform it, and turn it into something searchable. Big Data sets tend to be unstructured and unorganized, so some kind of cleaning or transformation is necessary.
Data cleaning is ever more necessary in an age where data can come from anywhere: mobile, IoT, social media. Not all of this data is easily "cleaned" to yield its insights, so a good data cleaning tool can make all the difference. In fact, in the years ahead look for effectively cleaned data to be a competitive differentiator between acceptable Big Data systems and those that are truly excellent.
OpenRefine
OpenRefine is an easy-to-use open source tool for cleaning up messy data by removing duplicates, empty fields and other errors. It has a sizable community around it that can help.
DataCleaner
Like OpenRefine, DataCleaner transforms semi-structured data sets into clean, readable data sets that data visualization tools can read. The company also offers data warehousing and data management services.
Excel
Seriously, it has its uses. You can import data from a wide variety of data sources. Excel is particularly good with manual data entry and copy/paste operations. It can remove duplicates, do find and replace, check spelling, and apply a number of formulas for transforming data. But it gets bogged down quickly and is not ideal for large data sets.
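The cleaning operations mentioned above (removing duplicates, dropping records with empty fields, and find and replace) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name clean_records, the sample records and the replacement table are all hypothetical, and it does not reflect how OpenRefine, DataCleaner or Excel work internally.

```python
def clean_records(records, replacements=None):
    """Trim values, apply find-and-replace, drop rows with empty fields,
    and remove duplicate rows (compared case-insensitively)."""
    replacements = replacements or {}
    seen = set()
    cleaned = []
    for row in records:
        # Trim whitespace and apply whole-value find-and-replace.
        normalized = {}
        for key, value in row.items():
            text = str(value).strip()
            normalized[key] = replacements.get(text, text)
        # Skip rows that have any empty field.
        if any(value == "" for value in normalized.values()):
            continue
        # A lowercase fingerprint of the row is used to detect duplicates.
        fingerprint = tuple((k, v.lower()) for k, v in sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


# Hypothetical sample data, invented for illustration.
raw = [
    {"name": "Acme Corp", "country": "USA"},
    {"name": " acme corp ", "country": "U.S.A."},  # duplicate after cleanup
    {"name": "Globex", "country": ""},             # empty field
    {"name": "Initech", "country": "USA"},
]
cleaned = clean_records(raw, replacements={"U.S.A.": "USA"})
print(cleaned)
```

Keeping the first occurrence of each duplicate is a design choice made here for simplicity; a real cleaning tool would typically let you review and merge near-duplicates instead.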
Data Mining
Once data is cleaned and prepared for examination, you begin the search process through data mining. This is where the actual discovery happens and where decisions and predictions are made.
Data mining is, in many ways, the true core of the Big Data process. A data mining solution is often fabulously complex under the hood, but strives to offer a visually appealing, user-friendly interface – easier said than done. The other challenge with data mining tools: they require humans to develop the queries, so a data mining tool is no better than the professional who's using it.
RapidMiner
RapidMiner is an easy-to-use predictive analysis tool with a very user-friendly visual interface, which means you don’t have to write code to run the analytics products.
IBM SPSS Modeler
IBM SPSS Modeler is a suite of five products for data mining, meant for enterprise-scale advanced analytics. Plus, IBM's services and consulting are second to none.
Teradata
Teradata offers end-to-end solutions for data warehousing, Big Data and analytics, and marketing applications, along with business services, consulting, training and support. The aim is to help you become a truly data-driven business.
Like many current Big Data tools, the RapidMiner solution embraces the cloud.
Data Visualization
Data visualization is how your data is displayed in a readable, usable format. It’s where you see charts and graphs and other images that put data into perspective.
The visualization of data is as much an art form as a science. As Big Data moves from the C-suite, with its bevy of supporting data scientists, to the company at large, it's highly important that the visualization be accessible to a wide array of staffers. Sales reps, IT support, mid-level management – each of these teams needs to be able to make sense of it, so the emphasis is on usability. However, an easily readable visualization is sometimes at odds with a readout from a deep feature set, which creates one of the primary challenges of data visualization tools.
Tableau
The leader in this field, Tableau offers data visualization tools that focus on business intelligence, creating all kinds of maps, charts, plots and more without the need to know programming. It has five products overall, with a free version called Tableau Public for potential customers to experiment with.
Silk
A simpler version of Tableau, Silk lets you visualize data as maps and charts without requiring any programming. It even tries to visualize your data automatically when you first load it. It also makes it easy to publish results online.
Chartio
Chartio uses its own visual query language to create powerful dashboards with just a few clicks, without having to know SQL or other modeling languages. Its main difference from others is that it connects directly to databases, so no data warehouse is needed in between.
IBM Watson Analytics
IBM Watson Analytics combines machine learning (ML) and artificial intelligence (AI) to provide a smart data science assistant, which acts as a guide for users with a wide range of data science skill sets, from business analyst to data scientist.
Three Levels of Big Data Tools
In terms of level of sophistication and market strategy, Big Data tools break down into a three-level pyramid, says Ritesh Ramesh, CTO for the mobile data and analytics program at PwC.
Layer One: The largest layer is a wide array of open source tools. Every company, like Cloudera and Hortonworks, started this way. There is very little value here other than the basic infrastructure, servers and storage. Most of the cloud players have commoditized that layer.
Layer Two: This is where most of these vendors have realized that, to increase their market share, they have to build some proprietary apps on top of the open source tools to separate themselves from the rest. Cloudera, for example, built a number of things, like the data science platform that sits on the Hadoop core.
Layer Three: These are vertical-specific apps. Most of these companies are working with system integrators like PwC, Cognizant or Accenture. That’s where the real value is – and this is also a highly effective competitive strategy for Big Data tool makers.
Ramesh said there are three major areas of need in tools, beyond the basic functions. The first is data wrangling tools, he said. “Data learning tools are a great tool in the toolkit for clients to do data quality and profiling, to process through 50 million rows of data to find insights,” he said.
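As a rough illustration of the data quality and profiling work described above, the following Python sketch makes a single pass over a set of rows and counts missing and distinct values per column. The function name profile_rows and the sample rows are hypothetical. Holding every distinct value in memory, as this sketch does, would not scale to 50 million rows without sampling or approximate counting.

```python
from collections import Counter


def profile_rows(rows):
    """Return per-column counts of total rows, missing values and distinct values."""
    total = 0
    missing = Counter()
    distinct = {}
    for row in rows:
        total += 1
        for column, value in row.items():
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[column] += 1
            else:
                distinct.setdefault(column, set()).add(value)
    columns = set(missing) | set(distinct)
    return {
        column: {
            "rows": total,
            "missing": missing[column],
            "distinct": len(distinct.get(column, ())),
        }
        for column in sorted(columns)
    }


# Hypothetical sample rows, invented for illustration.
rows = [
    {"device": "phone", "region": "EU"},
    {"device": "phone", "region": ""},
    {"device": "tablet", "region": None},
]
profile = profile_rows(rows)
print(profile)
```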
The second major category of apps is governance, such as how metadata definitions are created and maintained. “A lot of people struggle with that. People dump a lot of junk into the data lake. There are not many tools in the market that can effectively work in the lake. Since a lot of this work is done by IT people they are more interested in pumping data into the lake and not putting a governance structure around it,” he said.
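One simple way to picture a governance structure around a data lake is a gate that rejects datasets lacking basic metadata definitions. The Python sketch below assumes a hypothetical policy (the REQUIRED_METADATA fields) and hypothetical dataset descriptions; it is not modeled on any particular governance product.

```python
# Hypothetical policy: every dataset needs these metadata fields.
REQUIRED_METADATA = ("owner", "description", "schema")


def missing_metadata(dataset):
    """List the required metadata fields that are absent or empty."""
    metadata = dataset.get("metadata", {})
    return [field for field in REQUIRED_METADATA if not metadata.get(field)]


def admit_to_lake(datasets):
    """Split datasets into accepted names and rejected names with their gaps."""
    accepted, rejected = [], {}
    for dataset in datasets:
        gaps = missing_metadata(dataset)
        if gaps:
            rejected[dataset["name"]] = gaps
        else:
            accepted.append(dataset["name"])
    return accepted, rejected


# Hypothetical datasets, invented for illustration.
datasets = [
    {
        "name": "web_clicks",
        "metadata": {
            "owner": "marketing",
            "description": "Raw clickstream events",
            "schema": ["user_id", "url", "timestamp"],
        },
    },
    {"name": "misc_dump", "metadata": {"owner": "it"}},
]
accepted, rejected = admit_to_lake(datasets)
print(accepted, rejected)
```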
The third need that shows up frequently is security, said Ramesh. “People want a single product with all layers of security access, column, row, and objects. They want one product that supports user access and security for diff data objects. That space is also very green,” he said.
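To make the idea of row-level and column-level access concrete, here is a minimal Python sketch in which each role has a policy naming the columns it may see and a filter for the rows it may see. The roles, policies and order records are all hypothetical, and object-level security is left out for brevity.

```python
def apply_policy(rows, policy):
    """Return only the rows and columns that the given policy allows."""
    allowed_columns = policy["columns"]
    row_filter = policy["row_filter"]
    return [
        {column: row[column] for column in allowed_columns if column in row}
        for row in rows
        if row_filter(row)
    ]


# Hypothetical roles and rules, invented for illustration only.
POLICIES = {
    "regional_rep": {
        "columns": ["customer", "region"],                  # column-level rule
        "row_filter": lambda row: row["region"] == "EMEA",  # row-level rule
    },
    "finance": {
        "columns": ["customer", "region", "revenue"],
        "row_filter": lambda row: True,
    },
}

orders = [
    {"customer": "Acme", "region": "EMEA", "revenue": 1200},
    {"customer": "Globex", "region": "APAC", "revenue": 900},
]

rep_view = apply_policy(orders, POLICIES["regional_rep"])
finance_view = apply_policy(orders, POLICIES["finance"])
print(rep_view)
print(finance_view)
```

Putting both rules in one policy object mirrors the wish for a single product that covers every layer, although a production system would enforce these rules in the data store itself instead of in application code.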