Big Data: 9 Steps to Extract Insight from Unstructured Data

Wednesday Jan 7th 2015 by Guest Author

When data scientists analyze unstructured data, they need to make sense of disparate data sources.

The increasing digitization of information in recent years, coupled with the proliferation of multi-channel processes and transactions, has resulted in a data deluge. The world's aggregate volume of data now doubles at ever-shorter intervals. According to Gartner, about 80% of the data held by an organization is unstructured, comprising information from customer calls, emails and social media feeds. This is in addition to the voluminous diagnostic information logged by embedded and user devices. Performing a proper analysis of even well-organized data can be daunting; making sense of unstructured data is harder still.

Nevertheless, organizations have to study both structured and unstructured data to arrive at meaningful business decisions, including determining customer sentiment, complying with e-discovery requirements and personalizing their products for their customers. Not only do they have to analyze information provided by consumers and other organizations; they must also scrutinize information collected from devices. This ensures both that the organization stays on top of any network security threats and that its embedded devices function properly.

While sifting through vast amounts of information can look like a lot of work, there are rewards. By reading large, disparate sets of unstructured data, one can identify connections between seemingly unrelated data sources and find patterns. What makes this method of analysis so effective is that it enables the discovery of new trends: traditional methods only work with what is already quantifiable, while looking through unstructured data can surface findings that nobody thought to measure.

There are nine steps to analyze unstructured data so that one can see more than meets the eye:

1. Make sense of the disparate data sources

Before one can begin, one needs to know what sources of data are important for the analysis. One information channel is log files from devices, but that source won't be of much help when searching for user trends. If the information being analyzed is only tangentially related to the topic at hand, it should be set aside. Instead, only use information sources that are absolutely relevant.

2. Sign off on the method of analytics and find a clear way to present the results

The analysis is useless if it is not clear what the end result should be. One must understand what sort of answer is needed - is it a quantity, a trend or something else? In addition, one must provide a roadmap for what to do with the results so that they can be used in a predictive analytics engine before undergoing segmentation and integration into the business's information store.

3. Decide the technology stack for data ingestion and storage

Even though the raw data can come from a wide variety of sources, the results of the analysis must be placed in a technology stack or cloud-connected information store so that they can be easily utilized. The choice of data storage and retrieval technology often depends on scalability, volume, variety and velocity requirements. A potential technology stack should be carefully evaluated against the final requirements, after which the information architecture of the project is set.

A few likely influential requirements are that the results of the analysis must be available in real time, that access to them must be highly available, and that the platform must keep functioning in a real-time multi-tenant environment. Real-time access is crucial, as it has become important for e-commerce companies to provide real-time quotes. This requires tracking activity in real time and presenting offers based on the results of a predictive analytics engine. Technologies that can provide this include Storm, Flume and Lambda. High availability is crucial for ingesting information from social media: the technology platform must ensure that no data is lost from a real-time stream. As part of a data redundancy plan, it is a good idea to hold incoming information in a messaging queue such as Apache Kafka. The ability to function in real-time multi-tenancy environments is needed if the results must avoid state changes and continue to be mutable data.
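The buffering idea can be illustrated with a minimal sketch. It uses Python's standard-library queue as an in-process stand-in for a message broker; it is not Apache Kafka and does not use Kafka's API. The function names and the sample events are assumptions made for illustration only.

```python
import queue
import threading

def ingest(events, buffer):
    """Producer: push every incoming event onto the buffer, then a sentinel."""
    for event in events:
        # put() blocks while the buffer is full, so a slow consumer
        # delays the producer instead of causing events to be dropped.
        buffer.put(event)
    buffer.put(None)  # sentinel: no more events

def drain(buffer, sink):
    """Consumer: move events from the buffer into the sink until the sentinel."""
    while True:
        event = buffer.get()
        if event is None:
            break
        sink.append(event)

def run_pipeline(events, max_buffered=2):
    """Run one producer and one consumer around a bounded buffer."""
    buffer = queue.Queue(maxsize=max_buffered)
    sink = []
    consumer = threading.Thread(target=drain, args=(buffer, sink))
    consumer.start()
    ingest(events, buffer)
    consumer.join()
    return sink

# Hypothetical social media events used only for this sketch.
stream = [{"id": i, "text": f"post {i}"} for i in range(5)]
stored = run_pipeline(stream)
print(len(stored), "events stored")
```

A real deployment would replace the in-memory queue with a durable, replicated broker, since an in-process buffer is lost if the process stops.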

4. Keep information in a data lake until it has to be stored in a data warehouse

Traditionally, an organization obtained or generated information, sanitized it and stored it away. For example, if the information source was an HTML file, the text might be stripped and the rest discarded, such that information was lost during storage in a data warehouse.

Anything useful that was discarded in the initial data load was lost as a result, and the only thing one could do with the data was what remained possible after the extraneous information had been stripped away. The appeal of this prior strategy was that the data was in a pristine, mutable format that could be used whenever it was needed. However, with the advent of Big Data, it has become common practice to do the opposite. With a data lake, information is stored in its native format until it is actually deemed useful and needed for a specific purpose, preserving metadata or anything else that might assist in the analysis.
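As a rough sketch of this approach, the snippet below writes a document to a directory unchanged and records descriptive metadata in a sidecar file next to it. The directory layout, the ".meta.json" naming and the metadata fields are assumptions chosen for illustration; an actual data lake would typically sit on distributed or cloud object storage.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def store_raw(lake_dir, name, payload, source):
    """Store the payload byte-for-byte and write a metadata sidecar file."""
    lake = Path(lake_dir)
    lake.mkdir(parents=True, exist_ok=True)

    raw_path = lake / name
    raw_path.write_bytes(payload)  # native format, nothing stripped

    metadata = {
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    meta_path = lake / (name + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return raw_path, meta_path

# Hypothetical HTML document; the markup is kept rather than discarded.
html = b"<html><head><title>Review</title></head><body><p>Great phone!</p></body></html>"
lake_dir = tempfile.mkdtemp()
raw_path, meta_path = store_raw(lake_dir, "review-001.html", html, source="web-form")
print(raw_path.name, meta_path.name)
```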

5. Prepare the data for storage

When one needs to make use of the data, it is best to keep the original file intact and clean up a copy. In a text file, there can be a lot of noise or shorthand that can obscure valuable information. It is good practice to cleanse noise such as extra whitespace and symbols, and to convert informal text in strings to formal language. If the language of the text can be detected, the text should be categorized accordingly. Duplicate results should be removed, missing values treated, and off-topic information removed from the dataset.
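A minimal sketch of these cleaning steps follows, using only the standard library. The shorthand dictionary and the sample records are invented for illustration; language detection and off-topic filtering are left out because they need dedicated tooling.

```python
import re

# Hypothetical shorthand-to-formal mapping; a real one would be far larger.
SHORTHAND = {"u": "you", "gr8": "great", "thx": "thanks", "pls": "please"}

def clean_text(text):
    """Lowercase, replace symbols with spaces, collapse whitespace, expand shorthand."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [SHORTHAND.get(word, word) for word in text.split()]
    return " ".join(words)

def prepare(records):
    """Return cleaned copies, skipping missing values and duplicates.

    The input list is not modified, so the original data stays intact.
    """
    seen = set()
    cleaned = []
    for record in records:
        if not record or not record.strip():
            continue  # missing or empty value
        text = clean_text(record)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

raw = ["Thx!!  gr8   service :)", "thx gr8 service", None, "   ", "Pls call me back..."]
prepared = prepare(raw)
print(prepared)
```

Here missing values are simply dropped; depending on the analysis, imputing or flagging them may be the better treatment.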

6. Retrieve useful information

Through the use of natural language processing and semantic analysis, one can make use of part-of-speech tagging and named-entity recognition to extract common named entities, such as "person," "organization" and "location," and their relationships. From this, one can create a term frequency matrix to understand the word patterns and flow in the text.
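The term frequency matrix can be sketched in a few lines. This example assumes naive whitespace tokenization and invented sample documents; tagging and entity extraction are not shown because they require an NLP library.

```python
from collections import Counter

def term_frequency_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical customer comments.
docs = [
    "battery life is great",
    "battery drains fast",
    "great screen great battery",
]
vocabulary, matrix = term_frequency_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one term, so terms that recur across rows (here "battery" and "great") stand out immediately.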

7. Ontology evaluation

Through analysis, one can then create the relationships among the sources and the extracted entities so that a structured database can be designed to specifications. This can take time, but the insights provided can be worth it for an organization.
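One possible shape for such a structured database is sketched below with an in-memory SQLite database. The two-table schema, the predicate names and the sample entities are all assumptions made for illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,          -- e.g. person, organization, location
    UNIQUE (name, kind)
);
CREATE TABLE relation (
    subject_id INTEGER NOT NULL REFERENCES entity(id),
    predicate  TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES entity(id),
    source     TEXT NOT NULL     -- which data source the relationship came from
);
""")

def entity_id(name, kind):
    """Insert the entity if it is new and return its id."""
    conn.execute(
        "INSERT OR IGNORE INTO entity (name, kind) VALUES (?, ?)", (name, kind)
    )
    row = conn.execute(
        "SELECT id FROM entity WHERE name = ? AND kind = ?", (name, kind)
    ).fetchone()
    return row[0]

def relate(subject, predicate, obj, source):
    """Record a relationship between two (name, kind) entities."""
    conn.execute(
        "INSERT INTO relation VALUES (?, ?, ?, ?)",
        (entity_id(*subject), predicate, entity_id(*obj), source),
    )

# Hypothetical extracted entities and relationships.
relate(("Alice", "person"), "works_for", ("Acme", "organization"), "email-17")
relate(("Acme", "organization"), "located_in", ("Boston", "location"), "call-log-3")

rows = conn.execute("""
    SELECT s.name, r.predicate, o.name
    FROM relation r
    JOIN entity s ON s.id = r.subject_id
    JOIN entity o ON o.id = r.object_id
    ORDER BY r.rowid
""").fetchall()
entity_count = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
print(rows)
```

Keeping the source alongside each relationship makes it possible to trace an insight back to the unstructured record it came from.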

8. Statistical modeling and execution

Once the database has been created, the data must be classified and segmented. It can save time to make use of supervised and unsupervised machine learning, such as the K-means, Logistic Regression, Naïve Bayes, and Support Vector Machine algorithms. These tools can be used to find similarities in customer behavior, target campaigns and classify documents. Customer sentiment can be determined through sentiment analysis of reviews and feedback, which helps to shape future product recommendations, identify overall trends and guide the introduction of new products and services.
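To make the sentiment analysis idea concrete, here is a small multinomial Naïve Bayes classifier written with the standard library only. The training reviews are invented and far too few for real use, and words never seen in training are simply skipped; in practice a machine learning library and a much larger labeled dataset would be used.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # unseen words carry no evidence in this sketch
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled reviews.
reviews = [
    ("great battery and great screen", "positive"),
    ("love the fast delivery", "positive"),
    ("terrible battery and slow support", "negative"),
    ("slow delivery and broken screen", "negative"),
]
model = train(reviews)
print(classify("great fast screen", model))
print(classify("slow broken support", model))
```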

The most relevant topics discussed by customers can be analyzed through temporal modeling techniques, which can extract the topics or events that customers are sharing via social media, feedback forms or any other platform.

9. Obtain insight from the analysis and visualize it

Ultimately, all of the above steps come down to the end result, whatever it might be. It is crucial that the results of the analysis are presented in tabular and graphical formats that provide actionable insights for the end user. To ensure that the information can be used and accessed by the intended parties, it should be rendered so that it can be reviewed on a handheld device or in a web-based tool, allowing the recipient to take the recommended actions on a real-time or near-real-time basis.
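As a simple illustration of pairing a tabular view with a graphical one, the sketch below renders the same results as a plain-text table and as a text bar chart. The sentiment counts are invented, and a production system would more likely feed a dashboard or charting tool.

```python
def render_table(rows, headers):
    """Render rows as a fixed-width text table with a header separator."""
    table = [headers] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    lines = []
    for index, row in enumerate(table):
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(" | ".join(cells).rstrip())
        if index == 0:
            lines.append("-+-".join("-" * width for width in widths))
    return "\n".join(lines)

def render_bars(rows, scale=1):
    """Render (label, count) pairs as a horizontal text bar chart."""
    label_width = max(len(label) for label, _ in rows)
    return "\n".join(
        f"{label.ljust(label_width)} {'#' * (count // scale)} {count}"
        for label, count in rows
    )

# Hypothetical output of a sentiment analysis run.
sentiment_counts = [("positive", 12), ("neutral", 5), ("negative", 3)]
table = render_table(sentiment_counts, ["sentiment", "reviews"])
bars = render_bars(sentiment_counts)
print(table)
print(bars)
```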


New forms of information such as social media and machine logs have become crucial to organizations for their ability to provide unique content and diagnostic intelligence once they are properly analyzed. Traditional or conventional data scientists will have to acquire new skill sets to analyze unstructured data. While enterprises develop content intelligence capabilities, the real power lies in fusing different data formats and overlaying structured data with semi-structured and unstructured data sources for insights into the mind of a user or the life of a device.

About the author:

Salil Godika, Chief Strategy & Marketing Officer and Industry Group Head, Happiest Minds

