Big Data Architecture

Big data architecture is the foundation for big data analytics. Think of big data architecture as an architectural blueprint of a large campus or office building. Architects begin by understanding the goals and objectives of the building project, and the advantages and limitations of different approaches. It’s not an easy task, but it’s perfectly doable with the right planning and tools.

System architects go through a similar process to plan big data architecture. They meet with stakeholders to understand company objectives for its big data, and plan the computing framework with appropriate hardware and software, data sources and formats, analytics tools, data storage decisions, and results consumption.

Do I Need Big Data Architecture?

Not everyone needs a big data architecture. Single computing tasks rarely exceed 100GB of data, which does not require one. Unless you are consistently analyzing terabytes or petabytes of data, look to a scalable server instead of a massively scale-out architecture like Hadoop. If you need analytics, consider a scalable storage array that offers native analytics for stored data.

You probably do need big data architecture if any of the following applies to you:

You want to extract information from extensive networking or web logs.
You process massive datasets over 100GB in size, and some of these computing tasks run 8 hours or longer.
You are willing to invest in a big data project, including third-party products to optimize your environment.
You store large amounts of unstructured data that you need to summarize or transform into a structured format for better analytics.
You have multiple large data sources to analyze, including structured and unstructured.
You want to proactively analyze big data for business needs, such as analyzing store sales by season and advertising, applying sentiment analysis to social media posts, investigating email for suspicious communication patterns, or all of the above.

With use cases like these, chances are that your organization will benefit from a big data architecture expressly built for these challenging tasks. Plan for an environment that will capture, store, transform, and communicate this valuable intelligence.

Planning the Big Data Architecture

Big data architecture includes mechanisms for ingesting, protecting, processing, and transforming data into filesystems or database structures. Analytics tools and analyst queries run in the environment to mine intelligence from the data, and the results are delivered through a variety of outputs.

The architecture has multiple layers. Let’s start by discussing the Big Four logical layers that exist in any big data architecture.

Big data sources layer: Data sources for big data architecture are all over the map. Data can come from company servers and sensors, or from third-party data providers. The big data environment can ingest data in batch mode or in real time. A few data source examples include enterprise applications like ERP or CRM, MS Office docs, data warehouses, relational database management systems (RDBMS) and other databases, mobile devices, sensors, social media, and email.
Data massaging and storage layer: This layer receives data from the sources. If necessary, it converts unstructured data to a format that analytics tools can understand, and stores the data according to its format. The big data architecture might store structured data in an RDBMS, and unstructured data in a specialized file system like Hadoop Distributed File System (HDFS) or in a NoSQL database.
Analysis layer: The analytics layer interacts with stored data to extract business intelligence. Multiple analytics tools operate in the big data environment. Structured data supports mature technologies like sampling, while unstructured data needs more advanced (and newer) specialized analytics toolsets.
Consumption layer: This layer receives analysis results and presents them to the appropriate output. Outputs serve human viewers, applications, and business processes.
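
As a rough illustration of how the four layers hand data to one another, here is a minimal single-process sketch. The sample records, field names, and function names are all hypothetical, and a real environment would spread each layer across separate systems.

```python
import json
from collections import defaultdict

# Sources layer: hypothetical raw inputs arriving in different formats.
RAW_RECORDS = [
    '{"store": "north", "season": "winter", "sales": 120.0}',  # JSON record
    "south,summer,200.5",                                      # CSV line
    '{"store": "north", "season": "summer", "sales": 80.0}',   # JSON record
]

def massage(raw):
    """Massaging and storage layer: convert each record to one common format."""
    if raw.lstrip().startswith("{"):
        return json.loads(raw)
    store, season, sales = raw.split(",")
    return {"store": store, "season": season, "sales": float(sales)}

def analyze(records):
    """Analysis layer: total sales per season."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["season"]] += rec["sales"]
    return dict(totals)

def consume(results):
    """Consumption layer: format the results as lines for a report."""
    return [f"{season}: {total:.2f}" for season, total in sorted(results.items())]

stored = [massage(r) for r in RAW_RECORDS]
report = consume(analyze(stored))
print(report)
```

Each function only depends on the output of the layer before it, which is the property that lets the layers be planned and scaled separately.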

In addition to the logical layers, four major processes operate cross-layer in the big data environment: data source connection, governance, systems management, and quality of service (QoS).

Connecting to data sources: Fast data ingestion requires connectors and adapters that can efficiently connect to different storage systems, protocols, and networks, and that can handle data formats running the gamut from database records to social media content to sensor readings.
Governing big data: Big data architecture includes governance provisions for privacy and security. Organizations can choose to use native compliance tools on analytics storage systems, invest in specialized compliance software for their Hadoop environment, or sign service level security agreements with their cloud Hadoop provider. Compliance policies must operate from the point of ingestion through processing, storage, analysis, and deletion or archive.
Managing systems: Big data architecture is typically built on large-scale distributed clusters with highly scalable performance and capacity. IT must continually monitor and address system health via central management consoles. If your big data environment is in the cloud, you will still need to spend time and effort to establish and monitor strong service level agreements (SLAs) with your cloud provider.
Protecting quality of service: QoS is the framework that supports defining data quality, compliance policies, ingestion frequency and sizes, and data filtering. For example, a public cloud provider experimented with QoS-based data storage scheduling in a cloud-based, distributed big data environment. The provider wanted to improve the availability and response time of the data massaging and storage layer, so it automatically routed ingested data to predefined virtual clusters based on QoS service levels.
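
The QoS-based routing in that example can be pictured with a small sketch. The service level names, virtual cluster names, and the `IngestedBatch` type below are assumptions made for illustration; the provider's actual scheduler is not described in enough detail to reproduce.

```python
from dataclasses import dataclass

# Hypothetical QoS service levels mapped to hypothetical virtual clusters.
CLUSTER_BY_LEVEL = {
    "gold": "vcluster-low-latency",
    "silver": "vcluster-standard",
    "bronze": "vcluster-archive",
}

@dataclass
class IngestedBatch:
    source: str
    size_mb: int
    qos_level: str

def route(batch, default_level="bronze"):
    """Pick the virtual cluster for a batch; unknown levels use the default."""
    level = batch.qos_level if batch.qos_level in CLUSTER_BY_LEVEL else default_level
    return CLUSTER_BY_LEVEL[level]

batches = [
    IngestedBatch("payments", 512, "gold"),
    IngestedBatch("web-logs", 20480, "bronze"),
    IngestedBatch("sensor-feed", 64, "unknown"),
]
assignments = {b.source: route(b) for b in batches}
print(assignments)
```

Routing at ingestion time means the storage layer never has to move data later to meet a service level.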

Big data architecture combines myriad concerns into one all-encompassing plan to make the most of a company’s data mining efforts.

Critical Components

Let’s look at a big data architecture using Hadoop as a popular ecosystem. Hadoop is open source, and several vendors and large cloud providers offer Hadoop systems and support. There are also numerous open source and commercial products that expand Hadoop capabilities.

Core Clusters

Hadoop architecture is cluster architecture. Hadoop runs on commodity servers; the recommended configuration is dual-CPU servers with 4-8 cores each and at least 48GB of RAM. (Using accelerated analytics technologies like Apache Spark will speed up the environment even more.) Storage must also be highly scalable.

Another option is a cloud Hadoop environment, where the cloud provider manages the infrastructure for you. The cloud might add latency, you’ll be in a shared environment, and you will need to guard against vendor lock-in. But the cloud is an excellent choice for a new Hadoop installation, or when you know that you don’t want to grow your data center racks or IT staff to support on-premises Hadoop.

Loading the Data

Loading data onto the clusters is an ongoing process. Hadoop supports both batched data, such as files or records loaded at specific times of the day, and event-driven data, such as transactional data loaded as the transactions occur. Software tools for loading source data include Apache Sqoop for batch loading and Apache Flume for event-driven data loading.
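
To make the difference between the two loading modes concrete, here is a minimal sketch in plain Python. It does not use the Sqoop or Flume APIs; `ClusterStore`, `load_batch`, and `EventLoader` are hypothetical names, and the flush threshold is an arbitrary assumption.

```python
class ClusterStore:
    """Hypothetical stand-in for cluster storage; counts write operations."""
    def __init__(self):
        self.rows = []
        self.writes = 0

    def write(self, rows):
        self.rows.extend(rows)
        self.writes += 1

def load_batch(store, records):
    """Batch mode: write a whole file's worth of records in one operation."""
    store.write(list(records))

class EventLoader:
    """Event-driven mode: buffer events as they occur, flush when full."""
    def __init__(self, store, flush_size=2):
        self.store = store
        self.flush_size = flush_size
        self.buffer = []

    def on_event(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.store.write(self.buffer)
            self.buffer = []

store = ClusterStore()
load_batch(store, [{"id": 1}, {"id": 2}, {"id": 3}])  # nightly file load
loader = EventLoader(store, flush_size=2)
for tx in [{"id": 4}, {"id": 5}, {"id": 6}]:          # transactions as they occur
    loader.on_event(tx)
loader.flush()                                         # drain remaining events
print(store.writes, len(store.rows))
```

The batch path makes one large write on a schedule, while the event-driven path makes many small writes shortly after each event arrives.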

Your big data environment will also stage the incoming data for processing, including converting data as needed and sending it to the correct storage in the right format. Additional activities include partitioning data and assigning access controls.
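
The staging activities mentioned above, partitioning data and assigning access controls, might look something like the following sketch. The date-based partition layout, the record fields, and the role names are assumptions chosen for illustration rather than a fixed convention.

```python
from collections import defaultdict
from datetime import datetime

def partition_key(record):
    """Build a date-based partition path from an ISO 8601 timestamp."""
    ts = datetime.fromisoformat(record["timestamp"])
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

def stage(records, acl_by_source):
    """Group records by partition and tag each with the roles allowed to read it."""
    partitions = defaultdict(list)
    for rec in records:
        staged = dict(rec)  # copy so the incoming record is not modified
        staged["allowed_roles"] = acl_by_source.get(rec["source"], ["admin"])
        partitions[partition_key(rec)].append(staged)
    return dict(partitions)

# Hypothetical access control list keyed by data source.
ACL = {"crm": ["analyst", "admin"], "payments": ["admin"]}

incoming = [
    {"source": "crm", "timestamp": "2024-03-05T10:15:00", "value": 1},
    {"source": "payments", "timestamp": "2024-03-05T23:59:00", "value": 2},
    {"source": "sensors", "timestamp": "2024-03-06T00:01:00", "value": 3},
]
staged = stage(incoming, ACL)
print(sorted(staged))
```

Sources without an entry in the access control list fall back to admin-only access, which is a deliberately conservative default in this sketch.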

Processing the Data

Once the system has ingested, identified, and stored the data, it will automatically process it. This is a two-step process of transforming the data and then analyzing it. Transforming the data simply means processing it into analytics-ready formats, compressing it, or both.

In Hadoop, this is MapReduce territory. MapReduce is the core Hadoop component that filters and sorts (maps) data in parallel across nodes, and aggregates (reduces) the intermediate results returned in response to a query. MapReduce achieves high performance thanks to parallel operations across massive clusters, and its fault tolerance reassigns work from a failing node to a healthy one. MapReduce works on both structured and unstructured data.
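
The map and reduce phases can be illustrated with the classic word count, written here as a single-process Python sketch. This is not the Hadoop API: the function names are hypothetical, and on a real cluster the map calls would run in parallel on different nodes while the framework handles the grouping step between the phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)

# Two hypothetical input splits, as if stored on two different nodes.
splits = ["big data needs big clusters", "data clusters scale out"]

mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call looks at only its own split and each reduce call looks at only one key, the work can be spread across as many nodes as the cluster provides.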

Many analysts and vendors run MapReduce with additional algorithms, such as adding collaborative filtering to identify user preferences in Twitter data. Other analytics products replace MapReduce altogether, such as Google’s proprietary Cloud Dataflow.

Output and Querying

One of Hadoop’s shining features is that once data is processed and placed, different analytics tools can operate on the unchanging data set. There is no need to re-process it for different tools, or to copy it to different locations. The same copy of data serves for all queries.

Output covers a variety of destinations, including reports and dashboard visualizations for users, or next-step triggers in business processes.

Data Pipelines

Micro- and macro-pipelines enable discrete processing steps. Micro-pipelines operate at a step-based level to create sub-processes on granular data. In a typical scenario, one source of data is customer transactional data from the company’s primary data center. The data enters Hadoop so company analysts can investigate customer churn. However, compliance is an issue because the data includes customer credit card numbers. A micro-pipeline adds a granular processing step that cleans credit card numbers from the analyst team’s reports.
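
A micro-pipeline step like the one in this scenario could be sketched as follows. The regular expression, the Luhn checksum filter, and the mask text are assumptions for illustration; a production compliance step would need to cover more formats and should be reviewed against the applicable standard.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text, mask="[REDACTED]"):
    """Replace Luhn-valid card numbers with a mask; leave other numbers alone."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        return mask if luhn_valid(digits) else match.group()
    return CARD_PATTERN.sub(replace, text)

# 4111 1111 1111 1111 is a well-known test card number, not a real account.
line = "Customer 1042 paid with 4111 1111 1111 1111 on order 9981."
print(redact_cards(line))
```

The checksum filter keeps the step from masking other long numbers, such as order or tracking IDs, that happen to match the digit pattern.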

Macro-pipelines operate on a workflow level. They define 1) workflow control: what steps enable the workflow, and 2) action: what occurs at each stage to enable proper workflow.
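
The two parts of a macro-pipeline, workflow control and action, can be sketched as a list of stages where each stage pairs a control check with an action. The stage names, record fields, and the `run_workflow` function are hypothetical; dedicated workflow schedulers provide far richer control than this.

```python
def run_workflow(stages, data):
    """Run stages in order. Each stage is (name, control check, action)."""
    log = []
    for name, can_run, action in stages:
        if not can_run(data):          # workflow control: should this step run?
            log.append(f"{name}: skipped")
            continue
        data = action(data)            # action: what happens at this stage
        log.append(f"{name}: done")
    return data, log

stages = [
    ("validate",
     lambda rows: True,
     lambda rows: [r for r in rows if r.get("amount") is not None]),
    ("deduplicate",
     lambda rows: len(rows) > 1,
     lambda rows: list({r["id"]: r for r in rows}.values())),
    ("summarize",
     lambda rows: bool(rows),
     lambda rows: {"count": len(rows), "total": sum(r["amount"] for r in rows)}),
]

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate record
    {"id": 2, "amount": None},   # invalid record
    {"id": 3, "amount": 5.0},
]
summary, log = run_workflow(stages, rows)
print(summary, log)
```

Keeping the control check separate from the action makes it possible to change when a stage runs without touching what it does.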