As everyone talks about big data and how to turn data into insights, we are in a new age in which data is a great treasure that can boost our business. Thanks to the rise of cloud computing technologies, we can now adopt big data tools that store, analyze and process large amounts of data far faster than traditional data processing systems. So our core question becomes: how can we build our own big data project on the cloud efficiently and cost-effectively?
To answer this question, let’s first look at how things were done in the past. The major difference between traditional data processing and big data explains why big data has become such a game changer in today’s world.
Traditionally, a centralized database architecture was used to process data: large and complex problems, such as the ETL process that provides the underlying infrastructure for data integration, were solved by a single computer system.
To recall the basics, the ETL process is composed of three important activities:
- Extract: Read data from the source database.
- Transform: Convert the format of the extracted data so that it conforms to the requirements of the target database. Transformation is done by using rules or merging data with other data.
- Load: Write data to the target database.
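The three ETL activities can be sketched in a few lines of Python. This is purely illustrative: the source and target here are plain Python lists standing in for real databases.

```python
# A minimal, illustrative ETL sketch (not a production pipeline).
# Source and target are plain lists standing in for real databases.

def extract(source_rows):
    """Extract: read raw rows from the source."""
    return list(source_rows)

def transform(rows):
    """Transform: conform rows to the target schema
    (normalize names, cast amounts to numbers)."""
    return [{"name": r["name"].strip().title(), "amount": float(r["amount"])}
            for r in rows]

def load(rows, target):
    """Load: write the transformed rows to the target store."""
    target.extend(rows)
    return target

source = [{"name": "  alice ", "amount": "10.5"},
          {"name": "BOB", "amount": "3"}]
target = []
load(transform(extract(source)), target)
print(target)
```

In a traditional centralized architecture, all three steps run on a single system against the full dataset, which is exactly what becomes a bottleneck at big data scale.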
The ETL process is described in the picture below:
Caption: ETL process in traditional data processing architecture
A centralised architecture is costly and ineffective for processing large amounts of data. That is why modern big data systems are based on a distributed architecture, where a large block of data is divided into several smaller pieces, which provides better computing power, lower cost and improved performance.
As cloud computing technologies developed, it became possible to interconnect data and applications served from multiple computing nodes, even when they are in different geographic locations. Cloud providers such as Microsoft Azure or AWS use this distributed model to lower latency and provide better performance for their managed data processing services.
It is not only the processing architecture that has changed; the target data is changing too. Traditional database systems are based on structured data stored in a fixed format, such as data in a Relational Database Management System (RDBMS). In recent decades data volumes have been growing explosively, and data has become more varied and more complex as new sources such as social media and IoT devices have appeared alongside traditional ones. That is what we know as the 3 Vs of big data: high volume, high velocity and high variety.
On the cloud, processing big data comes down to two key questions: what can we do with big data, and how do we process it? When the volume, velocity or variety of the data makes traditional data stores too expensive or too slow to store or analyze it in a cost-effective manner, we need new cost-effective ways to process it. So, what do we need to process data in the cloud? Generally, there are 5 steps:
- Capture data: Also known as data ingestion. Depending on the problem to be solved, decide on the data sources and the data to be collected.
- Organize: Cleanse, organize and validate data. If data contains sensitive information, implement sufficient levels of security and governance.
- Integrate: Integrate with business rules and other relevant systems like database data warehouses, data lake etc in the cloud.
- Analyze: Real-time analysis, batch type analysis, reports, visualizations, advanced analytics.
- Act: Use the analysis and visualisations to solve the business problem.
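The five steps above can be sketched as a simple pipeline skeleton. This is purely illustrative: each function is a placeholder for what would be a managed cloud service, and the sensor data and threshold are made up.

```python
# Illustrative skeleton of the five cloud data-processing steps.
# Each stage stands in for a real managed service.

def capture():
    """Capture/ingest raw events from the chosen data sources."""
    return [{"sensor": "s1", "temp": 21.0}, {"sensor": "s1", "temp": 35.5}]

def organize(events):
    """Cleanse and validate; drop malformed records."""
    return [e for e in events if e.get("temp") is not None]

def integrate(events):
    """Apply a business rule (here: flag readings above a threshold)."""
    return [dict(e, alert=e["temp"] > 30) for e in events]

def analyze(events):
    """Aggregate for reporting."""
    return {"readings": len(events), "alerts": sum(e["alert"] for e in events)}

def act(report):
    """Act on the insight, e.g. notify an operator."""
    return f"{report['alerts']} alert(s) out of {report['readings']} readings"

summary = act(analyze(integrate(organize(capture()))))
print(summary)  # 1 alert(s) out of 2 readings
```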
In fact, the major cloud providers today follow the same popular pattern for processing data in the cloud, known as the Lambda architecture.
What is Lambda architecture?
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by combining both batch- and real-time stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data so it can be effectively used in building real-time data-processing systems.
The Lambda architecture has three layers: the batch layer, the speed layer and the serving layer, as described in the Lambda architecture official repository.
Caption: Lambda architecture (diagram from Opsgility.com, created by a Microsoft MVP; a really great site to learn Microsoft Azure)
Here all data is pushed into both the batch layer and the speed layer. The batch layer has a master dataset (an immutable, append-only set of raw data) and pre-computes the batch views. The serving layer holds the batch views for fast queries. The speed layer compensates for the batch layer's processing delay (to the serving layer) and deals with recent data only. Queries can be answered by merging results from the batch views and real-time views, or by querying either of them individually.
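Answering a query in a Lambda architecture can be modeled as merging a precomputed batch view with an incremental real-time view. The toy model below uses in-memory dictionaries; a real system keeps both views in distributed stores, and the page names and counts are made up.

```python
# Toy model of answering a query in a Lambda architecture:
# merge the batch view (computed over the master dataset up to the
# last batch run) with the real-time view (recent data only).

batch_view = {"page_a": 100, "page_b": 40}   # precomputed page-view counts
realtime_view = {"page_a": 3, "page_c": 7}   # counts since the last batch run

def query(key):
    """Answer a query by merging the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 103
print(query("page_c"))  # 7
```

The batch view is accurate but stale; the real-time view is fresh but covers only recent data. Merging the two gives a result that is both complete and up to date.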
Lambda architecture on Azure
Let’s take a look at how Microsoft Azure works with big data streaming processing and batch processing scenario by implementing Lambda architecture with different Azure services.
The table below summarizes how Azure data and analytics services fit into the Lambda architecture:
| Layer | Description | Azure services |
| --- | --- | --- |
| Batch layer (cold path) | Appends new incoming data to a single store called the master dataset and executes long-running, heavy computations against the entire master dataset. | Azure Data Lake Store, Azure Data Lake Analytics, Azure Data Factory |
| Speed layer (hot path) | Precomputes incoming data each time a new bit of data arrives. Unlike the batch layer, the speed layer does not recalculate the entire master dataset; it uses an incremental approach. The result of the speed layer precomputation is called a real-time view. | Event Hubs, IoT Hub, Stream Analytics, Azure HDInsight, etc. |
| Serving layer | Merges the results of the batch and speed layer computations. | Power BI or custom web applications consuming the provided data |
The schema below shows how this works on Azure.
Caption: Lambda Architecture on Azure (diagram from Opsgility.com, created by a Microsoft MVP; a really great site to learn Microsoft Azure)
How does each layer work?
Let’s take a look at the core services at each layer.
Hot path – Speed Layer:
From an operations perspective, maintaining two streams of data while ensuring the correct state of the data can be a complicated endeavour. Services on this layer need to read and act on data immediately, so that other systems can query the processed real-time data rather than having to run a real-time query themselves.
Speed Layer – Azure Stream Analytics
Azure Stream Analytics is a fully managed event-processing engine that lets you set up real-time analytic computations on streaming data. The data can come from devices, sensors, websites, social media feeds, applications, infrastructure systems, and more.
To examine the stream, we can create a Stream Analytics job that specifies where the data is coming from. The job also specifies a transformation: how to look for data, patterns, or relationships. For this task, Stream Analytics supports a SQL-like query language that lets us filter, sort, aggregate, and join streaming data over a time period. At the end of the process, the job specifies an output to send the transformed data to.
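A core idea of that SQL-like query language is grouping events into time windows. The sketch below mimics a tumbling-window count in plain Python; it is an illustrative model of the windowing concept, not the actual Stream Analytics engine, and the sensor events are made up.

```python
# Illustrative tumbling-window aggregation, mimicking what a
# Stream Analytics query in the spirit of
#   SELECT sensor, COUNT(*) FROM input
#   GROUP BY sensor, TumblingWindow(second, 10)
# does: count events per sensor in fixed, non-overlapping 10-second windows.

from collections import Counter

events = [  # (timestamp_seconds, sensor_id)
    (1, "s1"), (4, "s1"), (7, "s2"),
    (12, "s1"), (15, "s2"), (18, "s2"),
]

WINDOW = 10  # window length in seconds

# Integer-dividing the timestamp by the window length assigns each
# event to exactly one window.
counts = Counter((ts // WINDOW, sensor) for ts, sensor in events)

for (window, sensor), n in sorted(counts.items()):
    print(f"window {window}: {sensor} -> {n} events")
```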
Caption: example of stream analytic job on Azure
Speed Layer – Azure HDInsight
Another important Azure service for the speed layer of your big data project is Azure HDInsight. Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP), an enterprise-ready open-source Apache Hadoop distribution based on a centralized architecture (YARN). HDP addresses the needs of data at rest, powers real-time customer applications, and delivers robust big data analytics that accelerate decision making and innovation.
Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters of computers. The Hadoop technology stack includes related software and utilities, including Apache Hive, HBase, Spark, Kafka, and many others.
On Azure HDInsight, Hadoop is also a cluster type that has three core components:
- YARN for job scheduling and resource management
- MapReduce, a parallel processing programming model originally introduced by Google
- The Hadoop Distributed File System (HDFS), a distributed file system designed to support parallel processing
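The MapReduce model is easiest to see with the classic word count, sketched here in plain Python. This is a single-process toy; a real Hadoop job distributes the map and reduce phases across the cluster and shuffles intermediate pairs over the network.

```python
# Toy word count in the MapReduce style:
# map emits (word, 1) pairs, shuffle groups pairs by key,
# reduce sums the counts for each key.

from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values per key."""
    return {word: sum(ones) for word, ones in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "The fox"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result["the"])  # 3
print(result["fox"])  # 2
```

Because the map and reduce functions are independent per record and per key, the framework can run them in parallel on many nodes, which is the whole point of the model.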
Hadoop clusters are most often used for batch processing of stored data. Other kinds of clusters in HDInsight have additional capabilities; for example, Spark has grown in popularity because of its faster, in-memory processing.
HDInsight goes further, providing more specific cluster types and cluster customization capabilities, such as Spark, Kafka, Interactive Query, HBase, customized clusters, and other cluster types. For more information about these cluster types, please refer to Cluster types in HDInsight in the Microsoft Azure documentation.
There are two ways to configure clusters:
- Domain-joined clusters (preview): A cluster joined to an Active Directory domain so that you can control access and provide governance for data.
- Custom clusters with script actions: Clusters with scripts that run during provisioning and install additional components.
Caption: HDInsight cluster configuration
Cold Path – Batch Layer
The batch layer is composed of the master dataset (an immutable, append-only set of raw data) and the ability to pre-compute batch views of the data that are pushed into the serving layer.
Batch Layer: Azure Data Lake Store & Azure Data Lake Analytics
Azure Data Lake Store is an enterprise-wide hyper-scale repository for big data analytics workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.
Azure Data Lake Store can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs. It is specifically designed to enable analytics on the stored data and is tuned for performance for data analytics scenarios.
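As a sketch of what "WebHDFS-compatible" means in practice, a read request against a Data Lake Store account could be built as below. The account name and file path are placeholders, and the exact endpoint shape and authentication flow should be checked against the current Azure documentation.

```python
# Illustrative: build a WebHDFS-style OPEN (read) request URL for an
# Azure Data Lake Store account. The account name and file path are
# placeholders; verify the endpoint against the Azure docs.

def webhdfs_open_url(account: str, path: str) -> str:
    """Return a WebHDFS-compatible URL to read a file from Data Lake Store."""
    return (f"https://{account}.azuredatalakestore.net"
            f"/webhdfs/v1/{path.lstrip('/')}?op=OPEN")

url = webhdfs_open_url("myadlsaccount", "/clickstream/2018/01/events.csv")
print(url)

# The actual call would attach an OAuth bearer token, along the lines of:
#   requests.get(url, headers={"Authorization": "Bearer <token>"})
```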
To learn more about processing big data with Data Lake Store, please refer to Using Azure Data Lake Store for big data requirements.
Alongside Azure Data Lake Store there is Azure Data Lake Analytics, an on-demand analytics job service that simplifies big data analytics. Instead of deploying, configuring, and tuning hardware, users write queries to transform data and extract valuable insights. The service can instantly handle jobs of any scale; we simply set the dial for how much power we need.
Caption: Data Lake Analytics Account creation
Batch Layer : Data Factory
As we said, the batch layer precomputes results using a distributed processing system that can handle very large quantities of data. One important service here is Data Factory, a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:
Caption: steps of pipelines in data factory ( image from Microsoft Azure documentation )
Be careful: Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores, and to process that data using compute services in other regions or in an on-premises environment. It also lets you monitor and manage workflows using both programmatic and UI mechanisms.
Caption: How does Azure data factory work
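To make the idea of a data-driven workflow concrete, a pipeline can be thought of as a declarative definition of activities over datasets. The sketch below is a hypothetical, heavily simplified structure inspired by Data Factory pipelines; the pipeline, activity, and dataset names are all made up for illustration.

```python
# Hypothetical, simplified pipeline definition in the spirit of a
# Data Factory pipeline: a single copy activity moving data from a
# source dataset to a sink dataset. All names are illustrative only.

import json

pipeline = {
    "name": "CopyClickstreamToDataLake",
    "activities": [
        {
            "name": "CopyFromBlobToDataLake",
            "type": "Copy",
            "inputs": ["BlobClickstreamDataset"],     # source dataset
            "outputs": ["DataLakeClickstreamDataset"] # sink dataset
        }
    ],
}

# The service would validate, schedule, and execute this definition;
# here we only print it to show the declarative shape.
print(json.dumps(pipeline, indent=2))
```

Note how the definition describes *what* to move and *where*, not *how*: the data itself never passes through the definition, which matches the point above that Data Factory orchestrates movement without storing data.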
For a real-life example of using Azure Data Factory, please refer to the post What is Azure Data Factory: Data migration on the Azure Cloud.
Currently, Data Factory version 2 is in preview. To learn more about how version 1 compares with version 2, please refer to Compare Azure Data Factory V1 and V2 in the Azure documentation, our best friend.
Serving Layer: Visualisation with Power BI
As we said, the serving layer merges results from the speed layer and the batch layer, and answers ad-hoc queries by returning precomputed views or building views from the processed data. If we break the evolution of analytics into understandable stages and pair each stage with a question to be answered, we get: what happened, why did it happen, what will happen, and how can we make it happen? This is what the analytics maturity model below explains, from "descriptive" to "prescriptive":
Caption: Data Analytics Maturity Model
At this stage, we can connect Power BI to analyse and visualize the data, so that decision makers can make the right decisions supported by these analytics.
Caption: Data visualisation via Power BI
We have introduced general big data real-time and batch processing using the Lambda architecture, mapping it onto the different Azure services designed by Microsoft.
We can also integrate many different services and tools, such as Redis, Hive, Hadoop and Azure Service Bus, into a single data processing solution, using a range of languages such as Python, Node.js and C# where each fits best. Mastering the general big data scenario on Microsoft Azure helps us integrate different big data frameworks effectively, especially when working with the Hadoop ecosystem and machine learning solutions.
Data is our treasure, like oil in the last age; how to turn data into insights that help our business grow is a great topic to think about and to turn into action.