My Big Data solution using AWS services…


A global advertising agency that manages marketing for different customers in Asia, Europe and US required the solution on development of a Big Data platform. The company data analysts required a Big Data solution to run their models, reports and development effort can be handled by their own IT Team. The company is looking for recommendations on how to setup the Big Data Platform that will allow them to analyse trends and patterns over time across different clients. They would need a presentation layer to provide reporting capabilities to individual clients on only their specific data. Therefore, the main goal of this document is proposing a solution for IT team, Analyst and other stake holder so that it can be managed flexible, elastic, fault-tolerant, cost-efficient, scalable, secure and high-performance Big Data platform.

1.1 Requirement Overview

Below is the short diagram of interaction between the components of user requirement. There are various files and Twitter api feed required to ingest in cloud data store and cloud services would be used to process and store this information. Visualization would be a cloud service with real-time view of data and aggregated results stored into the database using AWS services.

1.2 Quality goals

Moving to AWS cloud could bring scalability (Horizontal and vertical), elasticity based on consumer demand, efficiency, security, simplicity, on-demand network access to a shared pool, ease-of-use, ease-of-deployment.

1.3 Stakeholder

2. Assumptions

Few assumptions are made to present the solution such as each customer load individual impression data into S3 bucket and which is in the form of star schema means we have structural dataset for each customer present in S3(Say landing zone). These files filter a lot of information and resultant dataset is small therefore not proposing any Datawarehouse solution such as Redshift. Explanation on hardware and infrastructure decisions (for example, size of EC2 and version/choice of programming language) is not shown in this document.

3. Proposed Architecture

Below two figures are the representation of logical and physical architecture of proposed solution.

3.1 Logical Architecture

3.2 Physical Architecture

4. Solution Strategy and Building Blocks

Below is a short summary and explanation of the fundamental decisions and solution strategies, that shape the system’s architecture.

4.1 Data sources

1. Twitter APIs: This is Twitter enterprise APIs view endpoint that receive a stream of tweets.

2. Google’s Impression double click data files after conversion to Star schema.

3. User csv files.

4.2 Analysis Types and Analysis Mode

The solution I have proposed is a BigData analytic solution to a prospective client domain is Ad Targeting & Internet advertising. Before going into the proposed solution let me explain what is an analytic solution client wants to implement.

The solutions are based on Lambda Architecture using the AWS cloud Big Data offerings. The choice of technologies and frameworks are based on what I understand with requirement document is that we are trying to do descriptive analytics by analyzing past impression clickstream data. The statistical model on this data that the user wants to apply is not described in the client’s requirement leaving me to suggest a reporting service that supports such functionality. Predictive analytics and prescriptive analytics can be suggestive when I collect more information about the data sources and therefore I am leaving the data collection of the service layer for user experience and modeling.

Let me get few minutes of your time to understand the business.

Search and display advertisements are the two most widely used approaches for Internet advertising. In search advertising, users have displayed advertisements (“ads”), along with the search results, as they search for specific keywords on a search engine. Advertisers can create ads using the advertising networks provided by search engines or social media networks. These ads are set up for specific keywords that are related to the product or service being advertised. Users searching for these keywords are shown ads along with the search results. Display advertising is another form of Internet advertising, in which the ads are displayed within websites, videos, and mobile applications that participate in the advertising network. Display ads can either be text-based or image ads. The ad-network matches these ads against the content on the website, video, or mobile application and places the ads. The most commonly used compensation method for Internet ads is Pay-per-click (PPC), in which the advertisers pay each time a user clicks on an advertisement. Advertising networks use big data systems for matching and placing advertisements and generating advertisement statistics reports.

So In advertizement industry have one problem then dont have the insight on what works for them and what does not by spending money to advertize their banner on websites. This will remain challenge until they don’t have the meaningful insight on data they collect from the target systems.

I guess this is the reason the Clients think they can use big data tools for tracking the performance of advertisements, optimizing the bids for pay-per-click advertising, tracking with keywords link the most to the advertising landing pages, and optimizing budget allocation to various advertisement campaigns.

4.2 Data Ingestion

1. User load double click data to a pre-defined s3 bucket into different landing zones (S3 buckets) for the customer data.

2. Twitter API ingestion at amazon EKS cluster where Kafka cluster is running as docker container in Amazon EKS cluster worker node.

3. User csv files ingested at S3.

Now lets move to bit by bit on the proposed solution starting from data collection.

4.3 Data Collection

Data collection is the first step for any analytics application. Before the data can be analyzed, the data must be collected and ingested into a big data stack. The choice of tools and frameworks for data collection depends on the source of data and the type of data being ingested. We have three data sources only, named clean stream, CSV files, and twitter API feed. But as per the requirement document which suggests that clickstream data loaded by the customer on S3 predefined location. So ingestion services are S3, API ingestion using docker container service running on EKS.

Twitter API streaming the data using Kafka connector.

The Data Access Connectors includes tools and frameworks for collecting and ingesting data from various sources into the big data storage and analytics frameworks.

Publish-Subscribe is a communication model that involves publishers, brokers, and consumers. Publishers are the source of data. Publishers send the data to the topics which are managed by the broker. Publish-subscribe messaging frameworks such as Apache Kafka.

4.4 Data Preparation

Clickstream Data: Clickstream data generated by web applications which can be used to analyze browsing patterns of the users.

Social Media: Data generated by social media platforms for the first data source and clickstream data, so that the data is in start schema that means a data warehouse was a sort of source system for the files loaded by the customer and with a well-defined schema.

So if any data simple transformation is required can be done by using AWS Glue and to discovers the schema of your data. AWS Glue has wizard-like to generate simple scripts and create jobs to load data anywhere in AWS services like a database. Aws Glue useful in case data is of dirty shape and can have various issues that must be resolved before the data can be processed, such as corrupt records, missing values, duplicates, inconsistent abbreviations, inconsistent units, typos, incorrect spellings, and incorrect formatting. Data preparation step involves various tasks such as data cleansing, data wrangling or munging, de-duplication, normalization, sampling, and filtering. Data cleaning detects and resolves issues such as corrupt records, records with missing values, records with bad formatting, for instance. Data wrangling or munging deals with transforming the data from one raw format to another.

Now Twitter data we are using here Kafka cluster which is installed on EKS Elastic Kubernetes services. It is a container-based cluster that is running using Operator. As we know the Kubernetes is good for stateless applications such as Web App where it loads the information about the design of web page content and passes it to application. But when we required the stateful services to be up and running on Kubernetes we found it challenging to achieve. Therefore Kafka instead of running on Kubernetes as containers I am offering to run it using Operator where we can have better control over the state of Kafka cluster by defining custom APIs to interact with Kubernetes. Operators make it easy to configure and manage the Kafka Cluster on Kubernetes.

4.5 Kafka Operator Features

Fixed deployment of a Cluster, Services, and PersistentVolumes

Upscaling of Cluster (eg adding a Broker)

Downscaling a Cluster, without data loss (removes partition of broker first, under development)

4.6 Processing Flow

Simple transformation is done at Athena where data is aggregated and sent to Amazon QuickSight. Athena charges you an amount per TB of data scanned, charged to a minimum of $10.

Spark processing the Kafka streams and send the result to store for resistance store HBase.

Hbase is further processing the incoming data for meaningful indexing.

4.7 Data Storage

Database storage for analytics solution is Amazon RDS where aggregated and filtered output processed by Athena and Spark are stored.

Twitter Feeds are stored in Amazon Managed Apache HBase on Amazon EMR (Amazon S3 Storage Mode). Amazon EMR enables you to use Amazon S3 as a data store for Apache HBase using the EMR File System and offers the following benefits:

• Separation of compute from storage— You can size your Amazon EMR cluster for compute instead of data requirements, allowing you to avoid the need for the customary 3x replication in HDFS.

• Transient clusters—You can scale compute nodes without impacting your underlying storage and terminate your cluster to save costs and quickly restore it.

• Built-in availability and durability—You get the availability and durability of Amazon S3 storage by default.

• Easy to provision read replicas—You can create and configure a read-replica cluster in another Amazon EC2 Availability Zone that provides read-only access to the same data as the primary cluster, ensuring uninterrupted access to your data even if the primary cluster becomes unavailable.

4.8 Indexing the twitter data:

Amazon ElasticSearch can automatically index for ranking, sorting and index new arrival row.

5. Visualization Analytics architecture topology

Amazon QuickSight is reporting tool connected to provide visualization on batch data as well as Real time feed data.

Visualization can be applied on top of data collected at S3 storage as well as Read Only RDS database slaves. High performance, throughput and effective load management is achieved using multiple copies of RDS slave database separated by Read and Write requirement of each instance. Amazon Athena working as ELT(not ETL) where it filter the start schema tables and store them into RDS.

6. Security

Security at user access is achieved using https connection and within the AWS datacenter by creating Amazon VPC. VPC is a virtual network and can be created separately for application layer and database layer. 

 Disaster recovery can be addressed by choosing different availability zones for any database resource. Along with that multiple replica’s volume (EBS) attached to database servers.

User statistics and notifications can be achieved using AWS alert monitoring tool and Cloud Watch.

Docker container image are safe because its running in Amazon managed Kubernetes cluster called as EKS.

Archive process is used to archive historical data to S3 bucket at cold storage.

Entire architecture can be used to create template which can later work as prototype for another application.

7. Runtime View

The proposed solution following the Lambda architecture processing technique and is capable to handle huge amount of data in an efficient manner.

7.1 Batch Layer

The batch layer is responsible for managing impression data and csv files. Amazon Athena and Spark engine perform to aggregate the batch output and store them into database.

7.2 Speed Layer

The speed layer is used to provide results in a low-latency, near real-time fashion. The speed layer receives the arriving data such as Twitter feed and performs filter/incremental updates using Apache Spark running on Amazon EMR to the batch layer.

7.3 Serving layer

The Amazon Elastic Search, Hbase running on EMR and RDS are as serving layer enables various queries of the consolidated result sent from both the batch and speed layers.

8. Cost Flexibility

S3 as ingestion data store which is cheap compared to other data store. Amazon EMR also using S3 as data store for Apache Beam using EMR file system along with processing of any data using Apache Spark.

Cost can be further reduced if we use auto scaling ON for EKS and whenever we have demand for fewer resources, the cluster downgrade automatically.

We are using Athena as there is no cluster management is required and pay for only how much data we process.

Leave a Reply

Your email address will not be published. Required fields are marked *