My Big Data solution using AWS services…

Goal: A global advertising agency that manages marketing for customers across Asia, Europe, and the US required a Big Data platform. The company's data analysts needed a solution to run their models and reports, with the development effort handled by their own IT team. The company is looking […]

Reference architecture of a Big Data solution in GCP and Azure…

This article showcases a reference architecture approach for the financial sector, where stream and batch processing is a common part of the solution alongside other designs. Requirement analysis is the first step in defining the implementation of any use case, so before moving to the reference architecture we first need to understand […]

Sample Dataflow Pipeline featuring Cloud Pub/Sub, Dataflow, and BigQuery…

Streaming data in Google Cloud Platform is typically published to Cloud Pub/Sub, a serverless real-time messaging service. Cloud Pub/Sub provides reliable delivery and can scale to more than a million messages per second. It stores copies of messages in multiple zones to provide "at least once" guaranteed delivery to subscribers, and there can be many […]
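The "at least once" guarantee mentioned above means a subscriber can see the same message more than once: anything not yet acknowledged gets redelivered. A toy in-memory sketch of that behaviour (all names here are hypothetical, not the real Pub/Sub client API) shows why subscribers must be idempotent:

```python
class TinyTopic:
    """Toy stand-in for a Pub/Sub topic with at-least-once delivery:
    a message keeps being delivered until it is acknowledged."""

    def __init__(self):
        self._unacked = {}   # msg_id -> payload
        self._next_id = 0

    def publish(self, payload):
        self._unacked[self._next_id] = payload
        self._next_id += 1

    def pull(self):
        # Every still-unacknowledged message is (re)delivered, so a
        # subscriber that crashed before acking will see duplicates.
        return list(self._unacked.items())

    def ack(self, msg_id):
        self._unacked.pop(msg_id, None)


topic = TinyTopic()
topic.publish("click:home")
topic.publish("click:cart")

first = topic.pull()       # both messages delivered
topic.ack(first[0][0])     # acknowledge only the first one
second = topic.pull()      # the unacked message comes back again
```

The real service behaves similarly from the subscriber's point of view, which is why downstream sinks usually deduplicate on a message or event ID.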

How to create an Apache Beam data pipeline and deploy it using Cloud Dataflow in Java

Cloud Dataflow is a fully managed Google service for executing data processing pipelines built with Apache Beam. What does fully managed mean? Like BigQuery, Cloud Dataflow dynamically provisions the optimal quantity and type of resources (i.e., CPU or memory instances) based on the volume and specific resource requirements of your job. Cloud Dataflow is a serverless […]

Google Dataflow Python ValueError: Unable to get the Filesystem for path gs://myprojetc/digport/ports.csv.gz

I am using Google Cloud to build an event pipeline from Cloud Storage to BigQuery with the Apache Beam Python library. I executed the ETL in "DirectRunner" mode and found no issue, but later, when I moved everything to Dataflow, I hit an error. The command below was used to upload the file, and I […]

Python: Stream data ingestion into a database in real time using Dataflow.

In my previous articles, we solved real-time data ingestion problems using various tools such as Apache Kafka, Storm, Flink, and Spark, and I showed in detail how to create such pipelines for real-time processing. In this blog, we will simulate a similar problem using Apache Beam and Dataflow with Python. Let's say […]
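To give a taste of the kind of per-element transform such a streaming pipeline applies, here is a plain-Python sketch of a parse step (the field names are made up for illustration; in a Beam pipeline this is the sort of function you would hand to `beam.Map` before writing rows to the sink):

```python
import json
from datetime import datetime, timezone

def parse_event(raw: bytes) -> dict:
    """Turn one raw streamed message into a database-ready row.
    Plain Python so it can be unit-tested outside any runner."""
    event = json.loads(raw.decode("utf-8"))
    return {
        "user_id": event["user_id"],
        "action": event.get("action", "unknown"),
        # normalise the epoch timestamp to UTC ISO-8601 for the sink
        "ts": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
    }

row = parse_event(b'{"user_id": 42, "ts": 0, "action": "login"}')
```

Keeping the transform as a pure function like this makes the same logic reusable between the "DirectRunner" test mode and the managed Dataflow runner.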

Content Data Store (CDS) compression and enhancement technique…

We are aggressively adding new features to the Content Data Store (CDS) system. One of the features I am going to discuss here is the compression technique (a Big Data application is incomplete without compression). And what if I tell you that in CDS we use compression along with enhancement of visual images/scanned documents? Our compression technique has two additional features: Smaller: […]

JRuby code to purge data in an HBase table underlying a Hive table…

Problem to solve: how to delete/update/query binary-format values stored in an HBase column-family column. For a Hive-over-HBase table, where we can't use the standard API and are unable to apply filters on binary values, you can use the solution below for programmability. Find the JRuby source code at the GitHub location. This program, written in JRuby, purges data using the HBase shell and deletes […]

About the Author: Mukesh Kumar

Welcome to my blog site!!! I am a Kubernetes, Big Data, and Hadoop expert. Bachelor of Technology (Computer Science) plus 15 years of proven expertise across multiple business domains. I hold Certificates of Achievement in Hadoop, Hive, and Pig from IBM, Cassandra from DataStax, and Data Science from edX. Apart from Kubernetes and Big Data as my full-time profession, I […]

LAMP stack in Cloud: Building a Scalable, Secure and Highly Available architecture using AWS

1. Requirement Overview: The acronym LAMP (Linux, Apache, MySQL, PHP) refers to an open-source stack used to serve dynamic and static web content. A small startup organization uses the LAMP stack of software. The dynamic nature of demand and projected future growth in traffic drive the need for a massively scalable solution to enable […]

Apache Spot, the open source community continuing the fight against cybercrime…

Apache Spot is an Apache community effort to fight cybercrime. Since Apache Spot was started earlier this year at Intel and Cloudera, the project's momentum has kept growing, with Anomali, Centrify, Cloudwick, Cybraics, eBay, Endgame, Jask, StreamSets, Webroot, and other partners lending unanimous support. It uses Apache Hadoop to achieve log management at virtually unlimited scale and […]

Almost Everything in Python!!!

A curated list of Python frameworks, libraries, software and resources. Inspired by awesome-php. Awesome Python topics: Environment Management, Package Management, Package Repositories, Distribution, Build Tools, Interactive Interpreter, Files, Date and Time, Text Processing, Specific Formats Processing, Natural Language Processing, Documentation, Configuration, Command-line Tools, Downloader, Imagery, OCR, Audio, Video, Geolocation, HTTP, Database, Database Drivers, ORM, Web Frameworks […]

Past and Future of Apache Kylin!!!

Short description: Apache Kylin (Chinese: Kirin) emerged to solve analytical query problems on Hadoop. Article: the origin of Apache Kylin. In today's era of big data, Hadoop has become the de facto standard, and a large number of tools have been built around the Hadoop platform, one after another, to address the needs of different scenarios. For example, […]

Introduction to Spark

Introduction to Apache Spark: Spark, as a unified stack and computational engine, is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines. Over time, big data experts around the world have derived specialized systems on top of Hadoop to solve certain problems such as graph processing, implementation of […]

Kubernetes Operations Console

We need a framework with which we can build and manage a Kubernetes cluster. When it comes to building a Kubernetes cluster, the Kos Console provides you with templates to choose from. When running a Kubernetes application (small, mid-size, or big), there are lots of things you need to know before migrating an application to Kubernetes, but if you are using […]

My intuition to understand eigenvalues and eigenvectors…

One of my biggest hurdles in learning linear algebra was building intuition for it. Eigenvalues and eigenvectors are one of those things that pop up in a million places because they're so useful, but to recognize where they may be useful you need intuition as to what they're doing. The eigenvectors are the "axes" […]
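To make the "axes" intuition concrete, here is a small pure-Python check on a hand-picked 2×2 symmetric matrix: along an eigenvector, multiplying by the matrix only stretches the vector (by the eigenvalue), it never rotates it.

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]

# Eigenvalues of a 2x2 matrix come from the characteristic
# polynomial: lambda^2 - trace*lambda + det = 0
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(trace * trace - 4.0 * det)
lam1, lam2 = (trace + disc) / 2.0, (trace - disc) / 2.0

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

v1 = [1.0, 1.0]    # eigenvector for lam1: A maps it to lam1 * v1
v2 = [1.0, -1.0]   # eigenvector for lam2: A maps it to lam2 * v2
assert matvec(A, v1) == [lam1 * v1[0], lam1 * v1[1]]
assert matvec(A, v2) == [lam2 * v2[0], lam2 * v2[1]]
```

Any other vector gets both stretched and turned by A; only these two directions are preserved, which is exactly the "axes" picture.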

Apache Eagle: Real-time security monitoring solution

On January 10, 2017, the Apache Software Foundation, whose more than 350 open source projects and initiatives are all developed and governed by volunteers, announced that Apache Eagle had graduated from the Apache Incubator program. Eagle originated at eBay, initially to solve large-scale Hadoop cluster monitoring issues. The team […]

Hbase Administration using HBaseFsck (hbck) and other tools…

HBaseFsck (hbck) is a tool for checking region consistency and table integrity problems and for repairing a corrupted HBase. Sometimes we need to run hbck at regular intervals because some inconsistencies can be transient (e.g., the cluster is starting up or a region is splitting). Operationally, you may want to run hbck regularly and set up alerts […]

We just need to be better to each other before talking to AI

I’m not going to talk about statistics, machine learning, or AI, nor even compare any data showing errors made by AI versus manual work. I believe that the fundamental problems of our time are ethical, not technological. If we can figure out that part, the technology should take care of itself. I would […]

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into definitions of EP, CEP, and CQRS, let us start with some basic database terminology and the problem we are trying to address here. We have commercial databases and database professionals who have publicized CRUD operations a lot. This one-row-per-pattern approach works well in most projects and is enough to build an […]
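In the event-sourced (ES) half of CQRS-ES, instead of updating a row in place as CRUD would, every change is appended as an immutable event, and the current state is rebuilt by replaying the log. A minimal Python sketch (the event names and the bank-account example are illustrative, not from any specific framework):

```python
# Append-only event log: state is never updated in place.
events = []

def append(event_type, **data):
    """Record a change as an immutable event rather than mutating a row."""
    events.append({"type": event_type, **data})

def replay(log):
    """Rebuild the current account balance by folding over the event log."""
    balance = 0
    for e in log:
        if e["type"] == "Deposited":
            balance += e["amount"]
        elif e["type"] == "Withdrawn":
            balance -= e["amount"]
    return balance

append("Deposited", amount=100)
append("Withdrawn", amount=30)
append("Deposited", amount=5)

balance = replay(events)   # derived from the log, not stored anywhere
```

The log doubles as a full audit trail, and separate read models (the "Q" in CQRS) can each replay the same events into whatever shape suits their queries.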