Tips and Tricks for Apache Spark RDD API, DataFrame API: Part 1

I am planning to share my knowledge of the Apache Spark RDD and DataFrame APIs, along with some tips and tricks. If I combine everything into one then it…

Better late than never: Time to replace your micro-service architecture with Kafka…

Kafka has already helped many organizations in the micro-services architecture world. If Kafka is still not part of your infrastructure, it's high time for you…

In-depth: Kafka message queue principles of high reliability

At present, many open source distributed processing systems, such as Cloudera, Apache Storm, Spark and others, support integration with Kafka. Kafka is increasingly being…

SolrCloud: in the CAP theorem world, Solr is a CP system that keeps availability in certain circumstances.

A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each…

Google Cloud Platform (GCP) overview

Google Cloud Platform (GCP) is a collection of SaaS, PaaS, and IaaS services, and new services are still being launched day by…

Consume JSON Messages From Kafka Using Kafka-Python’s Deserializer

Hopefully you are here because you want to take a ride with Python and Apache Kafka. Kafka-Python is the most popular Kafka library for Python. For…

Apache Eagle: Real-time security monitoring solution

On January 10, 2017, the Apache Software Foundation, which consists of more than 350 open source projects and innovation initiatives, all developed by volunteer, governance…

HBase Administration using HBaseFsck (hbck) and other tools…

HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase. Sometimes we need to run hbck…

Advertisement attributes or Ad Attributes…An Idea!!!

Some time ago I was working on an idea called Ad Attributes, or Advertisement Attributes. I’d like to share my thoughts on this idea with the audience.…

We just need to be better to each other before talking to AI

I’m not going to talk about statistics, machine learning, or AI, or even compare any database that shows errors made by AI work vs manual…

ZeroMQ Part-1

Programs, like people, need to communicate, and for that we have UDP, TCP, HTTP, IPX, WebSocket, and other protocols to connect related applications.…

Book Review: The Folly of Fools

The Folly of Fools: The Logic of Deceit and Self-Deception in Human Life. Finished reading this book and developed a friendship with the author, “Robert L. Trivers”…

My advice for working on a Zombie project and in a toxic workplace…

I would like to say that there’s a chance of salvaging the Zombie project, but there probably isn’t, and it’s not your fault. I have…

Converting PDF to Text using Tesseract…

Tesseract is unable to handle PDF files directly; therefore files are first converted to TIFF using Ghostscript before being passed to Tesseract. Tesseract does not…

Moving to communication of events between subsystems — CQRS-ES with open source…

Before going into the definitions of EP, CEP, and QSQS, let us start with some basic database terms and the problem we are trying to address…

SolrCloud vs HDPSearch…

Let us start by removing some confusion about SolrCloud and HDPSearch. First, what is SolrCloud? Apache Solr includes the ability to…

How To Use the Python Debugger – PDB

The Python debugger comes as part of the standard Python distribution as a module called pdb. The debugger is also extensible, and is defined as…

AWS and GCE are both great! More powerful load-balancing configuration puts GCE over the top…

I work with Hadoop, so I often come across, or management sometimes asks me, a common question: “Why do we need Hadoop in the cloud?” And to answer…

Benefits of Blogging!

Yes, blogging has many benefits. The first is money; I’m not earning money writing blogs right now, but many bloggers who make money take pleasure in it.…

PIR Sensor, a pyroelectric device…

After working on sensors with Arduino, I have decided to pass on my knowledge via blogs. I will start sharing a few projects that are already done and…

Apache Solr Search Installation on HDP2.6 using Yum Repo

As we know, “HDP 2.6” is not bundled with “HDP Search”, which includes Solr. Therefore, in this two-part article I am going…

pyshark, tshark and wireshark installation…

Python wrapper for tshark, allowing Python packet parsing using Wireshark dissectors. Installation (all platforms): We are going to use Python pip for installation if you…

Multiple WAL in Apache HBase 1.3 and performance enhancements!!!

Apache HBase 1.3.0 was released mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, like write-ahead log (WAL), and…

Apache Spot, the open source community to continue the fight against cybercrime…

Apache Spot rallies the Apache community to fight cybercrime. Since Apache Spot started at Intel and Cloudera earlier this year, the momentum of the…

‘Open source’ and ‘free software’

It’s my Materialist vs Idealist thinking going on here. If you do not find that it matches your reality, be patient with my arguments. First of…

Sumo Logic: Log Management Tool

This is my first face off with “Sumo Logic”. If you want a quick introduction on “Sumo Logic”, this topic will be helpful without going…

Residual Plots for Regression Analysis…

In my last article we discussed parameters for understanding the accuracy and predictions of a regression model, but I guess before…

Ordinary least squares regression (OLSR)

Ordinary least squares regression (OLSR): Invented in 1795 by Carl Friedrich Gauss, it is considered one of the earliest known general prediction methods. OLSR is…

ROC curve and performance parameters of a classification model…

When we evaluate a model, we analyze a few parameters to verify its performance. These parameters demonstrate the performance of our model using…

Content Data Store (CDS) compressing and enhancing technique…

We are aggressively adding new features to the Content Data Store (CDS) system. One of the features that I am going to discuss here is the compression technique (BigData…

Install and smoketest R and RHadoop on Hortonworks Data Platform (HDP25-CentOS7)

Before going to the installation steps, I’d like to give a small introduction to RHadoop. What is RHadoop? RHadoop is an open source project for combine…

OCR – “Optical Character Recognition”, Set up Tesseract OCR on CentOS 6.8…

OCR means “Optical Character Recognition”, and Tesseract is licensed under the Apache License v2.0. A system configured with Tesseract OCR is able to convert images with embedded…

Almost Everything in Python!!!

A curated list of Python frameworks, libraries, software and resources. Inspired by awesome-php. Awesome Python: Environment Management, Package Management, Package Repositories, Distribution, Build Tools, Interactive…

JRuby code to purge data on HBase over Hive table…

Problem to solve: how to delete/update/query values stored in binary format in an HBase column family column, in a Hive-over-HBase table where we can’t use the standard API and are unable to…

Python and Python bites

Python and Python bites: “lambda”. Hi everyone, this article shows you one powerful feature of the Python programming language called “lambda”. It can solve any small…

PG-Storm: Let PostgreSQL run faster on the GPU

The PostgreSQL extension PG-Storm allows users to customize the data scan and run queries faster. CPU-intensive workload is identified and transferred to the GPU…

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in the evolution of the Internet. The cloud in cloud computing provides the means through which everything, from computing…

Past and Future of Apache Kylin!!!

Short Description: Apache Kylin (Chinese: Kirin) has appeared and can solve the problems based on Hadoop. Article: Apache Kylin origin. In today’s era of big data, Hadoop…

Heterogeneous Storage in HDFS (Part-1)…

An introduction to heterogeneous storage types and the flexible configuration of heterogeneous storage! Heterogeneous Storage in HDFS: Hadoop version 2.6.0 introduced a new feature, heterogeneous…

A Step-by-Step Guide to HDFS Data Protection Solution for Your Organization on Cloudera CDH

An enterprise-ready encryption solution should provide the following: a comprehensive encryption offering for data wherever it resides, including structured and unstructured data at rest and data in motion.…

Apache Shiro design is intuitive and a simple way to ensure the safety of the application…

Short Description: Apache Shiro’s design goals are to simplify application security by being intuitive and easy to use… Article Apache Shiro design is intuitive and…

Performance utilities in Hive

Before taking you into the details of the utilities provided by Hive, let me explain a few components to understand the execution flow and where the related information is stored…

Best Practices for Hive Authorization when using connector to HiveServer2

Recently we have been working with Presto and configuring the Hive connector for it. It connected successfully using the steps given at prestodb.io/docs/current/connector/hive.html. An…

Tephra is an open-source project that adds complete transaction support to Apache HBase…

Transaction support in HBase? Yes, a wide range of use cases require transaction support. Firstly, we want the client to have great insight and fine-grained…

HPL/SQL Make SQL-on-Hadoop More Dynamic

Think about the old days when we solved many business problems using Dynamic SQL, exception handling, flow-of-control, iterations. Now when I worked with a couple of…

Coding Tips and Best Practice in Hive and Oozie…

Many times during code review I have found common mistakes made by developers. Here are a few of them… Workflow mandatory item: Add this…

Out of the Box (Why Women Live Longer than Men)

The fact is men enjoy life more, but in the end the winners are women because they always get extra bits of years (these bits are sometimes in…

HDFS is really not designed for many small files!!!

A few of my friends who are new to Hadoop frequently ask what a good file size is for Hadoop and how to decide on the file size. Obviously it…

HBase Replication and comparison with popular online backup programs…

Short Description: HBase Replication: the HBase Replication solution can address cluster security, data security, read/write separation, and operation. Article: This article is…

Introduction to Spark

Introduction to Apache Spark: Spark, as a unified stack and computational engine, is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks…