10 Aug

Performance evaluation of different Druid roll-up levels

Introduction

In most datasets with a large number of events, going through individual events is less important; most data use cases revolve around summarized data. Druid summarizes the raw data at ingestion time using a process referred to as “roll-up”. The roll-up granularity becomes the finest granularity of the stored data, so you will only be able to query down to that granularity. There are, however, scenarios where it is important to have more granular data, and keeping more granular data comes at a cost. We did a small experiment to identify how different roll-up levels affect performance.

“Rolling up data can dramatically reduce the size of data that needs to be stored (up to a factor of 100). Druid will roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This storage reduction does come at a cost; as we roll up data, we lose the ability to query individual events. Phrased another way, the rollup granularity is the minimum granularity you will be able to explore data at and events are floored to this granularity. Hence, Druid ingestion specs define this granularity as the queryGranularity of the data. The lowest supported queryGranularity is millisecond.”

~ http://druid.io
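For reference, the roll-up level is declared in the granularitySpec of a Druid ingestion spec. Below is a minimal sketch of that section, written as a Python dict for readability; the field names follow the Druid documentation, but the values (and the "rollup" flag, which only newer Druid versions expose) are purely illustrative.

```python
# Minimal sketch of the granularity section of a Druid ingestion spec.
# Only queryGranularity sets the roll-up level; events are floored to it
# at ingestion time, so no finer granularity can be queried afterwards.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # how data is chunked into segments
    "queryGranularity": "HOUR",    # the roll-up (minimum query) granularity
    "rollup": True,                # newer versions allow disabling roll-up
}
```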

Dataset and Setup

We chose a CSV dataset of more than 150 million records containing sales data spanning 2 years. The CSV file was around 6 GB on disk. This is a narrow dataset with 3 dimensions and 2 metrics. We used 2 servers across which all the components were deployed:

m4.large – Coordinator, Broker and Overlord nodes
r3.large – MiddleManager and Historical nodes

Read More

25 Sep

Application Aware Sharding for a MySQL Database


Introduction

When a MySQL database gets too large, it becomes extremely hard to maintain and eventually hits physical limits. The maintainability issues we normally came across were:

  1. ALTERing a table takes too much time
  2. Taking dumps becomes impractical
  3. Tables have to be moved to separate disks from time to time
  4. Etc.

If you are looking at database sharding as a solution, these maintainability issues have probably already exhausted you. In this article, I am going to share my personal experience of sharding a large MySQL database in real life. Having worked hands-on through this large project, I am aware of many of the sharding-related and organization-level issues, concerns and limitations you will face when going ahead with such a project. My aim is to go through the project from end to end so that anyone about to do something similar can benefit from this knowledge.
Before getting into the article, I must share several things with you.

  1. After this project, do not expect any performance gain on your day-to-day queries. It might actually add overhead to your queries, as your data will be scattered across multiple places. However, it will greatly improve maintainability and will make sure that your data architecture survives in the long run.
  2. What I am going to talk about is application-aware sharding. There are a few other ways to do sharding in the data layer that are seamless to the applications; however, at the time of writing I could not find any tool that supports MySQL sharding at the database level. The existing options are either not reliable or add a large overhead to queries.
  3. Sharding itself might not be the full answer to the problem. Sharding might need to be accompanied by several sub-projects that supplement it so it can reach its potential. This is further explained in the next section (“Problem and Solution”). I will concentrate on the sharding project and will not explain much about the other supplementary projects.
  4. You must have a real need for this project, and you must have identified it as the only way to solve your problem before proceeding, because what you are about to start is a big project.
  5. I have skipped some sensitive information as it relates to the company I am working for.

Problem and Solution

Problem: Data in our MySQL database was growing at a rapid rate, which led to many maintainability issues. As the amount of data grew and the number of queries per second increased, there was even a possibility of downtime. We needed to make the data volume in MySQL roughly constant and keep it at a maintainable level.
Solution: We decided to partition the data and, at the same time, purge old data so that the size of the data in any one database stays roughly constant. Two supplementary projects were therefore carried out alongside the sharding project: one to purge old data (the purging project) and one for master data management (the MDM project), which moved a selected set of data to a central location. Data that could not be sharded for various reasons, and data that was global to the company, was moved out in the MDM project.

[Figure: sharding architecture for the application-aware MySQL sharding project]
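To make application-aware sharding concrete, here is a minimal sketch of how the application layer might route queries to shards. The shard map, the choice of customer_id as the shard key, and the CRC32-modulo routing are illustrative assumptions, not the exact scheme used in our project.

```python
import zlib

# Illustrative shard map: logical shard id -> MySQL connection settings.
SHARDS = {
    0: {"host": "mysql-shard-0", "db": "sales_shard_0"},
    1: {"host": "mysql-shard-1", "db": "sales_shard_1"},
    2: {"host": "mysql-shard-2", "db": "sales_shard_2"},
    3: {"host": "mysql-shard-3", "db": "sales_shard_3"},
}

def shard_for(customer_id: int) -> dict:
    # Hash the shard key so that all rows for one customer always land on
    # the same shard; the application routes every query through this lookup.
    shard_id = zlib.crc32(str(customer_id).encode("utf-8")) % len(SHARDS)
    return SHARDS[shard_id]

# Example: decide which shard holds data for customer 41507.
print(shard_for(41507))
```

Because the routing lives in the application, cross-shard queries and any future re-sharding have to be handled explicitly, which is part of the overhead mentioned in point 1 above.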

Read More

04 Apr

Redis Sorted Object Set (Sorted Hashes)


We all know that the best way to store objects in Redis is through Redis hashes, and that Redis sorted sets keep a set of values ordered by a given score. What if we need a sorted set of objects? What we need is a “sorted set of objects”, which Redis does not support by default. We came up with a simple hybrid data structure that allows sorted objects, built from Redis hashes and Redis sorted sets.
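Roughly, the idea can be sketched like this: each object lives in its own Redis hash, and a sorted set indexes the hash keys by score, so a range query over the sorted set returns the objects in order. The key names, fields and the use of price as the score are illustrative, and the sketch assumes redis-py 3.x or newer.

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_product(product_id: str, fields: dict, score: float) -> None:
    # Store the object itself in a hash...
    r.hset(f"product:{product_id}", mapping=fields)
    # ...and index its key in a sorted set under the given score.
    r.zadd("products:by_price", {f"product:{product_id}": score})

def products_sorted(start: int = 0, stop: int = -1) -> list:
    # Walk the sorted set in score order, then fetch each object from its hash.
    return [r.hgetall(key) for key in r.zrange("products:by_price", start, stop)]

add_product("1001", {"name": "keyboard", "price": "49.90"}, 49.90)
add_product("1002", {"name": "mouse", "price": "19.90"}, 19.90)
print(products_sorted())  # objects ordered by price, cheapest first
```

The main bookkeeping this hybrid adds is on deletion: removing an object means deleting both its hash and its member in the sorted set.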

Read More

19 Jul

[Scrum] Adding new priorities to the current sprint


We see changes in priorities from time to time. Depending on the nature of the team, you might need to expect new priorities even in the middle of a sprint. A new priority might come in as a production issue or as a new feature with a higher ROI. I work in a service team where we get high-priority items all the time. In this article, I would like to talk about how we handle them.

What is the ideal way to handle additional requirements [the real Scrum way]

Once a sprint begins, you cannot add requirements or stories to the sprint backlog

It is as simple as that: you cannot add additional requirements to the sprint once the sprint begins. In addition, keep in mind that if you break even one rule of Scrum, you cannot claim to be following Scrum. Therefore, if you really need to practice pure Scrum, you need to close the current sprint, re-prioritize the backlog and start a new sprint with the new priority list.

The ideal way doesn’t seem to be efficient for us

As you might already know, there are many hybrid versions of Scrum, because Scrum turns out to be inefficient in some situations. In our case, some team members receive a lot of ad-hoc tasks (related to database administration), and being a service team, every member gets some amount of ad-hoc work. We decided to fork the original Scrum and make our own version of it, in search of better efficiency.

Read More

13 Jul

Setting up basic authentication for DBfit with Nginx


First, you need to install Nginx on the server. I performed this on an Ubuntu server, so the following commands need to be adapted to your OS if you are not an Ubuntu user. This setup has two main parts: first we will install Nginx, and then we will move on to the basic authentication setup for DBfit with Nginx.

Setting up Nginx (extremely straightforward)

This should install Nginx on the Ubuntu machine, but unluckily I got the following error. [If you got it installed and started correctly without any errors, you can skip this section.]

This happened because of the default site file in the sites-available folder. You can simply remove the default symbolic link in the sites-enabled folder and start Nginx again. This time it started without any issue.

Read More

31 May

Continuous data loading from MySQL to Hive [Using Tungsten]


Introduction [Skip If needed]

What is Hive

“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”

~ Hive official site

Why Hive

Hive is used by many major companies, including Facebook and Google. Hive excels at processing large amounts of data, and any use case that involves querying large amounts of data in near real time can benefit from Hive. Hive is strong on availability, scalability and manageability. Hive now has the capability to store complicated schemas and supports advanced operations like table alteration. If you are stuck with a MySQL warehouse and want to get your data into Hive for complex querying, it will not be a hard job to replicate the same RDBMS schema in Hive. The following are some real-world use cases of Hive:

  • Bizo: We use Hive for reporting and ad hoc queries.
  • Chitika: for data mining and analysis
  • CNET: for data mining, log analysis and ad hoc queries
  • Digg: data mining, log analysis, R&D, reporting/analytics
  • Grooveshark: user analytics, dataset cleaning, machine learning R&D.
  • Hi5: analytics, machine learning, social graph analysis.
  • HubSpot: to serve near real-time web analytics.
  • Last.fm: for various ad hoc queries.
  • Trending Topics: for log data normalization and building sample data sets for trend detection R&D.
  • VideoEgg: analyze all the usage data

I hope you have enough motivation by now; let us get down to business.

Process

[Figure: the continuous MySQL-to-Hive loading process]

  1. Use Tungsten to extract the MySQL binlogs as CSV files
  2. Use the DDLScan tool to create the Hive staging and base table structures
  3. Use a map-reduce job to load the base tables from the staging tables (see the sketch after this list)
  4. [Optional] Use the bc tool to compare the MySQL and Hive tables
  5. Set up continuous loading
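As a rough illustration of step 3 (which classic Hive executes as a map-reduce job under the hood), here is a sketch of merging a staging table into its base table with HiveQL issued through PyHive. The host, table and column names, including the opcode and sequence-number columns, are illustrative assumptions rather than the exact schema the Tungsten tools generate.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.connect(host="hive-server", port=10000, database="warehouse")
cur = conn.cursor()

# Keep only the newest staged version of each row (by sequence number) and
# drop rows whose last operation was a delete, then rewrite the base table.
cur.execute("""
    INSERT OVERWRITE TABLE orders_base
    SELECT id, customer_id, amount, updated_at
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY seqno DESC) AS rn
        FROM orders_staging s
    ) latest
    WHERE rn = 1 AND opcode != 'D'
""")
```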

Read More

08 Apr

Reading and Understanding the Storm UI [Storm UI explained]


I tried to find a document that explains every parameter in the Storm UI, but I did not find any, so I thought of starting one of my own. I will start with whatever information I have so far and will try to make it as complete as possible. Thanks to all the forums and sites that helped me find this information. Note that this is a living document and you can suggest edits through comments, as it is neither perfect nor complete.

I will cover the 4 major views that you can find in the Storm UI and go through all the parameters in each view. This might lead to some repetition, but I ignored that to keep this as simple as possible. Before we start, you need to know a few basic terms. You might need to click and zoom the images, as they are unreadable at the default zoom level.

Read More

03 Apr

Split brain issue in Elasticsearch [2-node Cassandra cluster for the Titan graph database]


I recently configured a 2-node Cassandra cluster with Elasticsearch installed on both nodes, creating another 2-node cluster (the Cassandra cluster was built using DataStax OpsCenter). Soon after I configured the cluster, I got to know that there are plenty of issues when you try to get this done with only 2 nodes.

Split-brain problem in Elasticsearch

When there are 2 or more nodes acting as master nodes, we call it a split brain. This can happen when one node does not know that it is already connected to a cluster with a master. When there is more than 1 master, indexing requests are handled by both nodes, and after some time the two copies of a shard diverge. We might have to do a full re-indexing to recover from this. You can find more information about the split-brain problem here.
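The usual mitigation on the pre-7.x Elasticsearch versions this post refers to is to require a quorum of master-eligible nodes, (n / 2) + 1, via discovery.zen.minimum_master_nodes. Here is a minimal sketch using the Python client; the node address is illustrative, and note that with only 2 nodes a quorum of 2 means the cluster has no master whenever either node is down, which is exactly why a third node is usually recommended.

```python
from elasticsearch import Elasticsearch

# Illustrative: point the client at one of the two nodes.
es = Elasticsearch(["http://node1:9200"])

# Require a quorum of master-eligible nodes before a master is elected:
# (2 / 2) + 1 = 2 for a 2-node cluster. The same setting can also be fixed
# statically in elasticsearch.yml on every node instead.
es.cluster.put_settings(
    body={"persistent": {"discovery.zen.minimum_master_nodes": 2}}
)
```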

Read More

03 Apr

Building a Cassandra cluster with DataStax OpsCenter [CentOS]


If you have tried creating a cluster through OpsCenter and failed, do not worry; it seems no one gets it right on the first try. When we went through the installation process, we observed that there are issues when installing the DataStax agents through OpsCenter, and we faced a lot of issues while installing them. After doing it several times, I settled on the easiest working way to install the agents and OpsCenter. If you are facing similar issues, this article might give you some other options to try out.

Issues we faced

  • OpsCenter cannot install the DataStax agents on the cluster nodes [Cluster provisioning failed: Exception: Installation stage failed]
  • OpsCenter cannot connect to the agent installed on other nodes [There was a problem verifying an ssh login]
  • OpsCenter cannot connect to its own node via the loopback address
  • OpsCenter cannot reconnect to nodes once the cluster is restarted
  • OpsCenter cannot connect to nodes with public IPs
  • The DataStax agent is not running on the node
  • Keep getting a message that an old node, which has been removed, is still running [OpsCenter agent is still running on an old node that was decommissioned or is part of a cluster that OpsCenter is no longer monitoring]

Read More

24 Jan

Moving from md5 towards a more secure hashing technique


Suppose you have an old website or web application with some user base, and you have hashed user passwords with md5, sha1, sha256, etc. Knowing that md5 and other old hashing techniques have been compromised through brute-force attacks, collision attacks and so on, you are searching for a way to move your hashing to something more secure like bcrypt or crypt_blowfish. This article talks about how to do that.
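One common migration strategy, sketched here under my own assumptions rather than as the exact approach the full article takes, is to wrap every existing md5 hash inside bcrypt, so all stored hashes are strengthened immediately instead of waiting for each user's next login.

```python
import hashlib
import bcrypt  # pip install bcrypt

def legacy_md5(password: str) -> str:
    # The old scheme: a plain md5 hex digest stored against the user.
    return hashlib.md5(password.encode("utf-8")).hexdigest()

def upgrade_stored_hash(md5_hash: str) -> bytes:
    # One-off migration: bcrypt the existing md5 digest in place, so the
    # weak hash is never stored on its own again.
    return bcrypt.hashpw(md5_hash.encode("utf-8"), bcrypt.gensalt())

def verify(password: str, stored: bytes) -> bool:
    # At login, recompute the legacy md5 first, then check it against bcrypt.
    return bcrypt.checkpw(legacy_md5(password).encode("utf-8"), stored)
```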

Read More
