25 Sep

Application Aware Sharding for a MySQL Database

Application Aware Sharding for a MySQL Database


When a MySQL database grows too large, it becomes extremely hard to maintain and eventually reaches physical limits. The maintainability issues we typically come across include:

  1. ALTER TABLE statements take far too long to run
  2. Taking dumps becomes impossible
  3. Tables have to be moved to separate disks from time to time
  4. And so on

If you are looking at database sharding as a solution, these maintainability issues have probably exhausted you. In this article, I am going to share my personal experience of sharding a large MySQL database in real life. Through hands-on work on this large project, I learned a lot about sharding and about the organization-level issues, concerns, and limitations you will face when taking it on. My aim is to go through the project end to end, so that anyone who is about to do something similar can benefit from my experience.
Before getting into the article, I must share several things with you.

  1. After this project, do not expect any performance gain in your day-to-day queries. It may actually add overhead, as your data will be scattered across multiple places. However, it will greatly improve maintainability and make sure that your data architecture survives in the fullness of time.
  2. What I am going to talk about is application-aware sharding. There are a few other ways to do sharding in the data layer that are seamless to applications; however, at the time of writing I could not find any tool that supports MySQL sharding at the DB level. Existing options are either unreliable or add a large overhead to queries.
  3. Sharding itself might not be the full answer to the problem. Sharding may need to be accompanied by several sub-projects that supplement it to reach its potential. This is further explained in the next section (“Problem and Solution”). I will concentrate on the sharding project and will not explain much about the other supplementary projects.
  4. You must be in real need of this project, and you must have identified it as the only way to solve your problem before proceeding, because what you are about to start is a big project.
  5. I have skipped some sensitive information as it relates to the company I am working for.

Problem and Solution

Problem: Data in our MySQL database was growing at a rapid rate, and the growth led to many maintainability issues. As the amount of data grew and the number of queries per second increased, there was even a possibility of downtime. We needed to keep the data volume in MySQL constant and at a maintainable level.
Solution: We decided to shard the data and, at the same time, purge old data to keep the size of any one database roughly constant. Therefore, two supplementary projects were carried out alongside the sharding project: one to purge old data (the purging project), and one for master data management (the MDM project), which moved a selected set of data to a central location. Data that could not be sharded for various reasons, and data that is global to the company, was moved out in the MDM project.
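To make "application aware" concrete: the application itself decides which physical database a query should go to, by mapping a stable shard key (such as a customer id) to a shard. Below is a minimal Python sketch of that routing idea; the DSNs and function names are hypothetical, not our production code.

```python
# Minimal sketch of application-aware shard routing.
# The DSN list and naming are illustrative assumptions.
import hashlib

SHARDS = [  # hypothetical DSNs; one entry per physical MySQL database
    "mysql://db-shard-0:3306/app",
    "mysql://db-shard-1:3306/app",
    "mysql://db-shard-2:3306/app",
]

def shard_for(customer_id: int) -> str:
    """Deterministically map a customer to a shard.

    md5 is used instead of Python's built-in hash() so the mapping is
    stable across processes and restarts.
    """
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for one customer consistently lands on the same shard:
assert shard_for(42) == shard_for(42)
```

Note that with plain modulo hashing, changing the number of shards remaps almost every key, which is one reason resharding later is painful.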


Read More

09 Jul

Toastmasters Round Robin session, Round Robin Master Role and New ideas


What is round robin session and who is round robin master

The Round Robin Session was developed by Toastmasters in Sri Lanka. It is a warm-up session that happens just after the introduction of the role players and just before the prepared speeches. The aim of the session is to improve members' speaking and listening skills.

The Round Robin Master is the person who leads this session by presenting a topic or a rolling story, and handles its flow to make sure everyone gets a chance to speak. This is highly important, as it might be the only time in the meeting that everyone gets to speak.

Read More

04 Apr

Redis Sorted Object Set (Sorted Hashes)

Redis Sorted Object Set (Sorted Hashes)

We all know that the best way to store an object in Redis is through Redis hashes, and that Redis sorted sets keep a set of values ordered by a given score. What if we need a sorted set of objects? That would be a “sorted set of objects”, which Redis does not support by default. We came up with a simple hybrid data structure that allows sorted objects using Redis hashes and Redis sorted sets.
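To make the idea concrete, here is a minimal in-memory Python sketch of the pattern, with plain dicts standing in for a live Redis connection; each operation is annotated with the Redis command it corresponds to. In real Redis, put and range each become two commands, ideally pipelined.

```python
class SortedObjectSet:
    """In-memory sketch of the hybrid: one hash per object, plus a
    single sorted set that orders the object keys by score."""

    def __init__(self):
        self._objects = {}  # key -> field dict  (Redis: HSET obj:<key> field value ...)
        self._scores = {}   # key -> score       (Redis: ZADD index score key)

    def put(self, key, obj, score):
        self._objects[key] = obj   # HSET obj:<key> ...
        self._scores[key] = score  # ZADD index score key

    def range(self, start, stop):
        """Objects ordered by score (Redis: ZRANGE index start stop,
        then one HGETALL obj:<key> per returned key). Like ZRANGE,
        stop is inclusive."""
        ordered = sorted(self._scores, key=self._scores.get)
        return [self._objects[k] for k in ordered[start:stop + 1]]

s = SortedObjectSet()
s.put("u1", {"name": "Ann"}, score=30)
s.put("u2", {"name": "Bob"}, score=10)
print(s.range(0, 1))  # Bob first: lower score sorts first, as in ZRANGE
```

The key design choice is that the sorted set stores only keys, never the objects themselves, so an object's fields can be updated in place without touching its position in the ordering.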

Read More

01 Jan

Sri Lanka Holidays Calendar 2016 for office365, Google etc. in ics format

Sri Lanka Holidays Calendar 2016 for Outlook & Google

It's really important to track holidays when you plan work ahead, but most of the online digital calendars we use these days don't have a built-in option to add the holiday calendar for Sri Lanka. Add the Sri Lanka Holidays Calendar 2016 to your digital calendar: download the ICS format SRI-LANKA-HOLIDAYS-2016 .

Adding to office365

  1. Add calendar
  2. From file (don't go to the holiday calendar; Sri Lanka is not available there)
  3. Select the file to upload and the calendar that the holidays will be added to (creating a separate new calendar is recommended; use the color RED)
  4. Save

Read More

10 Oct

Romantic Opera (රෝමාන්තික ඔපෙරාව) By Kasun Kalhara (Lyrics)

Romantic Opera (රෝමාන්තික ඔපෙරාව) By Kasun Kalhara Lyrics

I looked for these lyrics all over the internet and finally found them on a blog (http://dumbaat.blogspot.com/), and thought they deserved a re-blog. This is one of my favorite songs.

හේමන්ත නිද්‍රාවේ, ගිලෙන හැන්දෑවක
නිසල තාවයේ මතුවෙන්නේ
ඔබේ දෑස පත්ලේ ඇවිලෙන, ප්‍රේම අද්‍යාශ යයි

සිහිනේ උමතු නිම්නේ, මා හුදෙකලාවේ නොහෙලා
ආදරේ දෙවඟන හැම රාත්‍රියකම, හැම හිමිදිරියක, හැම දහවලකම
වෙලී ගයන්න එනු මැනා සොප්‍රනෝ හඬින්

රෝමාන්තික ඔපෙරාවේ, හි සියුම් නාද සන් රචනයේ,
හැම ස්වරයකින් ම වෑ හේන්නේ, ඔබේ මගේ නොනිම් ප්‍රේමය යි.

සඳ දිය දහරාවේ, පෙඟෙන ලැමේ දෙයද,
නග්න තලාවේ මතු වෙන්නේ
ඔබේ සියුමැලි සිරුරේ උමතු ආලිංගනය යි

සිහිනේ උමතු නිම්නේ, මා හුදෙකලාවේ නොහෙලා
ආදරේ දෙවඟන සදහට ම එකට,
එක ආදරයක එක මිහිරියාව ක
ගිලී රඟන්න එනු මන දිනුම් මතින්

රෝමාන්තික ඔපෙරාවේ, හි සියුම් රංග වින්න්‍යාස යේ,
හැම රැඟුම කින්ම මැවෙන්නේ,  ඔබේ මගේ නොනිම් ප්‍රේමය යි.

රෝමාන්තික ඔපෙරාවේ, හි සියුම් නාද සන් රචනයේ,
හැම ස්වරයකින් ම වෑ හේන්නේ, ඔබේ මගේ නොනිම් ප්‍රේමය යි.

හැම රැඟුම කින්ම මැවෙන්නේ,  ඔබේ මගේ නොනිම් ප්‍රේමය යි.


19 Jul

[Scrum] Adding new priorities to the current sprint

[Scrum] Adding new priorities to the current sprint

We see changes in priorities from time to time. Depending on the nature of the team, you might need to expect new priorities even in the middle of a sprint. A new priority might come as a production issue or as a new feature with a higher ROI. I work in a service team where we get high-priority items all the time. In this article, I would like to talk about how we handle them.

What is the ideal way to handle additional requirements [real Scrum way]

Once a sprint begins, you cannot add requirements or stories to the sprint backlog

It is as simple as that: you cannot add additional requirements to the sprint once it begins. In addition, keep in mind that if you do not obey even one rule of Scrum, you cannot claim to be following Scrum. Therefore, if you really need to stay in pure Scrum, you need to close the current sprint, re-prioritize the backlog, and start a new sprint with the new priority list.

The ideal way doesn't seem to be efficient for us

As you might already know, there are many hybrid versions of Scrum, because Scrum is inefficient in some situations. In our case, some team members get a lot of ad-hoc tasks (related to database administration), and being a service team, all members get some amount of ad-hoc work. We decided to fork the original Scrum and make our own version of it, in search of better efficiency.

Read More

13 Jul

Setting up basic authentication for DBfit with Nginx

Setting up basic authentication for DBfit with Nginx

First, you need to install Nginx on the server. I performed this on an Ubuntu server, so the following commands need to be adapted to your OS if you are not an Ubuntu user. This setup has two main parts: first we will install Nginx, and then we will move on to the basic authentication setup for DBfit.

Setting up Nginx (extremely straightforward)

This should install Nginx on the Ubuntu machine, but unluckily I got the following error. [If it installed and started correctly without any errors, you can skip this section.]

This happened due to the default file in the sites-available folder. You can simply remove the default symbolic link in the sites-enabled folder and start Nginx again; this time it should start without any issue.
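For reference, here is a sketch of the end state. The port, user name, password, and file paths are assumptions; adjust them for your DBfit instance. The htpasswd entry is generated with openssl, which avoids installing apache2-utils.

```shell
# Hypothetical example; adapt paths/credentials to your environment.
# Install Nginx first (Ubuntu): sudo apt-get update && sudo apt-get install -y nginx

# 1. Password file with one user "dbfit" (password "secret" is a placeholder):
printf 'dbfit:%s\n' "$(openssl passwd -apr1 'secret')" > dbfit.htpasswd
# In a real deployment, move this to e.g. /etc/nginx/dbfit.htpasswd

# 2. Minimal site config: basic auth in front of DBfit (assumed on port 8085):
cat > dbfit.conf <<'EOF'
server {
    listen 80;
    location / {
        auth_basic           "DBfit";
        auth_basic_user_file /etc/nginx/dbfit.htpasswd;
        proxy_pass           http://127.0.0.1:8085;
    }
}
EOF
# Place dbfit.conf in sites-available, symlink it into sites-enabled,
# then reload: sudo nginx -s reload
```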

Read More

31 May

Continuous data loading from MySQL to Hive [Using Tungsten]

Continuous data loading from MySQL to Hive

Introduction [Skip If needed]

What is Hive

“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”

~ Hive official site

Why Hive

Hive is used by many major companies, including Facebook. Hive excels at batch processing of large amounts of data, and any use case that involves querying large datasets can benefit from it. Hive is strong in availability, scalability, and manageability. Hive can now store complicated schemas and supports advanced operations such as table alteration. If you are stuck with a MySQL warehouse and want to get your data into Hive for complex querying, it is not a hard job to replicate the same RDBMS schema in Hive. The following are some real-world use cases of Hive:

  • Bizo: We use Hive for reporting and ad hoc queries.
  • Chitika: for data mining and analysis
  • CNET: for data mining, log analysis and ad hoc queries
  • Digg: data mining, log analysis, R&D, reporting/analytics
  • Grooveshark: user analytics, dataset cleaning, machine learning R&D.
  • Hi5: analytics, machine learning, social graph analysis.
  • HubSpot: to serve near real-time web analytics.
  • Last.fm: for various ad hoc queries.
  • Trending Topics: for log data normalization and building sample data sets for trend detection R&D.
  • VideoEgg: analyze all the usage data

Hopefully you have enough motivation now; let us get down to business.


Continuous data loading from MySQL to Hive

  1. Use Tungsten to extract the MySQL binlogs as CSV files
  2. Use the DDLScan tool to create the Hive staging and base table structures
  3. Use a Map-Reduce job to load the base tables from the staging tables
  4. [Optional] Use the bc tool to compare the MySQL tables and Hive tables
  5. Set up continuous loading
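Step 3 deserves a closer look. The staging table holds change rows extracted from the binlog (insert/update/delete, plus a sequence number that preserves binlog order), and the job folds them into the base table so that only the latest state per primary key survives. A simplified Python sketch of that merge logic follows; real Tungsten staging rows carry more metadata, and in production this runs as a Map-Reduce job, not a single function.

```python
def merge(base_rows, staging_rows):
    """Fold CDC rows from a staging table into the base table.

    base_rows:    {pk: row_dict} -- current state of the base table
    staging_rows: list of (seq, op, pk, row_dict) where op is 'I', 'U'
                  or 'D', and seq orders the changes as extracted
                  from the binlog.
    """
    merged = dict(base_rows)
    for _seq, op, pk, row in sorted(staging_rows):  # apply in binlog order
        if op == "D":
            merged.pop(pk, None)   # delete wins over any earlier state
        else:
            merged[pk] = row       # 'I' and 'U' both become an upsert
    return merged

base = {1: {"name": "old"}}
staging = [(1, "U", 1, {"name": "new"}),
           (2, "I", 2, {"name": "two"}),
           (3, "D", 2, None)]
print(merge(base, staging))  # {1: {'name': 'new'}}
```

The "latest change per key wins" rule is what makes the load idempotent: re-running the job over the same staging data produces the same base table.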

Read More

08 Apr

Reading and Understanding the Storm UI [Storm UI explained]

Understanding storm UI

I tried to find a document that explains every parameter in the Storm UI, but I could not find any, so I thought of starting one of my own. I will start with whatever information I have so far and try to make it as complete as possible. Thanks to all the forums and sites that helped me find this information. Note that this is a living document and you can suggest edits through the comments, as it is neither perfect nor complete.

I will cover the 4 major views that you can find in the Storm UI and go through all the parameters in each view. This might lead to some repetition, but I accepted that to keep things as simple as possible. Before we start, you need to know a few basic terms. You might need to click and zoom the images, as they are unreadable at the default zoom level.

Read More

03 Apr

Split brain issue in Elasticsearch: 2-node Cassandra for Titan graph database

Split brain issue in Elasticsearch

I recently configured a 2-node Cassandra cluster with Elasticsearch installed on both nodes, creating another 2-node Elasticsearch cluster (the Cassandra cluster was built using DataStax OpsCenter). Soon after I configured the cluster, I got to know that there are quite a few issues when you try to get this done with only 2 nodes.

Split brain problem in Elasticsearch

When there are 2 or more nodes acting as master nodes, we call it a split brain. This can happen when one node does not know that it is already connected to a cluster with a master. When there is more than one master, indexing requests will be completed by both nodes, and after some time the two copies of a shard diverge. We might have to do a full re-indexing to overcome this issue. You can find more information about the split-brain problem here.
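The standard guard in the Elasticsearch versions of that era was the `discovery.zen.minimum_master_nodes` setting in `elasticsearch.yml`, which should be set to a quorum of master-eligible nodes. With only 2 nodes the quorum is 2, which means the cluster cannot elect a master if either node goes down; that loss of availability is why a third, even data-less, node is usually recommended.

```yaml
# elasticsearch.yml (pre-7.x versions): a node may only become master
# when it can see this many master-eligible nodes, itself included.
# Rule of thumb: (master_eligible_nodes / 2) + 1
discovery.zen.minimum_master_nodes: 2
```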

Read More