25 Sep

Application Aware Sharding for a MySQL Database

Introduction

When a MySQL database grows too large, it becomes extremely hard to maintain and eventually hits physical limits. The maintainability issues we normally come across include:

  1. ALTERing a table takes too much time
  2. Taking dumps becomes impossible
  3. Tables have to be moved to separate disks from time to time
  4. Etc.

If someone is looking at database sharding as a solution, these maintainability issues must have exhausted them. In this article, I am going to share my personal experience of sharding a large MySQL database in real life. With hands-on experience on this large project, I am aware of a lot of things related to sharding, as well as the organization-level issues, concerns, and limitations you will face when going ahead with such a project. This attempt is to go through the project from end to end, so anyone who is about to do something similar can benefit from my experience.
Before getting into the article, I must share several things with you.

  1. After this project, do not expect any performance gain on your day-to-day queries. It might actually add overhead, as your data will be scattered among multiple places. However, it will greatly improve maintainability and will make sure that your data architecture survives in the fullness of time.
  2. What I am going to talk about is application-aware sharding. There are a few other ways to do sharding in the data layer that are seamless to the applications; however, at the time of writing I did not find any tool that supports MySQL sharding at the DB level. The existing options are either unreliable or add a large overhead on queries.
  3. Sharding itself might not be the full answer to the problem. Sharding might need to be accompanied by several sub-projects that supplement it so it can reach its potential. This is further explained in the next section (“Problem and Solution”). I will be concentrating on the sharding project and will not explain much about the other supplementary projects.
  4. You must be in real need of this project, and you must have identified it as the only way to solve your problem before proceeding, because what you are going to start is a big project.
  5. I have skipped some sensitive information as it relates to the company I am working for.

Problem and Solution

Problem: Data was growing in our MySQL database at a rapid rate, and this growth led to many maintainability issues. As the amount of data grew and the number of queries per second increased, there was even a possibility of downtime. We needed to keep the data volume in MySQL constant and at a maintainable level.
Solution: We decided to shard the data and, at the same time, purge old data, so that the size of any one database would stay somewhat constant. Therefore, two supplementary projects were carried out alongside the sharding project: one to purge old data (the purging project), and one for master data management (the MDM project, which moved a selected set of data to a central location). Data that could not be sharded for various reasons, and data that was global to the company, was moved out in the MDM project.
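To make "application aware" concrete: the application itself decides which physical database a row lives on, typically by hashing a shard key before issuing any query. The sketch below is illustrative only; the shard count, the DSN strings, and `customer_id` as the shard key are all assumptions, not the scheme used in the actual project.

```python
# Minimal sketch of application-aware shard routing.
# SHARD_DSNS and the choice of customer_id as shard key are hypothetical.

SHARD_COUNT = 4

# One connection string per physical MySQL database (made-up hosts).
SHARD_DSNS = {
    0: "mysql://db-shard-0/app",
    1: "mysql://db-shard-1/app",
    2: "mysql://db-shard-2/app",
    3: "mysql://db-shard-3/app",
}

def shard_for(customer_id: int) -> int:
    """Map a shard key to a shard number with a simple modulo scheme."""
    return customer_id % SHARD_COUNT

def dsn_for(customer_id: int) -> str:
    """The application picks the right database before issuing a query."""
    return SHARD_DSNS[shard_for(customer_id)]

print(dsn_for(42))  # -> mysql://db-shard-2/app
```

Note that a plain modulo scheme makes adding shards painful (most keys move); a lookup table or consistent hashing is a common refinement.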



31 May

Continuous data loading from MySQL to Hive [Using Tungsten]

Introduction [Skip If needed]

What is Hive

“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”

~ Hive official site

Why Hive

Hive is used by a lot of major companies, including Facebook. Hive excels at processing large amounts of data, and any use case that involves querying large amounts of data in near real time can benefit from Hive. Hive is strong in availability, scalability, and manageability. Hive now has the capability to store complicated schemas and supports advanced operations like table alteration. If you are stuck with a MySQL warehouse and want to get your data into Hive for some complex querying, it will not be a hard job to replicate the same RDBMS schema in Hive. The following are some real-world use cases of Hive:

  • Bizo: We use Hive for reporting and ad hoc queries.
  • Chitika: for data mining and analysis
  • CNET: for data mining, log analysis and ad hoc queries
  • Digg: data mining, log analysis, R&D, reporting/analytics
  • Grooveshark: user analytics, dataset cleaning, machine learning R&D.
  • Hi5: analytics, machine learning, social graph analysis.
  • HubSpot: to serve near real-time web analytics.
  • Last.fm: for various ad hoc queries.
  • Trending Topics: for log data normalization and building sample data sets for trend detection R&D.
  • VideoEgg: analyze all the usage data
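Replicating an RDBMS schema in Hive, as mentioned above, mostly comes down to translating column types and emitting equivalent DDL. The sketch below is a simplified, hypothetical version of that idea (tools such as Tungsten's DDLScan, covered later, automate it); the type mapping here is an assumption, not any tool's exact output.

```python
# Hypothetical sketch: map MySQL column types to Hive types and emit DDL.
# The mapping is simplified (e.g. decimal -> DOUBLE is lossy).

MYSQL_TO_HIVE = {
    "int": "INT",
    "bigint": "BIGINT",
    "varchar": "STRING",
    "text": "STRING",
    "datetime": "TIMESTAMP",
    "decimal": "DOUBLE",  # lossy simplification
}

def hive_ddl(table, columns):
    """columns: list of (name, mysql_type) pairs -> Hive CREATE TABLE text."""
    cols = ",\n  ".join(
        f"{name} {MYSQL_TO_HIVE.get(mysql_type, 'STRING')}"
        for name, mysql_type in columns
    )
    return f"CREATE TABLE {table} (\n  {cols}\n);"

print(hive_ddl("orders", [("id", "bigint"), ("note", "text")]))
```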

Hopefully you have enough motivation by now; let us get down to business.

Process

  1. Use Tungsten to extract the MySQL binlogs into CSV files
  2. Use the DDLScan tool to create the Hive staging and base table structures
  3. Use a Map-Reduce job to load the base tables from the staging tables
  4. [Optional] Use the bc tool to compare the MySQL tables with the Hive tables
  5. Set up continuous loading
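The heart of step 3 is materializing the change rows that land in the staging tables onto the base tables: for each primary key, the latest change wins, and deletes remove the row. A minimal sketch of that merge logic, assuming staging rows carry a sequence number, an op code (I/D), and a primary key (these column names mirror the change-data-capture idea, not the exact Tungsten staging schema):

```python
# Simplified sketch of the staging -> base merge (step 3 above).
# base is a {pk: row} snapshot; staging_rows are CDC change records.

def merge_to_base(base, staging_rows):
    """Apply change rows onto the base snapshot in sequence order."""
    for change in sorted(staging_rows, key=lambda c: c["seqno"]):
        pk = change["pk"]
        if change["op"] == "D":   # delete removes the row
            base.pop(pk, None)
        else:                     # insert/update: latest seqno wins
            base[pk] = change["row"]
    return base

base = {1: {"name": "alice"}}
staging = [
    {"seqno": 10, "op": "I", "pk": 2, "row": {"name": "bob"}},
    {"seqno": 11, "op": "D", "pk": 1, "row": None},
    {"seqno": 12, "op": "I", "pk": 2, "row": {"name": "bobby"}},
]
print(merge_to_base(base, staging))  # {2: {'name': 'bobby'}}
```

In practice this dedup-and-apply runs as a Map-Reduce (or Hive) job over the staging partitions rather than in application memory, but the ordering rule is the same.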

