Continuous data loading from MySQL to Hive [Using Tungsten]
Introduction [Skip If needed]
What is Hive
“The Apache Hive⢠data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”
Why Hive
Hive is been used in lot of Major companies including Facebook and Google. Hive excels at real time processing of large amount of data. Any use case that talks about querying large amounts of data in near real time can be benefited by hive. Hive is superior in availability, scalability and manageability. Hive now have capability to store complicated schemas and advance operations like table alteration. If you are a user stuck with a MySQL warehouse and want to get your data into hive for some real time complex querying, it will not be a hard job to replicate the same RDBMS schema in the hive. Following are some real world use cases of hive
- Bizo: We use Hive for reporting and ad hoc queries.
- Chitika: for data mining and analysis
- CNET: for data mining, log analysis and ad hoc queries
- Digg: data mining, log analysis, R&D, reporting/analytics
- Grooveshark: user analytics, dataset cleaning, machine learning R&D.
- Hi5: analytics, machine learning, social graph analysis.
- HubSpot: to serve near real-time web analytics.
- Last.fm: for various ad hoc queries.
- Trending Topics: for log data normalization and building sample data sets for trend detection R&D.
- VideoEgg: analyze all the usage data
Hope you have enough motivation now let us move into the business.
Process
- Use Tungsten to get the MySQL bin logs CSV files
- Use DDLScan tool to create the Hive Staging and base table structures
- Use Map-Reduce job to load base tables from the staging tables.
- [Optional] Use bc tool to compare MySQL table and hive table
- Setup continuous loading
Read More »Continuous data loading from MySQL to Hive [Using Tungsten]