This is a new project I have been working on since early last year. The motivation behind the project is to build a programming language that lets users analyze private data without exposing sensitive information. Most data analysis languages on the market today (R, Python, MATLAB, etc.) assume direct access to the data. PRIVATE, on the other hand, performs a privacy calculation to make sure only non-sensitive information is released to the user.
This is a tutorial series by Simon Dennis, the founder of PRIVATE.
Recently I wanted to run the JOB benchmark for an experiment. This benchmark uses an IMDB dataset published in 2013. Initially I had some trouble running the benchmark, as it was designed for a PostgreSQL database, and the dataset was created on a UNIX system, which can cause issues when used on Windows. So I decided to share the exact steps you need to take in order to create a Microsoft SQL Server database with the IMDB dataset. All the scripts used in the project can be found in this Git repo.
Jaro–Winkler similarity is a widely used measure for checking the similarity between two strings. Being a similarity measure (not a distance measure), a higher value means more similar strings. You can read about the basics and how it works on Wikipedia; implementations are available in many places, so I am not going into that here. However, none of these sources talks about how to correctly count the number of transpositions in complex situations.
A transposition is defined as “matches which are not in the same position”. For a simple example like ‘cart’ vs ‘cratec’ it is obvious: there are 4 matches and 2 transpositions (‘r’ and ‘a’ are not in the same position). But for 'xabcdxxxxxx' vs 'yaybycydyyyyyy', at first glance all letters seem to be out of position, yet there are no transpositions (4 matches). For the very similar 'xabcdxxxxxx' vs 'ydyaybycyyyyyy', there are 4 transpositions (4 matches). With examples like these, counting the number of transpositions is not trivial.
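To make the counting rule concrete, here is a minimal Python sketch (my own illustration, not taken from any particular library) of Jaro similarity. The important part is step 2: the matched characters of both strings are walked in their original order and the out-of-order pairs are counted. The examples above count these raw out-of-order matches; the Jaro formula then uses half of that number.

def jaro_similarity(s1: str, s2: str) -> float:
    # Minimal sketch for illustration; not optimized.
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two characters match only if they are equal and no further apart than this window.
    window = max(len1, len2) // 2 - 1

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    # Step 1: find the matches, using each character of s2 at most once.
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Step 2: walk the matched characters of both strings in order and count
    # positions where they disagree. This is what resolves the tricky examples
    # above ('xabcdxxxxxx' vs 'yaybycydyyyyyy' correctly gives 0 here).
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1

    transpositions = out_of_order / 2  # the Jaro formula uses half the raw count

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0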
If you are a database administrator or a developer working with a transactional database, you might have come across this problem:
“Is it worth building that index?”
The exact answer to that question is only known once you build the index. Luckily, however, SQL Server provides functionality to check workload performance under hypothetical indexes (without actually creating them).
You can find more information about hypothetical indexes here.
I will just provide you with a simple piece of Python code that will help you with hypothetical index creation. The example code is composed of three parts (a rough sketch follows the list below):
Enabling the index (unlike normal indexes, hypothetical indexes need to be enabled before use)
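Below is a minimal sketch of driving this from Python with pyodbc. The server, database, table and column names are placeholders, and the hypothetical-index commands (WITH STATISTICS_ONLY, DBCC AUTOPILOT, SET AUTOPILOT) are undocumented SQL Server features; the argument convention shown here is the commonly described one, so verify it against your SQL Server version before relying on it.

import pyodbc

# Placeholder connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=SalesDB;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# 1. Create the hypothetical index (only statistics are built, no index pages).
cur.execute("CREATE INDEX ix_hyp_orders_customer "
            "ON dbo.Orders (CustomerId) WITH STATISTICS_ONLY = -1")

# Look up the ids that DBCC AUTOPILOT needs.
cur.execute("SELECT DB_ID(), OBJECT_ID('dbo.Orders')")
db_id, obj_id = cur.fetchone()
cur.execute("SELECT index_id FROM sys.indexes "
            "WHERE object_id = ? AND name = 'ix_hyp_orders_customer'", obj_id)
index_id = cur.fetchone()[0]

# 2. Enable the hypothetical index for this session.
cur.execute(f"DBCC AUTOPILOT(0, {db_id}, {obj_id}, {index_id})")

# 3. Ask the optimizer for an estimated plan that considers hypothetical indexes.
#    Under SET AUTOPILOT ON the query is not executed; a showplan is returned instead.
cur.execute("SET AUTOPILOT ON")
cur.execute("SELECT * FROM dbo.Orders WHERE CustomerId = 42")
print(cur.fetchone()[0])
cur.execute("SET AUTOPILOT OFF")

# Clean up: drop the hypothetical index when you are done.
cur.execute("DROP INDEX ix_hyp_orders_customer ON dbo.Orders")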
Being a startup feels “great”. Startup culture is filled with positive energy to get things done. In this process of getting things done, one thing we often miss is proper design of the data platform. It is understandable that people start with a simple data platform and evolve it over time; starting with the perfect data platform is less practical when we consider the cost involved and the lack of domain knowledge in the initial stages. We should all admit that a proper data platform costs a lot, which is sometimes not affordable for a startup. My personal opinion is to start small and evolve with time. Here we will talk about common problems that we faced in a startup data platform.
Scalability issues show up in several ways. Startup systems are not built to scale until the end of time. Sometimes they become impossible to scale; sometimes scaling requires so much additional effort that a separate team is needed to work on scaling the data platform. Sometimes scaling comes with a large, rapidly increasing cost. Sometimes scaling increases overall system complexity and reduces maintainability. To summarize, the main impact areas of scalability costs are as follows:
Being impossible to scale
High cost of scaling
Increasing manual work required to scale
Increased system complexity while scaling
Reduced system maintainability
A proper data platform design should address the above concerns. It should be scalable beyond the foreseeable future, and while scaling it should minimize additional cost, avoid additional complexity, and involve minimal or no manual effort.
In most datasets with a large number of events, going through individual events is less important; most data use cases revolve around summarized data. Druid summarizes this raw data at ingestion time using a process referred to as “roll-up”. The roll-up granularity is the finest granularity of the stored data, and you will only be able to query down to that granularity. However, there are some scenarios where it is important to have more granular data, and keeping more granular data comes at a cost. We did a small experiment to identify how different roll-up levels affect performance.
Rolling up data can dramatically reduce the size of data that needs to be stored (up to a factor of 100). Druid will roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This storage reduction does come at a cost; as we roll up data, we lose the ability to query individual events. Phrased another way, the rollup granularity is the minimum granularity you will be able to explore data at and events are floored to this granularity. Hence, Druid ingestion specs define this granularity as the queryGranularity of the data. The lowest supported queryGranularity is millisecond. -http://druid.io
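For reference, the roll-up level is controlled through the granularitySpec of the ingestion spec. The fragment below (shown as a Python dict for readability) is purely illustrative; the values are placeholders, not the spec we used in the experiment.

# Illustrative granularitySpec fragment; values are placeholders.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # how data is partitioned into segments
    "queryGranularity": "HOUR",    # the roll-up level: e.g. NONE, MINUTE, HOUR, DAY
    "rollup": True,                # set to False to keep individual events
}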
Dataset and Setup
We chose a CSV data set with millions (150M+) of records containing sales data spanning two years. The CSV file was around 6 GB on disk. This is a narrow data set with 3 dimensions and 2 metrics. We had 2 servers on which all the components were deployed.
m4.large – Coordinator, Broker, Overlord nodes
r3.large – MiddleManager and Historical nodes
When a MySQL database gets too large, it becomes extremely hard to maintain and eventually reaches physical limits. The maintainability issues we normally came across were:
Taking too much time to ALTER a table
Becoming impossible to take dumps
Having to move tables to separate disks from time to time
If you are looking at database sharding as a solution, these maintainability issues have probably already exhausted you. In this article, I am going to share my personal experience of sharding a large MySQL database in real life. With hands-on experience of this large project, I am aware of many of the sharding-related and organization-level issues, concerns and limitations you will face when going ahead with such a project. This is an attempt to go through the project end to end, so anyone who is about to do something similar can benefit from my experience. Before getting into the article, I must share several things with you.
After this project, do not expect any performance gain on your day-to-day queries. It might actually add overhead to your queries, as your data will be scattered among multiple places. However, it will greatly improve maintainability and will make sure that your data architecture survives in the fullness of time.
What I am going to talk about is application-aware sharding. There are a few other ways to do sharding in the data layer which are seamless to the applications; however, at the time of writing I did not find any tool that supports MySQL sharding at the DB level. Existing options are either not reliable or add a large overhead to queries.
Sharding itself might not be the full answer to the problem. Sharding might need to be accompanied by several sub-projects that supplement sharding to reach its potential. This is further explained in the next section (“Problem and Solution”). I will be concentrating on the sharding project and will not explain much about the other supplementary projects.
You must be in real need of this project, and you must have identified it as the only way to solve the problem before proceeding, because what you are about to start is a big project.
I have skipped some sensitive information as it relates to the company I am working for.
Problem and Solution
Problem: Data was growing in our MySQL database at a rapid rate, and we needed to make its size constant as the growth led to many maintainability issues. As the amount of data grew and the number of queries per second increased, there was even a possibility of downtime. We needed to make the data volume in MySQL constant and keep it at a maintainable level.

Solution: We thought of partitioning the data while at the same time purging old data, so that the size of the data in one database stays roughly constant. Therefore, two supplementary projects were carried out alongside the sharding project: one to purge old data (the purging project), and one for master data management (the MDM project, which moved a selected set of data to a central location). Data that could not be sharded for various reasons, and data that was global to the company, was moved out in the MDM project.
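To make application-aware sharding concrete, the idea is that the application itself decides which database a row lives in, based on a shard key. The sketch below is illustrative only; the shard key, the number of shards and the modulo routing rule are placeholders, not our actual implementation (a lookup-table based router is also common, since it makes rebalancing easier).

# Illustrative application-aware shard routing; names and rule are placeholders.
SHARDS = {
    0: {"host": "mysql-shard-0.internal", "database": "sales_shard_0"},
    1: {"host": "mysql-shard-1.internal", "database": "sales_shard_1"},
    2: {"host": "mysql-shard-2.internal", "database": "sales_shard_2"},
    3: {"host": "mysql-shard-3.internal", "database": "sales_shard_3"},
}

def shard_for(customer_id: int) -> dict:
    """Return the connection settings of the shard that owns this customer."""
    return SHARDS[customer_id % len(SHARDS)]

# The application routes every query through this lookup, for example:
# conn = mysql.connector.connect(**shard_for(order_customer_id), user="...", password="...")
# Data that cannot be sharded (global to the company) lives in the central MDM database instead.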
We all know that the best way to store an object in Redis is through Redis hashes, and Redis sorted sets are used to keep a sorted set with a given score for each value. What if we need a sorted set of objects? What we need is a “sorted set of objects”, which Redis does not support by default. We came up with a simple hybrid data structure that allows sorted objects using Redis hashes and Redis sorted sets.
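Here is a minimal sketch of the idea using redis-py: each object is stored in a Redis hash, and the hash key is placed in a sorted set with a score, so objects can be fetched in score order. The key names ("user:{id}", "users:by_score") and fields are placeholders I chose for illustration.

import redis

r = redis.Redis()

def save_user(user_id: str, fields: dict, score: float) -> None:
    # The object itself lives in a hash.
    r.hset(f"user:{user_id}", mapping=fields)
    # Its position in the ordering lives in a sorted set keyed by the hash name.
    r.zadd("users:by_score", {f"user:{user_id}": score})

def top_users(n: int = 10) -> list:
    # Highest score first; each member of the sorted set is the key of a hash.
    keys = r.zrevrange("users:by_score", 0, n - 1)
    return [r.hgetall(k) for k in keys]

# Example usage:
save_user("42", {"name": "Alice", "country": "LK"}, score=87.5)
save_user("7",  {"name": "Bob",   "country": "US"}, score=91.0)
print(top_users(2))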
We see changes in priorities from time to time. Given the nature of the team, you might need to expect new priorities even in the middle of a sprint. New priorities might come as a production issue or as a new feature with a higher ROI. I work in a service team where we get high-priority items all the time. In this article, I would like to talk about how we handle those issues.
What is the ideal way to handle additional requirements? [the real Scrum way]
Once a sprint begins, you cannot add requirements or stories to the sprint backlog
It is as simple as that: you cannot add additional requirements to the sprint once the sprint begins. In addition, you need to keep in mind that if you break even one rule of Scrum, you cannot claim to be following Scrum. Therefore, if you really need to follow pure Scrum, you need to close the current sprint, re-prioritize the backlog, and start a new sprint with the new priority list.
The ideal way doesn’t seem to be efficient for us
As you might already know, there are many hybrid versions of Scrum, because Scrum is inefficient in some situations. In our case, some team members get a lot of ad-hoc tasks (related to database administration). Being a service team, all team members get some amount of ad-hoc work. We decided to fork the original Scrum and make our own version of it, in search of better efficiency.
First you need to install Nginx on the server. I performed this on an Ubuntu server, so the following commands need to be altered according to your OS if you are not an Ubuntu user. This setup has two main parts: first we will install Nginx, and then we will move on to the basic authentication setup for DBfit with Nginx.
Setting up Nginx (extremely straightforward)
sudo apt-get install nginx
sudo service nginx start
This should install Nginx on the Ubuntu machine. But unluckily, I got the following error. [If it installed and started correctly without any errors, you can skip this section.]
This happened due to the default file in the sites-available folder. You can simply remove the default symbolic link in the sites-enabled folder and start Nginx again. This time it started without any issue.