31 Mar

Microsoft SQL Server 2016 Database with IMDB 2013 Dataset

Microsoft SQL Server 2016 Database with IMDB 2013 Dataset

Recently I wanted to run the JOB benchmark for an experiment. This benchmark uses an IMDB dataset, published in 2013. Initially, I had some trouble running the benchmark as it was designed for a PostgreSQL database. And the dataset was created in a UNIX system which can create issues when used in a Windows system. So I decided to share the exact steps you need to take to take in order to create a Microsoft SQL Server database with IMDB dataset. All the scripts used in the project can be found in this Git repo.

Step 1: Create a database in SQL server.

I simply used SSMS for this but you are free to use any method that you are familiar with. If you want to specifically set the data and log locations for the data set you can use the below script. Replace the db_location with the right location for your system.

Step 2: Create Tables

Don’t use the script you find in the JOB benchmark or with the dataset. This script is created for PostgreSQL and will not work as it is for SQL Server. You can use the edited version below. Note that I have used limits for all the VARCHAR columns under 8000 length. This limit was based on the maximum length of each column. For columns with maximum length over 8000, I have used the VARCHAR(MAX) type.

Step 3: Convert data files to a SQL server/Windows-friendly version

Extract the downloaded zip file (I used 7-zip and had to extract it twice to access the files). Seems like these files were created in UNIX system and there are differences in the escaping and the line ending. I run the following python code to edit the files. It creates the edited files in a separate folder called ‘edited’ inside the original folder. Replace the data_location with the right location for your system. Make sure you have only data files (csv) in the data location, nothing else.

Step 4: Load data using Bulk load

Now I’m going to use the SQL Server bulk insert to insert the data into the tables. Replace the <data location> with the right location for your system. You don’t have to set the ROWTERMINATOR, FIELDTERMINATOR or theFIELDQUOTE, as we have defined the format as CSV.

Step 5 (Optional): Create Foreign Keys

Foreign keys specified in the JOB are not that clear for me. Some of them contradict with the dataset. I re-wrote them to SQL Server based on my understanding. JOB mention that these foreign keys are optional.

Step 6 (Optional): Truncate long columns

Some of you might not like to keep VARCHAR(MAX) columns as you cannot create non-clustered indexes on them. You can use the below scripts to truncate the columns and ALTER the table.

Next steps with running the JOB will be added soon…

Last updated by .

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.