19 Jul

NetworkX visualization with Graphviz (Example)


If you are trying to produce a nice-looking graph visualization with NetworkX, you are probably exhausted by now. After all, NetworkX only provides basic functionality for graph visualization; its main goal is to enable graph analysis. For anything beyond basic visualization, it’s advisable to use a separate, specialized library. In my case, I chose Graphviz. It’s simple to get an attractive visualization of a NetworkX graph with Graphviz. I’m starting gradually, but you may skip directly to “NetworkX with Graphviz”.
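To give a rough idea of what gets handed to Graphviz, here is a minimal sketch that turns an edge list into DOT text, the plain-text format Graphviz reads. It is not the method used in the full post: the helper name `edges_to_dot` and its parameters are hypothetical, and it deliberately avoids any library so it runs as-is. It assumes node names contain no double quotes.

```python
def edges_to_dot(edges, directed=False, name="G"):
    """Build Graphviz DOT text from an iterable of (source, target) pairs.

    Hypothetical helper for illustration; assumes node names contain no
    double-quote characters.
    """
    keyword = "digraph" if directed else "graph"
    arrow = "->" if directed else "--"
    lines = [f"{keyword} {name} {{"]
    for source, target in edges:
        # Quote node names so names with spaces remain valid DOT.
        lines.append(f'    "{source}" {arrow} "{target}";')
    lines.append("}")
    return "\n".join(lines)


# A small triangle graph given as plain tuples.
dot_text = edges_to_dot([("a", "b"), ("b", "c"), ("c", "a")])
print(dot_text)
```

Since the edges of a simple NetworkX graph iterate as pairs, something like `edges_to_dot(G.edges())` should work too. Saving the text to a file such as `graph.dot` lets Graphviz render it, for example with `dot -Tpng graph.dot -o graph.png`.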

18 Jul

Index Physical Structure Example; Multi-column Non-Clustered Index with Includes


This article demonstrates the physical structure of a multi-column non-clustered index with include columns. Many examples on the internet only demonstrate the simplest version of an index, with a single column. This article gives a proper view of an index with multiple columns through a simple example. Furthermore, you can see how the include columns are stored only at the leaf level of the tree.

Here we use a simple table, ‘People’, with 6 columns (ID, First Name, Last Name, Age, Sex, Address). We assume a clustered index already exists on the ID column (it makes almost no difference if there is no clustered index, as explained at the end). Now we are going to create the non-clustered index as defined below.

The diagram below shows the structure of this non-clustered index.

Structure of a non-clustered multi-column index with include columns.
27 May

PRIVATE: A Privacy-preserving Data Analysis Language.

This is a new project I have been working on since early last year. The motivation behind this project is to build a programming language that allows users to analyze private data without exposing sensitive information. Many data analysis languages currently in use (R, Python, MATLAB, etc.) assume direct access to the data. PRIVATE, on the other hand, performs a privacy calculation that makes sure only non-sensitive information is released to the user.

More Information:

This is the tutorial series by Simon Dennis, Founder of PRIVATE

Contribute to PRIVATE: GitHub

27 May

Water Bill Calculator – Sri Lanka

Most people got huge water bills after a while due to COVID-19. I wanted to double-check the calculation because my bill was rather large. Unfortunately, I didn’t find any online calculator that gets the job done (there was one from the Water Board, but it has a maximum limit of 60 days). So I went with the default option, Excel. I thought of sharing the Excel workbook I used, as it might be helpful to others. I really don’t know how the VAT calculation is done, so I used 8%.

31 Mar

Microsoft SQL Server 2016 Database with IMDB 2013 Dataset


Recently I wanted to run the JOB benchmark for an experiment. This benchmark uses an IMDB dataset published in 2013. Initially, I had some trouble running the benchmark, as it was designed for a PostgreSQL database and the dataset was created on a UNIX system, which can cause issues when it is used on a Windows system. So I decided to share the exact steps you need to take in order to create a Microsoft SQL Server database with the IMDB dataset. All the scripts used in the project can be found in this Git repo.

15 Jan

Jaro–Winkler Similarity – How to correctly count the number of transpositions

Jaro–Winkler Similarity is a widely used similarity measure for checking the similarity between two strings. Being a similarity measure (not a distance measure), a higher value means more similar strings.
You can read about the basics and how it works on Wikipedia. That material is available in many places, so I’m not going into it here. However, none of these sites talk about how to correctly count the number of transpositions in complex situations.

A transposition is defined as “matches which are not in the same position”. For a simple example like ‘cart’ vs ‘cratec’, it is obvious: there are 4 matches and 2 transpositions (‘r’ and ‘a’ are not in the same position). But for 'xabcdxxxxxx' vs 'yaybycydyyyyyy', at first glance all letters seem to be out of position, yet there are no transpositions (4 matches). For the very similar 'xabcdxxxxxx' vs 'ydyaybycyyyyyy', there are 4 transpositions (4 matches). With examples like these, counting the number of transpositions might not be trivial.
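The three examples above can be checked with a short sketch. It assumes the usual Jaro matching rule: a character matches the first unused equal character within a window of `max(len) // 2 - 1` positions. It then compares the matched characters of each string in their own order. The function names are hypothetical, and this is one reading of the procedure rather than necessarily the one developed in the full post. The count returned is the number of out-of-order matched characters, as used in the examples; the standard Jaro formula then halves this number.

```python
def matched_sequences(s1, s2):
    """Return the matched characters of s1 and of s2, each in its own order."""
    # Assumed matching window from the standard Jaro definition.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    seq1 = []
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 can be matched at most once.
            if not used[j] and s2[j] == ch:
                used[j] = True
                seq1.append(ch)
                break
    seq2 = [s2[j] for j in range(len(s2)) if used[j]]
    return seq1, seq2


def count_out_of_order(s1, s2):
    """Return (matches, out-of-order matched characters)."""
    seq1, seq2 = matched_sequences(s1, s2)
    out_of_order = sum(a != b for a, b in zip(seq1, seq2))
    return len(seq1), out_of_order


for a, b in [("cart", "cratec"),
             ("xabcdxxxxxx", "yaybycydyyyyyy"),
             ("xabcdxxxxxx", "ydyaybycyyyyyy")]:
    print(a, b, count_out_of_order(a, b))
```

The key point is that positions in the original strings are never compared directly. Only the order of the matched characters matters, which is why the second pair has no transpositions even though every matched letter sits at a different index.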
