27 May

PRIVATE: A Privacy-preserving Data Analysis Language.

This is a new project, I’m working on from early last year. The motivation behind this project is to build a programing language that allows users to analyze private data without exposing sensitive information. Many data analysis languages (R, Python, MATLAB etc.) in the current market assume direct access to data. PRIVATE, on the other hand, performs a privacy calculation that will make sure only non-sensitive information is released to the user.

More Information:

This is the tutorial series by Simon Dennis, Founder of PRIVATE

Contribute to PRIVATE: Git-hub

27 May

Water Bill Calculator – Sri Lanka

Most people got huge water bills after some time due to COVID-19. I wanted to double-check the calculation because the amount was somewhat big. Unlucky I didn’t find any online calculator that get the job done (there was a one in waterboard, but it has a maximum limit of 60 days). So I went with the default option, Excel. I thought of sharing the excel workbook I used as It might be helpful to others. I really don’t know how the VAT calculation is done, so I used 8%. (Thanks to Madura for the suggested edit)

Read More
31 Mar

Microsoft SQL Server 2016 Database with IMDB 2013 Dataset

Microsoft SQL Server 2016 Database with IMDB 2013 Dataset

Recently I wanted to run the JOB benchmark for an experiment. This benchmark uses an IMDB dataset, published in 2013. Initially, I had some trouble running the benchmark as it was designed for a PostgreSQL database. And the dataset was created in a UNIX system which can create issues when used in a Windows system. So I decided to share the exact steps you need to take to take in order to create a Microsoft SQL Server database with IMDB dataset. All the scripts used in the project can be found in this Git repo.

Read More
15 Jan

Jaro–Winkler Similarity – How to correctly count the number of transpositions

Jaro–Winkler Similarity is a widely used similarity measure for checking the similarity between two strings. Being a similarity measure (not a distance measure), a higher value means more similar strings.
You can read on basics and how it works on Wikipedia. It’s available in many places and I’m not going into that. However, none of these sites talks about how to correctly count the number of transpositions in complex situations.

Transposition is defined as “matches which are not in the same position”. For a simple example like ‘cart’ vs ‘cratec’ it is obvious with 4 matches and 2 transpositions (‘r’ and ‘a’ are in not in the same position). But for 'xabcdxxxxxx' vs 'yaybycydyyyyyy' in the first look, all letters seem to be out of position but there are no transpositions (4 matches). For very similar 'xabcdxxxxxx' vs 'ydyaybycyyyyyy', there are 4 transpositions (4 matches). With these examples, it might not be trivial to count the number of transpositions. 

Read More
11 Nov

Sri Lanka Holidays Calendar 2020 for office365, Google etc. in ics format

Sri Lanka Holidays Calendar 2016 for Outlook & Google

Edit: If you are looking for 2021 calendar, visit https://www.malinga.me/sri-lanka-holidays-calendar-2021-for-office365-google-etc-in-ics-format/

Honouring everyone’s request I added the Sri Lankan Holiday calendar for 2020. This will allow you to add the Sri Lanka Holidays Calendar 2020 to your digital calendar (Most of the online digital calendars that we use these days don’t have inbuilt option to add the holiday calendar for Sri Lanka). Download 2020 Sri Lankan Holiday Calendar  (For Outlook get this: Download 2020 Sri Lankan Holiday Calendar – Outlook)

Adding to office365

  1. Add calendar
  2. From file (don’t go to the holiday calendar, Sri Lanka is not available there)
  3. Select file to upload and a calendar that holidays will be added (Create a separate new calendar is recommended. Use RED colour)
  4. Save
Read More
27 Jan

What-if Analysis with SQL server (Hypothetical Indexes) – Using python

What-if Analysis with SQL server (Hypothetical Indexes) – Using python

If you are a Database administrator or a developer working with a transaction database, you might have come across this problem

“Is it worthy to build that index?”

Exact answer for that question is only known once you build it. However, luckily SQL server provides you with functionality to check the workload performance under hypothetical indexes (without actually creating them)

You can find more information about hypothetical indexes here.

I will just provide you with a simple python code that will help you with the hypothetical index creation. Example code will compose of 3 parts

  1. Index creation
  2. Enabling the index (unlike the normal indexes you need to enable them before using)
  3. Executing the query under the hypothetical index

Index creation

def hyp_create_index_v2(connection, schema_name, tbl_name, col_names, idx_name, include_cols=()):
    Create an hypothetical index on the given table

    :param connection: sql_connection
    :param schema_name: name of the database schema
    :param tbl_name: name of the database table
    :param col_names: string list of column names
    :param idx_name: name of the index
    :param include_cols: columns that needed to be added as includes
    query = f"CREATE NONCLUSTERED INDEX {idx_name} ON {schema_name}.{tbl_name} ({', '.join(col_names)}) " \
            f"INCLUDE ({', '.join(include_cols)}) WITH STATISTICS_ONLY = -1"
    cursor = connection.cursor()
    logging.info(f"Added HYP: {idx_name}")

Enabling the indexes

def hyp_enable_index(connection):
    This enables the hypothetical indexes for the given connection. This will be enabled for a given connection and all
    hypothetical queries must be executed via the same connection
    :param connection: connection for which hypothetical indexes will be enabled
    query = f'''SELECT dbid = Db_id(),
                    objectid = object_id,
                    indid = index_id
                FROM   sys.indexes
                WHERE  is_hypothetical = 1;'''
    cursor = connection.cursor()
    result_rows = cursor.fetchall()
    for result_row in result_rows:
        query_2 = f"DBCC AUTOPILOT(0, {result_row[0]}, {result_row[1]}, {result_row[2]})"

Executing the query

def hyp_execute_query(connection, query):
    This hypothetically executes the given query and return the estimated sub tree cost. If required we can add the
    operation cost as well. However, most of the cases operation cost at the top level is 0.

    :param connection: sql_connection
    :param query: query that need to be executed
    :return: estimated sub tree cost
    cursor = connection.cursor()
    cursor.execute("SET AUTOPILOT ON")
    stat_xml = cursor.fetchone()[0]
    cursor.execute("SET AUTOPILOT OFF")
    query_plan = QueryPlan(stat_xml)
    return query_plan.estimated_sub_tree_cost, query_plan.index_seeks