First Impressions. The pipeline

I became a data scientist at the beginning of January. Until then I was in Academia, and even now I don’t know exactly where I should “put” myself. So, when I introduce myself to people, I say these two things: “I was in Academia, now I am a data scientist.” In this post, I’d like to compare the work I do now with what I did as a researcher at the university.

I did a PhD in Econometrics, and most of my research was theoretical. That said, I always wanted to solve “real world” problems, and the theoretical questions were frequently motivated by an empirical problem. For example, I worked on the estimation and inference of dynamic econometric models. I also contributed to the literature on the estimation of macroeconomic models, in particular dynamic stochastic general equilibrium (DSGE) models.

For me, the theoretical research pipeline looked like this. There is a (nice) problem. We check the literature for available solutions to similar problems and extend one (or come up with a new one), normally on a piece of paper, using applied probability and measure theory. We run simulations to check whether the theory is correct and to compare it with existing methods. We add an application (sometimes it is not clear why) to show how the new method changes the results of an empirical study or helps to answer a titillating question. Finally, we write everything down and rewrite it many times until convergence.

My work as a data scientist so far is a bit different. The start is similar: there is a problem (normally an applied question) or a request to find something in the data. But the differences are clear from the beginning.

First, I get the data. It is easy when the data are 5 time series with 230 observations each, but it is slightly more challenging when there are 19 million observations for 2.5 million users and the data file is several GB. Typically this step requires uploading a dataset to, or downloading it from, an online database.

Then, I check the internet for solutions to similar problems (algorithms or models). I install the relevant packages and figure out how to use them. This part is usually done in R and Python, since they are free and open-source.

Next, I need to clean the data and construct the features (variables). I use Rax, a tool we developed at Coders Co. Alternatives include Pandas, Spark, or plain R and Python. Some of them can only be used if the datasets are small.

Finally, I train (estimate) models, run model specification tests and compare several models. This often requires selecting some tuning parameters using rules of thumb. This part is again fairly different from Academia. In scientific articles, everything needs to be well motivated, and the steps you take must be “correct” (whatever this means). Industry demands results above everything, sometimes sacrificing precision, sometimes even using methods that make no sense.

Moreover, theoretical considerations that are critical in econometric research are often not useful. Sure, it helps to know how the methods are constructed in order to understand the “black box”, but small-sample properties matter little given how big the data sets are. Also, correct specification is less of an issue when the goal is to predict rather than to test a hypothesis.

Having said that, I arrive at the output of my work as a data scientist. The results of my research are figures, tables and reports. So, that part is quite similar to Academia, although it doesn’t involve 2 years of revision.

In the next post, I will discuss the software I use and compare it to the software I used in Academia.

Kick out your passengers fairly

The Problem

At Coders Co. we make the world a better place by helping airlines to kick out their passengers in a fair way.

We have all heard that using bad algorithms generates a very wrong kind of buzz on social media. Fortunately, it doesn’t have to be this way. In the following lines, we provide a solution that lets you kick out your passengers truly at random, giving everyone the same chance of being thrown out. There is more: the algorithm can take exceptional cases into account (a wealthy client who just bought a ticket, a 150-kilo athlete who might not be too easy to drag); you only need to know where the people you make exceptions for sit.

We start with Figure 1, which shows a scheme of a typical plane with seats and rows. We assume you know where each passenger sits. We first solve a couple of fairly simple cases to build up intuition and then present a more general solution. For now, we assume there are no exceptions; we will cover other cases later.

Figure 1. Typical plane scheme

Only individuals

Assume that we have checked in N people and need to kick out k of them. The simplest case is when all our clients travel on their own. In that case, we can randomly pick k of them, and that will be fair. For example, we can write down the names of all the passengers, put them into a hat and draw k names, as in Figure 2.
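
The draw from the hat can be sketched in a few lines of Python. This is only a minimal illustration: the function name `draw_from_hat`, the passenger labels and the `seed` parameter are made up for the example, and it assumes every passenger travels alone.

```python
import random


def draw_from_hat(passengers, k, seed=None):
    """Pick k distinct passengers uniformly at random (everyone travels alone)."""
    if not 0 <= k <= len(passengers):
        raise ValueError("k must be between 0 and the number of passengers")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # random.sample draws without replacement, so nobody is picked twice
    return rng.sample(passengers, k)


checked_in = ["A", "B", "C", "D", "E", "F"]  # hypothetical passenger list
unlucky = draw_from_hat(checked_in, k=2, seed=42)
print(unlucky)
```

Because the sample is drawn without replacement, every passenger has the same chance k/N of ending up on the list.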

 

Figure 2. Draw from a hat like this

Individuals and couples

Now suppose we also have some couples that travel together. For simplicity, assume that the total number of checked-in passengers is N=5: three individual travellers, A, B and C, and one couple, D and E. Suppose we need to kick out k=2 people. We assume that D and E can only be kicked out together, since they are travelling together and we want to treat our customers well.

We can pick 2 people from the individuals, and there are 3 possible combinations: AB, AC and BC. Alternatively, we can kick out the only natural couple we have, DE. Now it is time to write the possible pairs on pieces of paper and put them in the hat. The only catch is that, to make sure every person has the same chance of being picked, we need to put the pair DE into the hat twice. The reason is that each of A, B and C is a member of two possible combinations and so appears in the hat on two pieces of paper. Therefore, the hat contains the following pieces of paper: AB \times 1, AC \times 1, BC \times 1, DE \times 2. Each passenger then appears on 2 of the 5 pieces of paper.
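
The composition of the hat can be checked with a short Python sketch. The helper names `build_hat` and `selection_probabilities` are hypothetical, and the sketch assumes that exactly two people are kicked out, that couples are never split, and that there are at least two individual travellers. Each couple goes into the hat as many times as an individual appears in pairs.

```python
from fractions import Fraction
from itertools import combinations
import random


def build_hat(individuals, couples):
    """Return the slips of paper for kicking out exactly two people."""
    n = len(individuals)
    if n < 2:
        raise ValueError("this sketch assumes at least two individual travellers")
    # every pair of individuals goes into the hat once
    slips = list(combinations(individuals, 2))
    # each individual is on n - 1 slips, so each couple is added n - 1 times
    for couple in couples:
        slips.extend([tuple(couple)] * (n - 1))
    return slips


def selection_probabilities(slips):
    """Chance of each person being on a slip drawn uniformly from the hat."""
    people = sorted({person for slip in slips for person in slip})
    return {
        person: Fraction(sum(person in slip for slip in slips), len(slips))
        for person in people
    }


hat = build_hat(["A", "B", "C"], [("D", "E")])
probs = selection_probabilities(hat)
kicked_out = random.choice(hat)  # one uniform draw from the hat
print(hat)
print(probs)
```

Using exact fractions rather than floats makes it easy to see that all five passengers end up with the same probability.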

Now suppose we have checked in N passengers, among them p couples, and we need to kick out 3 people. We can use similar logic to solve this case. There are N-p\times2 individuals. If we only pick from individuals, each of them is a member of (N-p\times2-1)\times (N-p\times2-2)/2 triplets, because the other two members come from the remaining N-p\times2-1 individuals and their order does not matter. On top of that, an individual can join any one of the p couples to form a triplet. A member of a couple, in turn, forms a triplet with any one of the N-p\times2 individuals. To make sure everybody has the same chance of being selected, we need to balance the number of times triplets built from individuals are in the hat against the number of times triplets built from an individual and a couple are in the hat.

Suppose that each triplet built from individuals only is in the hat once and each triplet built from an individual and a couple is in the hat W times. Then each individual is there

    \begin{equation*} \frac{(N-p\times2-1)\times (N-p\times2-2)}{2}+p\times W \:\: \text{times,} \end{equation*}

whereas each member of a couple is there

    \begin{equation*} (N-p\times2)\times W \:\: \text{times.} \end{equation*}

To ensure that everybody has the same chance of being selected, those two numbers should be the same. So we get

    \begin{equation*} (N-p\times2)\times W=\frac{(N-p\times2-1)\times (N-p\times2-2)}{2}+p\times W. \end{equation*}

Moving the terms that contain W to the left-hand side and the rest to the right-hand side, we obtain

    \begin{equation*} (N-p\times3)\times W = \frac{(N-p\times2-1)\times (N-p\times2-2)}{2} . \end{equation*}

Solving for W (assuming N>p\times3) yields

    \begin{equation*} W = \frac{(N-p\times2-1)\times (N-p\times2-2)}{2\times(N-p\times3) } . \end{equation*}

In the example above (N=5, p=1) this gives W=1/2. Since only the ratio matters, we can put ABC into the hat twice and each of ADE, BDE and CDE once; every passenger then appears on 3 of the 5 pieces of paper.
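
The balancing weight can also be found by brute force, which is a useful sanity check on the algebra. The Python sketch below enumerates all triplets, counts how often one individual and one couple member appear, and solves the fairness condition for W. The function names `triplet_weight` and `weighted_counts` are hypothetical; the sketch assumes couples are never split and that at least one individual and one couple are present.

```python
from fractions import Fraction
from itertools import combinations


def triplet_weight(individuals, couples):
    """Count triplets by brute force and solve the fairness condition for W."""
    solo_triplets = list(combinations(individuals, 3))
    mixed_triplets = [(person, *couple) for person in individuals for couple in couples]

    # By symmetry it is enough to look at one individual and one couple member.
    someone, partner = individuals[0], couples[0][0]
    a = sum(someone in t for t in solo_triplets)   # slips with weight 1
    b = sum(someone in t for t in mixed_triplets)  # slips with weight W
    c = sum(partner in t for t in mixed_triplets)  # slips with weight W

    # Fairness requires a + b * W == c * W.
    if a == 0 or c <= b:
        raise ValueError("no positive weight W balances this passenger list")
    return Fraction(a, c - b), solo_triplets, mixed_triplets


def weighted_counts(solo_triplets, mixed_triplets, w):
    """Total weight of the slips each person appears on."""
    counts = {}
    for triplet in solo_triplets:
        for person in triplet:
            counts[person] = counts.get(person, Fraction(0)) + 1
    for triplet in mixed_triplets:
        for person in triplet:
            counts[person] = counts.get(person, Fraction(0)) + w
    return counts


w, solo, mixed = triplet_weight(["A", "B", "C"], [("D", "E")])
counts = weighted_counts(solo, mixed, w)
print(w, counts)
```

A fractional W is not a problem in practice: multiplying every slip count by the denominator of W gives whole numbers of slips with the same selection probabilities.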

Concluding Remarks

In general, if we have N checked-in passengers, p couples and k people to kick out, we can use similar logic. For k=2, for instance, there are only N-p\times2 individuals, and each can be part of N-p\times2-1 different pairs (one with every other individual). We can also consider cases of 4 people travelling together (families with kids).

If you or your company struggle with a similar problem, contact me or us at Coders Co. We might not help you kick out your passengers, but we can develop software that picks them randomly in a fair way.