First Impressions. The pipeline

I became a data scientist at the beginning of January. Until then I was in Academia, and even now I don’t know exactly where I should “put” myself. So, when I introduce myself to people, I say these two things: “I was in Academia, now I am a data scientist.” In this post, I’d like to compare the work I do now to what I did as a researcher at the university.

I did a PhD in Econometrics, and most of my research was theoretical. However, I always wanted to solve “real world” problems, and the theoretical questions were frequently motivated by an empirical problem. For example, I worked on estimation and inference for dynamic econometric models. I also contributed to the literature on the estimation of macroeconomic models, in particular dynamic stochastic general equilibrium (DSGE) models.

My theoretical research pipeline looked like this: there is a (nice) problem; we check the literature for available solutions to similar problems and extend (or come up with) a solution, normally on a piece of paper, using applied probability and measure theory. We run simulations to check that the theory is correct and to compare the new method with existing ones. We make an application (sometimes it is not entirely clear why) to show how the new method changes the results of an empirical study or can be useful to answer a titillating question. Finally, we write everything down and rewrite it many times until convergence.
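To give non-econometricians an idea of what that simulation step looks like, here is a minimal sketch in Python: a Monte Carlo check of the small-sample bias of the least-squares estimator in an AR(1) model. The model, sample size and number of replications are purely illustrative, not taken from my actual research.

```python
# Illustrative Monte Carlo check of an estimator's small-sample behaviour:
# estimate the AR(1) coefficient by OLS and look at its bias across replications.
import numpy as np

rng = np.random.default_rng(0)
true_rho = 0.9
n_obs, n_reps = 100, 1000

estimates = np.empty(n_reps)
for r in range(n_reps):
    # simulate an AR(1) series y_t = rho * y_{t-1} + e_t
    e = rng.standard_normal(n_obs)
    y = np.zeros(n_obs)
    for t in range(1, n_obs):
        y[t] = true_rho * y[t - 1] + e[t]
    # OLS estimate of rho from regressing y_t on y_{t-1}
    estimates[r] = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

print(f"mean estimate: {estimates.mean():.3f} (true value {true_rho})")
print(f"small-sample bias: {estimates.mean() - true_rho:.3f}")
```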

My work as a data scientist so far is a bit different. The start is similar: there is a problem (normally an applied question) or a request to find something in the data. But the differences are clear from the beginning.

First, I get the data. It is easy when the data are 5 time series with 230 observations each, but it is slightly more challenging when there are 19 million observations for 2.5 million users and the data file is several gigabytes. Typically this step involves uploading the dataset to an online database and downloading it from there.
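To give a concrete (and entirely hypothetical) flavour of this step, the sketch below reads a large CSV in chunks with Pandas instead of loading it all at once; the file name and column names are made up for illustration.

```python
# Hypothetical example: process a multi-gigabyte CSV in chunks.
import pandas as pd

user_counts = {}
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    # count observations per user within each chunk and accumulate
    for user_id, n in chunk["user_id"].value_counts().items():
        user_counts[user_id] = user_counts.get(user_id, 0) + n

print(f"{len(user_counts)} users, {sum(user_counts.values())} observations")
```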

Then I check the internet to find solutions to similar problems (algorithms or models), install the relevant packages and figure out how to use them. This part is usually done in R and Python, since they are free and open source.

Next, I need to clean the data and construct the features (variables). I use Rax, the tool we developed at Coders Co. Other options include Pandas, Spark, or plain R and Python, although some of them are only practical when the datasets are small.
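Rax has its own syntax, so to keep things generic here is a hedged sketch of what feature construction can look like in Pandas; the event data and column names are invented, not from a real project.

```python
# Hypothetical feature construction with Pandas: aggregate raw events
# into one row of features per user. Column names are invented.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(
        ["2016-01-01", "2016-01-03", "2016-01-02", "2016-01-05", "2016-01-09"]),
    "amount": [10.0, 12.5, 3.0, 7.5, 4.0],
})

features = events.groupby("user_id").agg(
    n_events=("amount", "size"),
    total_amount=("amount", "sum"),
    first_seen=("timestamp", "min"),
    last_seen=("timestamp", "max"),
)
# a simple derived feature: how many days the user has been active
features["days_active"] = (features["last_seen"] - features["first_seen"]).dt.days

print(features)
```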

Finally, I train (estimate) models, run model specification tests and compare several models. Often this requires selecting some tuning parameters using rules of thumb. This part is again fairly different from Academia. In scientific articles, every step you take must be well motivated and “correct” (whatever that means). Industry demands results above everything, sometimes sacrificing precision, sometimes even using methods that make no sense.
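For concreteness, here is a rough sketch of what “train, tune and compare” often boils down to in practice, using scikit-learn on simulated data; the models and the tuning grid are illustrative, not a recipe I follow on every project.

```python
# Hypothetical model comparison: fit two classifiers with cross-validation
# and pick tuning parameters from a small rule-of-thumb grid. Data are simulated.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# baseline: logistic regression, no tuning
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# random forest with a small grid for the main tuning parameters
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
grid.fit(X, y)

print(f"logistic regression accuracy: {baseline.mean():.3f}")
print(f"best random forest accuracy:  {grid.best_score_:.3f} with {grid.best_params_}")
```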

Moreover, the theoretical considerations that are critical in econometric research are often not useful here. Sure, knowing how the methods are constructed helps to understand the “black box”, but small-sample properties matter less given how big the datasets are. Also, correct specification is less of an issue when the goal is to predict rather than to test a hypothesis.

Having said that, I arrive at the output of my work as a data scientist. The results of my research are figures, tables and reports, so that part is quite similar to Academia, although it doesn’t involve two years of revisions.

In the next post, I will discuss the software I use and compare it to the software I used in Academia.