The aim of this guide is to provide the “delta” between a software engineering interview and a data science interview, for a mathematically-oriented researcher. This guide is suitable for researchers well-versed in some combination of statistics, probability, machine learning, optimization, and related areas.
If you don’t already have software engineering interview skills, I recommend also looking at guides for building those skills. Almost all of those skills (writing code live, designing algorithms, etc.) and good interviewing practices (asking good questions, showing interest, communicating well) still apply here, except that there will be fewer computer systems questions and less emphasis on software engineering best practices.
Anyway, here is how I prepared for my data science interviews.
Step 0: what is data science?
Know your audience. For context, it helps to know what exactly people mean when they talk about data science. Reading the top articles from Google is sufficient, and I’ve provided some here.
Step 1: imagine the problems that the company will ask you
What questions would the company like to answer with their data? What kind of data do they have? How does the data relate to the company’s mission / bottom line / target audience? How can their data affect their day-to-day operations? How can it affect their longer-term business decisions? What kinds of methods would you use to solve these problems? What kinds of visualizations would be appropriate? How do some of these questions then translate into products or features?
Step 2: pseudo-structured wandering through wikipedia++
The goal here is primarily to familiarize oneself with the terminology of data mining, with an emphasis on the high-level concepts and what people tend to use in practice. I’ve listed a bunch of topics I reviewed below, and the goal is to develop the following for each area:
- Have an intuitive understanding of (and a way to explain) the concept / technique / method
- Be able to write the (main) relevant equations; know the key properties
- Have an example at hand to demonstrate usage/understanding
- Understand when it is used and why; know related tools and when it is better to use one vs the other; what are considered “good” values and how to determine/assess/validate them
Topics
- Data mining: [overview of techniques, data dredging]
- Logistic regression: [main article, R^2, covariance, multinomial logit model, the minute details]
- Hypothesis testing [p-value, type I & type II errors, Wald test, likelihood ratio test]
- Data centering: [src]
- Time series analysis: [main article, stochastic processes, ARMA models]
- Fourier analysis, Fourier transform, FFT, Nyquist: [aka spectrum analysis, power of a signal]
- Probability theory: [CLT, random variables, convergence of RV, more convergence]
- SVMs [derivation, perceptron]
- Decision trees: [random forests, bootstrap, bootstrap aggregation]
- Collaborative filtering
- Probability problems
- PCA/SVD
- Stochastic gradient descent
- Data visualization
- Inverse covariance
- Taylor expansion / Maclaurin series
- HMMs
- Geospatial prediction/analysis methods
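To make the “have an example at hand” goal concrete for one topic from the list, here is a minimal sketch of the bootstrap, used to estimate the standard error of a sample mean and compared against the textbook formula s / sqrt(n). Everything specific here is an assumption for illustration: the function name `bootstrap_se_of_mean`, the exponential toy data, the sample size, and the number of resamples.

```python
import math
import random
import statistics


def bootstrap_se_of_mean(data, n_resamples=2000, seed=0):
    """Estimate the standard error of the sample mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # draw n points with replacement
        means.append(sum(resample) / n)
    # The spread of the resampled means approximates the standard error
    return statistics.stdev(means)


# Hypothetical data: 200 draws from a skewed (exponential) distribution
data_rng = random.Random(42)
data = [data_rng.expovariate(1.0) for _ in range(200)]

se_bootstrap = bootstrap_se_of_mean(data)
se_formula = statistics.stdev(data) / math.sqrt(len(data))
print(f"bootstrap SE: {se_bootstrap:.4f}, formula SE: {se_formula:.4f}")
```

The two numbers should land close to each other. For the mean the formula is the easier route, so the bootstrap earns its keep mostly for statistics with no simple closed-form standard error, such as the median.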
Thesaurus
- Learning rate (machine learning) == step size (convex optimization)
- Spectrum analysis == frequency domain analysis == spectral density estimation
- Predictive analytics == predictive modeling and forecasting
- Multinomial logistic regression == softmax regression == multinomial logit == maximum entropy (MaxEnt) classifier == conditional maximum entropy model
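One way to convince yourself of the last equivalence is to check that softmax regression with two classes collapses to ordinary logistic regression: the softmax probability of class 1 equals the sigmoid of the difference between the two class scores. The sketch below checks this numerically; the scores `s0` and `s1` are arbitrary made-up values, and the helper names are my own.

```python
import math


def sigmoid(z):
    """Logistic function used in binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Softmax over a list of class scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for two classes
s0, s1 = 0.3, 1.7

p_softmax = softmax([s0, s1])[1]   # P(class 1) under two-class softmax
p_logistic = sigmoid(s1 - s0)      # P(class 1) under logistic regression

print(abs(p_softmax - p_logistic) < 1e-12)
```

Only the difference between the scores matters, which is also why a K-class softmax model has one redundant set of weights.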
Step 3: okay, what do other people do for interview prep?
Additionally, there is an explosion of data science books (e.g. this or this) and blogs that I’m sure are also very useful for data science interviews.
If you are a grad student in a technical field, leave a comment with your interview preparation techniques!