5 Questions I Wish I Had the Answer to Before Interviewing for Data Science Positions

Why interviewing for Data Science positions is like riding a roller-coaster.

Clément Delteil 🌱
Towards AI


Enigmatic photo of the end of a roller coaster plunging into the void
Photo by James Butterly on Unsplash

You spent hours polishing your cover letter to match the job post. You made your friends proofread it. You applied the corrections. When you finally think it’s ready, you gather your courage and click the submit button.

That’s it. 2 days passed, and still no answer. You figured it was the weekend, that’s probably why they didn’t contact you… 2 weeks passed, and still no answer.

Suddenly a company you completely forgot about contacts you. They think you’re the perfect match for the job. You get excited. Finally, somebody recognized your talent. You spend the next 5 days preparing for the interview. They told you they were going to ask questions about yourself and machine learning in general. You brush up on your lessons; you think you’re ready. Today is the day you’re going to get hired.

The first part of the interview was about yourself. Now they’re going to verify you have the technical capabilities to do the job. The first question is easy, you ace it. But then the second question comes in. You didn’t think they were going to ask anything related to boosting, as the job offer didn’t mention it.

If you were chatting with your friends, you would probably have answered the question correctly, but you’re not with your friends, and you have to handle the stress of being interviewed. I know this feeling because it happened to me a few weeks ago.

This article is a guilty list of questions I failed to answer due to stress and bad preparation. Let me reassure you: I still managed to get hired, so if you can answer the majority of the questions on the list, you’re on the right track. Keep applying!

1 — Back to basics: Bayes’ Theorem

A seismological research institute created a model to predict the occurrence of earthquakes.

The model predicts an earthquake with a probability of 80% when an earthquake is indeed imminent and with a probability of 10% when no earthquake is impending. The unconditional probability of experiencing an earthquake is 20%.

If the model predicts an earthquake, what is the probability that an earthquake will indeed occur?

The closer we get to the end of our studies, the less we use the theoretical foundations we acquired at the beginning. Bayes’ Theorem is a case in point. When you’re bombarded every day with the latest LLMs and the new trending library for time series, you start to lose sight of the basics.

All the information you need to answer this question is provided in the statement, if you remember the formula 😉.

P(A|B) = P(B|A) × P(A) / P(B)

Given this formula, you need to define what event A and event B correspond to.

  • A → An earthquake occurs
  • B → The model predicts an earthquake

Knowing this, we need to identify the information given in the statement to replace the elements in our formula.

  • P(A) = 0.20 → unconditional probability of experiencing an earthquake
  • P(B|A) = 0.80 → probability of the model predicting an earthquake when an earthquake is indeed imminent
  • P(B|¬A) = 0.10 → probability of the model predicting an earthquake when no earthquake is impending

Substituting the values into the equation:

P(A|B) = (0.80 × 0.20) / P(B)

The exercise aims to determine P(A|B), but we currently lack P(B). To calculate it, we can use the law of total probability, which states that the probability of an event can be calculated by summing over all the ways it can occur:

P(B) = P(B|A) × P(A) + P(B|¬A) × P(¬A)

P(¬A) is the probability that no earthquake occurs, the complement of A. Since there are only two possibilities (either an earthquake occurs or it doesn’t), we can calculate P(¬A) as:

P(¬A) = 1 − P(A) = 1 − 0.20 = 0.80

Substituting the values into the equation:

P(B) = (0.80 × 0.20) + (0.10 × 0.80) = 0.16 + 0.08 = 0.24

Now that we have all the elements, all we have to do is substitute them into our initial equation:

P(A|B) = 0.16 / 0.24 ≈ 0.6667

The probability that an earthquake will indeed occur, given that the model predicts an earthquake, is approximately 66.67%.
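If you want to double-check the arithmetic before an interview, here is a minimal Python sketch of the same calculation. The function name `posterior` and its parameter names are my own choices for illustration; only the three input probabilities come from the problem statement.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
    return true_positive_rate * prior / evidence


# Values from the earthquake statement.
p_earthquake_given_alarm = posterior(
    prior=0.20, true_positive_rate=0.80, false_positive_rate=0.10
)
print(f"P(A|B) = {p_earthquake_given_alarm:.4f}")  # prints P(A|B) = 0.6667
```

Playing with the prior is a good way to build intuition: the rarer the event, the less a positive prediction tells you.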

There was nothing complicated about this exercise, but you need to be able to put these equations together and be used to doing this process in your head so that you can respond quickly in an interview!

Bottom photo of a plane in flight in a pink and blue sky
Photo by Philip Myrtorp on Unsplash

2 — 100 Passengers boarding a plane…

100 passengers are boarding a plane with 100 seats. Instead of sitting in their assigned seat, the first passenger chooses a random seat. All subsequent passengers will sit in their assigned seat if it is available, or choose a random seat if not.

What is the probability that the 100th passenger will get to sit in their assigned seat?

More probabilities? Don’t fall into the trap.

This isn’t directly related to data science, but I was asked this question in an interview. Remember, you’re not just being tested on your technical skills, but also on your ability to formulate a line of reasoning orally and refine it with feedback from the interviewers.

Imagine being asked this question after the first one about earthquakes. Your head is so engrossed in probabilities that you immediately start writing equations and so on.

What if I told you that the answer to this question is simply 1/2?

This counter-intuitive result is what this question is all about. It’s a question of logic before it’s a question of probability.

Note: The following two intuitions for understanding the solution are taken from a post on StackExchange. References are given a little later.

First Intuition

The first way to get an intuition is to imagine the problem with just 5 people because it all comes down to the same thing. Choosing 100 people prevents you from seeing the implications of the problem.

Let’s consider the first passenger’s choices.

  1. If he chooses his own seat then, as stated in the problem, all subsequent passengers will sit in their assigned seats, since they are available (including the 5th or 100th passenger).
  2. If he chooses the last seat then the 5th or 100th passenger won’t get their seat.
  3. Otherwise, he chooses another seat, and the game continues.

Everyone gets their own seat up to the passenger whose seat was taken in the first round. If the first person chose seat 67, then passengers 2 through 66 sit in their rightful seats, and passenger 67 has to choose between seat 1, seat 100, or another free seat. In each case, if seat 1 is chosen, the last person gets his seat; if seat 100 is chosen, he doesn’t; and if another seat is chosen, the game continues.

In this sense, we are essentially spinning the wheel until we get either a 1 or 100, so the probability that the last person gets his seat is 1/2 [1].
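The "spin until you hit seat 1 or the last seat" argument can also be checked exactly. The sketch below is my own formulation of that recursion, not code from the cited post: `g[m]` is the probability that the last passenger ends up in the right seat when a displaced passenger picks uniformly among `m` free seats that still include seat 1 and the last seat.

```python
from fractions import Fraction


def last_passenger_probability(n_seats):
    """Exact probability that the last passenger gets their own seat."""
    if n_seats < 2:
        raise ValueError("need at least two seats")
    g = {2: Fraction(1, 2)}  # two seats left: seat 1 or the last seat
    for m in range(3, n_seats + 1):
        # With probability 1/m seat 1 is picked (success),
        # with probability 1/m the last seat is picked (failure),
        # otherwise another passenger is displaced and faces the same
        # game with k free seats, for k = 2 .. m - 1.
        g[m] = (1 + sum(g[k] for k in range(2, m))) / m
    return g[n_seats]


print(last_passenger_probability(5), last_passenger_probability(100))
# prints 1/2 1/2
```

Using `Fraction` keeps the result exact, which makes it obvious that the answer does not depend on the number of seats.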

Second Intuition

Another way to understand this problem is to observe that the fate of the last person is determined the moment either the first or the last seat is selected! This is because the last person will either get the first seat or the last seat. Any other seat will necessarily be taken by the time the last person gets to ‘choose’.

Since at each step, the first or last seat is equally probable to be taken, the last person will get either the first or last one with equal probability: 1/2 [2].
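If the argument still feels like a magic trick, you can simulate the boarding process. This is a quick Monte Carlo sketch under the rules stated in the problem; the function names, the number of trials, and the seed are arbitrary choices of mine.

```python
import random


def last_passenger_gets_seat(n_seats, rng):
    """Simulate one boarding; return True if the last passenger gets their seat."""
    free = set(range(n_seats))  # seat i belongs to passenger i
    # The first passenger picks uniformly at random among all seats.
    free.remove(rng.choice(sorted(free)))
    for passenger in range(1, n_seats - 1):
        if passenger in free:
            free.remove(passenger)  # own seat is still available
        else:
            free.remove(rng.choice(sorted(free)))  # displaced: pick at random
    # Exactly one seat is left for the last passenger.
    return (n_seats - 1) in free


def estimate(n_seats=100, trials=20_000, seed=0):
    """Estimate the probability by repeating the simulation many times."""
    rng = random.Random(seed)
    hits = sum(last_passenger_gets_seat(n_seats, rng) for _ in range(trials))
    return hits / trials


print(estimate())  # should land close to 0.5
```

Try it with 5 seats, then with 100: the estimate hovers around one half in both cases.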

Lesson

The order and direction of the questions can be designed to trick you. For each question, even if the answer seems obvious, take the time to explain your reasoning orally. You’re not just being assessed on your ability to find the right answer as quickly as possible.

So take your time before asserting anything, and discuss your solution with the interviewers.

They’ll guide you in the right direction.

3 — What is a statistical test?

We’re used to using statistical tests in our machine-learning projects. They can be used to compare averages, study correlations between certain variables, and many other things. But can you explain in concrete terms what a statistical test is?

What does the p-value represent?

The “p” stands for probability, and its value ranges from 0 to 1, but what else? In concrete terms, what does this value represent?

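It represents the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. In other words, it tells you how surprising your data would be if only chance were at work. [3]

The most frequently used significance threshold is 0.05: a result is called statistically significant when p < 0.05, meaning a result at least this extreme would occur less than 5% of the time if the null hypothesis were true. This threshold was arbitrarily defined a long time ago and has become standard practice. In some cases, a stricter threshold, such as 0.001, is chosen to demand stronger evidence before calling a result significant.

The p-value is one of several tools used to assess the statistical significance of your experiments.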
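One way to make this definition concrete is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference at least as large as the one you observed. The sketch below is only an illustration; the two samples are made-up numbers and the function name is my own.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for this example.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treated = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2, 12.6, 13.3]
print(permutation_p_value(control, treated))
```

Here the null hypothesis is "the labels don't matter", and the p-value is simply the share of shuffles that look at least as extreme as the real data.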

What are some limits of the p-value?

If you’ve managed to explain correctly what a p-value is and what it represents, you can also add a second dimension to your answer by showing that you understand the limitations of this tool. Here’s a non-exhaustive list:

  1. Sample size — The p-value is influenced by the sample size. Larger sample sizes tend to produce smaller p-values even if the effect size is relatively small. So, a small p-value doesn’t necessarily imply a large or practically significant effect.
  2. Effect size — The p-value alone doesn’t provide any information about the magnitude and importance of the observed effect. Effect-size measures such as Cohen’s d or correlation coefficients should be taken into account to understand the practical significance of the findings.
  3. Assumptions — A p-value is only valid if the assumptions of the test that produced it hold, most commonly that the observations are independent and identically distributed.
  4. Misinterpretation — The p-value isn’t the probability of the null hypothesis being true or the probability of making a Type I error. It only quantifies the strength of evidence against the null hypothesis.
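The first two points are easy to demonstrate numerically. The sketch below assumes a simplified setting of my own choosing: a two-sample z-test with equal group sizes and known unit variance, where the observed standardized difference (Cohen’s d) stays fixed at a tiny 0.05 while the sample size grows.

```python
import math


def two_sided_p_from_effect(cohens_d, n_per_group):
    """Two-sided p-value of a two-sample z-test (known unit variance)
    when the observed standardized mean difference equals cohens_d."""
    # Standard error of the difference in means is sqrt(2 / n).
    z = cohens_d * math.sqrt(n_per_group / 2)
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


for n in (100, 1_000, 10_000, 100_000):
    print(f"n per group = {n:>7}: p = {two_sided_p_from_effect(0.05, n):.4g}")
```

The effect never changes, yet the p-value crosses the 0.05 threshold somewhere between 1,000 and 10,000 observations per group. That is why you should always report an effect size next to the p-value.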

To find out more, read this article by Terence Shin.

Cute photo of a child imitating glasses around her eyes with her hands on daddy’s back
Photo by Edi Libedinsky on Unsplash

4 — Teach me a machine learning algorithm like I’m 5

Unless you work in a lab where everyone specializes in artificial intelligence, you’ll have to explain your work and how it works to “non-technical” individuals. A good way to practice is to have fun trying to explain complex machine-learning concepts to your friends.

You can start by explaining the concept of decision trees. First, choose an example of an application close to your interlocutor. Then, try to explain how a decision tree is constructed and what information can be gleaned from it.

Medical Student

Let’s say you have a friend who’s studying medicine. A classic example is determining whether or not a patient has diabetes. Suppose we have access to the following information in the patient’s file:

  1. Age — Age of the patient
  2. Gender — Gender of the patient
  3. Body Mass Index (BMI) — A measure of body fat based on height and weight
  4. Blood Pressure — The patient’s blood pressure reading
  5. Diet — The patient’s dietary habits (Healthy/Unhealthy)
  6. Exercise — The patient’s exercise routine (Active/Sedentary)

Suppose a computer has access to the recordings of thousands of encounters between a doctor and a patient where the question is whether or not the patient has diabetes. In these interviews, the doctor asks questions and carries out a few tests. Based on the answers and test results, he can make a diagnosis.

Now imagine if we could visualize those encounters as a flowchart with different questions. At the top of the tree, you’d get something like “Is your blood sugar level high?”. If the answer is “yes”, the doctor might ask “Do you have a family history of diabetes?”. But if the answer is “no”, then the doctor will ask a different question.

The doctor will keep asking questions until he reaches a final prediction. In this case, the prediction would be “You have diabetes” or “You don’t have diabetes”.

Note: The example provided is an illustration that shouldn’t be considered a definitive model for diabetes prediction.
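For a friend who is comfortable with a bit of code, the flowchart can be written down directly. The sketch below hand-codes a tiny tree as nested dictionaries. The questions on the first two levels come from the example above, while the third question, the BMI threshold, and the field names are invented for illustration and carry no medical meaning. A real decision tree would learn its questions and thresholds from data.

```python
# Each internal node asks a yes/no question; each leaf holds a prediction.
# Structure and thresholds are hypothetical, for illustration only.
TREE = {
    "question": "Is the blood sugar level high?",
    "test": lambda patient: patient["blood_sugar_high"],
    "yes": {
        "question": "Is there a family history of diabetes?",
        "test": lambda patient: patient["family_history"],
        "yes": "You have diabetes",
        "no": "You don't have diabetes",
    },
    "no": {
        "question": "Is the BMI above 30 with a sedentary routine?",
        "test": lambda patient: patient["bmi"] > 30
        and patient["exercise"] == "Sedentary",
        "yes": "You have diabetes",
        "no": "You don't have diabetes",
    },
}


def predict(node, patient):
    """Walk from the root to a leaf, recording each question and answer."""
    path = []
    while isinstance(node, dict):
        answer = "yes" if node["test"](patient) else "no"
        path.append((node["question"], answer))
        node = node[answer]
    return node, path


patient = {
    "blood_sugar_high": True,
    "family_history": True,
    "bmi": 24.0,
    "exercise": "Active",
}
prediction, path = predict(TREE, patient)
print(prediction)
for question, answer in path:
    print(f"  {question} -> {answer}")
```

The recorded path is the part worth showing to a non-technical audience: every prediction comes with the exact list of questions that led to it.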

5 — What are your first reflexes when discovering a dataset?

If you’re asked this question, you can start by mentioning all the well-known steps involved in the classic machine learning process. But, if you’ve made it this far in the recruitment process, you can be sure that everyone else knows these steps too, because these are the ones that every data scientist follows and learns from his or her very first projects.

As with other questions, you have to learn to grasp what is being asked of you.

Why are you being asked such a classic question? It’s like being asked the difference between bias and variance for a machine learning model. One possibility is simply that it allows the company to verify your knowledge. Another may be that they’re trying to identify candidates who have grasped the stakes of these stages, those who don’t jump straight into the modeling part.

In any case, you need to show that you know all these steps. To do this, you can use a recent project as an example, and take the time to explain each step of the process. This is the perfect time to put forward a project you haven’t yet mentioned that is relevant to the company you’re interviewing with.

If you’re still struggling with this process, I can recommend this article by Prasad Patil.

Conclusion

So were my questions really that difficult? Reassured?

If, after reading this article, you found them very simple, so much the better! My aim wasn’t to impress you with extremely difficult questions that you were never going to get in an interview. I may even have managed to answer some of those questions in my interviews 😉.

Remember, take your time when answering, and ask the interviewers to clarify the questions to make sure you understand what you’re being asked.

Good luck with your interviews!

References

[1] John Douma (https://math.stackexchange.com/users/69810/john-douma), Taking Seats on a Plane, URL (version: 2022-10-05): https://math.stackexchange.com/q/4545881

[2] Aryabhata (https://math.stackexchange.com/users/1102/aryabhata), Taking Seats on a Plane, URL (version: 2021-07-28): https://math.stackexchange.com/q/5596

[3] https://towardsdatascience.com/defining-the-p-value-for-everyone-9103130f4fc2

Machine Learning Engineer 🌱 | French CS Engineer | Canadian MSc in AI | Data is my anchor in exploring all realms 🌍📊 | linkedin.com/in/clementdelteil/