
I Analyzed 2k Data Scientist and Data Engineer Jobs and This is What I Found

Leverage your Python Skills to Understand the Skills and Requirements of your Dream Job

Khuyen Tran
Published in Towards AI
6 min read · Jul 30, 2021


Motivation

Have you ever wondered how the job requirements for data scientists differ from those for data engineers? Instead of reading through many job postings to figure that out, why not use a tool to get the descriptions of all data scientist and data engineer jobs at once?

That is when Diffbot comes in handy. In this article, you will learn how to extract 2k jobs related to data scientists and data engineers in one click, and visualize the difference in keywords between these 2 jobs using a scatter plot.

Image by Author

What is Diffbot?

Diffbot is a tool that lets you search a Knowledge Graph of a trillion connected facts extracted from across the web, with results typically returned in under a second.

To start with Diffbot, sign up for a 14-day free trial. This free trial will give you access to search the world’s largest Knowledge Graph and allow you to extract data from any webpage without creating rules.

Then click the Search option to search the Diffbot Knowledge Graph.

Image by Author

Once you see the search page, start by selecting an entity. Since we want to search for jobs, we will choose the entity Job:

Image by Author

Next, choose the jobs whose title contains the words data scientist and click Search.

Image by Author

You should see results like the ones below in under a second!

Image by Author

Better yet, we can download the CSV data of the search results by clicking the CSV button in the top right of the screen:

Image by Author

Review the data, then click Export to export the CSV file.

Image by Author

Repeat the steps above to get the search results for the title data engineer.

You can also access the data I downloaded from Diffbot using gdown:

pip install gdown
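Here is a minimal sketch of the download step. The file IDs of the shared data are not reproduced here, so `FILE_ID` is a placeholder you would replace with the ID from the shared Google Drive link, and the output file name is an assumption.

```python
def drive_url(file_id: str) -> str:
    """Build the direct-download URL that gdown accepts for a Drive file ID."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_csv(file_id: str, output: str) -> str:
    """Download one shared Drive file to `output` and return the saved path."""
    import gdown  # third-party: pip install gdown

    return gdown.download(drive_url(file_id), output, quiet=False)


# Placeholder, not a real ID: replace it with the ID from the shared link.
FILE_ID = "YOUR_FILE_ID"
# download_csv(FILE_ID, "data_scientist.csv")  # output name is an assumption
```

The download call is left commented out so the sketch does nothing until you supply a real ID.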

Analyze and Process the Data

Start with reading the data:
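A minimal sketch of this step with pandas is shown here. The file names `data_scientist.csv` and `data_engineer.csv` are assumptions about how you saved the exports, and the single row in `sample_csv` is made up so that the snippet runs without the real files.

```python
import io

import pandas as pd

# Columns described in this article for the Diffbot job export.
EXPECTED_COLUMNS = ["name", "pageUrl", "requirements", "summary", "tasks", "text", "title"]


def load_jobs(csv_source) -> pd.DataFrame:
    """Read a Diffbot CSV export from a path or a file-like object."""
    return pd.read_csv(csv_source)


# With the real exports (file names are assumptions):
# data_scientist = load_jobs("data_scientist.csv")
# data_engineer = load_jobs("data_engineer.csv")

# Made-up one-row stand-in so the sketch runs on its own.
sample_csv = io.StringIO(
    "name,pageUrl,requirements,summary,tasks,text,title\n"
    "Data Scientist,https://example.com/jobs/1,Python and statistics,"
    "Short summary,Build models,Full job text,Data Scientist\n"
)
data_scientist = load_jobs(sample_csv)
print(data_scientist.head())
```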

Your data_scientist DataFrame should look like below:

Messy data from different websites is organized into name, pageUrl, requirements, summary, tasks, text, and title. How cool is that?

Next, we set the title column to a label that specifies whether the job is looking for a data scientist or a data engineer, then combine the two DataFrames into one:
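One way to do this is sketched here. The label strings and the use of `pd.concat` to stack the two DataFrames are assumptions, and the rows are made up for illustration.

```python
import pandas as pd


def combine_jobs(data_scientist: pd.DataFrame, data_engineer: pd.DataFrame) -> pd.DataFrame:
    """Label each DataFrame with its job title, then stack them into one."""
    labeled = [
        data_scientist.assign(title="data scientist"),
        data_engineer.assign(title="data engineer"),
    ]
    return pd.concat(labeled, ignore_index=True)


# Made-up rows standing in for the two Diffbot exports.
data_scientist = pd.DataFrame({"title": ["Senior Data Scientist"], "text": ["Build models"]})
data_engineer = pd.DataFrame({"title": ["Data Engineer II"], "text": ["Build pipelines"]})

jobs = combine_jobs(data_scientist, data_engineer)
print(jobs["title"].value_counts())
```

Using `assign` returns labeled copies, so the two original DataFrames are left unchanged.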

Next, filter out the jobs without any text:
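A short sketch of the filter, assuming that a job "without any text" is one whose text column is missing. The rows are made up.

```python
import pandas as pd


def drop_jobs_without_text(jobs: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows whose `text` column is not missing."""
    return jobs.dropna(subset=["text"]).reset_index(drop=True)


# Made-up rows: the second job has no text.
jobs = pd.DataFrame(
    {
        "title": ["data scientist", "data engineer"],
        "text": ["Build models", None],
    }
)
jobs = drop_jobs_without_text(jobs)
print(len(jobs))
```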

Top Page Sources

What are the most frequent page sources? Start with extracting page sources using yarl:

pip install yarl
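The extraction could look roughly like the sketch below. The new column name `source` and the example URLs are assumptions. If yarl is not installed, the sketch falls back to the standard library's `urllib.parse`, which returns the same host for ordinary URLs.

```python
import pandas as pd

try:
    from yarl import URL  # third-party: pip install yarl

    def host_of(url: str):
        """Return the host part of a URL using yarl."""
        return URL(url).host

except ImportError:
    from urllib.parse import urlsplit

    def host_of(url: str):
        """Standard-library fallback, used only when yarl is missing."""
        return urlsplit(url).hostname


# Made-up page URLs; the last row has no URL at all.
jobs = pd.DataFrame(
    {"pageUrl": ["https://jobs.example.com/view/1", "https://example.org/careers/2", None]}
)

# na_action="ignore" skips missing URLs instead of passing them to host_of.
jobs["source"] = jobs["pageUrl"].map(host_of, na_action="ignore")
print(jobs)
```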

Then visualize the top 20 pages:
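A possible sketch of the counting and plotting step. It assumes the hosts were stored in a column named `source` (a hypothetical name) and uses a pandas horizontal bar chart, which may differ from the chart type used for the original figure. The hosts are made up.

```python
import pandas as pd


def top_sources(jobs: pd.DataFrame, n: int = 20) -> pd.Series:
    """Count jobs per page source and keep the n most frequent."""
    return jobs["source"].value_counts().head(n)


def plot_top_sources(counts: pd.Series):
    """Draw a horizontal bar chart of the counts (needs matplotlib)."""
    import matplotlib.pyplot as plt

    # Sort ascending so the most frequent source ends up at the top.
    ax = counts.sort_values().plot.barh(figsize=(8, 6))
    ax.set_xlabel("Number of jobs")
    ax.set_ylabel("Page source")
    plt.tight_layout()
    return ax


# Made-up hosts standing in for the real page sources.
jobs = pd.DataFrame({"source": ["a.example", "a.example", "b.example", "a.example", "b.example", "c.example"]})
counts = top_sources(jobs, n=2)
print(counts)
# plot_top_sources(counts)  # uncomment to draw the chart
```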

Visualize Text

To get an understanding of common words in jobs related to data scientists and data engineers, we will visualize the phrases in the text column. Start with cleaning the text using texthero:

pip install texthero
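A sketch of the cleaning step, assuming texthero's default `hero.clean` pipeline is what is applied to the text column. The fallback branch is not texthero: it is a rough pandas-only stand-in (lowercasing and stripping non-letters) that only runs when texthero is not installed, and it does not remove stopwords.

```python
import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Clean raw job text with texthero, or a rough stand-in if it is missing."""
    try:
        import texthero as hero  # third-party: pip install texthero

        return hero.clean(series)
    except ImportError:
        # Approximation only: lowercase, keep letters, collapse whitespace.
        return (
            series.fillna("")
            .str.lower()
            .str.replace(r"[^a-z\s]", " ", regex=True)
            .str.replace(r"\s+", " ", regex=True)
            .str.strip()
        )


# Made-up job text.
jobs = pd.DataFrame({"text": ["Big DATA, Spark & Kafka!"]})
jobs["clean_text"] = clean_text(jobs["text"])
print(jobs["clean_text"].iloc[0])
```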

Then visualize all text using wordcloud:

pip install wordcloud
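A minimal sketch of the word cloud step. The image size, background color, and output file name are assumptions, and the text is made up.

```python
import pandas as pd


def join_text(series: pd.Series) -> str:
    """Concatenate all non-missing job texts into one long string."""
    return " ".join(series.dropna().astype(str))


def save_wordcloud(text: str, output_path: str = "jobs_wordcloud.png"):
    """Render a word cloud for `text` and save it as an image."""
    from wordcloud import WordCloud  # third-party: pip install wordcloud

    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    cloud.to_file(output_path)
    return cloud


# Made-up cleaned text standing in for the real column.
texts = pd.Series(["big data pipelines", None, "data analysis"])
all_text = join_text(texts)
# save_wordcloud(all_text)  # uncomment to write jobs_wordcloud.png
```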

Cool! As we would expect from job descriptions for data engineers and data scientists, phrases such as ‘data engineer’, ‘data scientist’, ‘big data’, ‘data analysis’, and ‘data driven’ are very common.

What is the Difference Between Data Scientist and Data Engineer’s Requirements?

What we are really interested in is the difference between the job requirements of data scientists and data engineers. That could easily be found with scattertext.

Start with installing scattertext:

pip install scattertext

Then process the requirements column:

Build a corpus:
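The two steps above (processing the requirements column and building a corpus) could be sketched as follows. The processing shown here, dropping jobs with missing or empty requirements, is an assumption about what is needed, and `st.whitespace_nlp_with_sentences` is scattertext's lightweight tokenizer, chosen here so the sketch does not need a spaCy model. The rows are made up.

```python
import pandas as pd


def prepare_requirements(jobs: pd.DataFrame) -> pd.DataFrame:
    """Drop jobs whose requirements are missing or blank."""
    kept = jobs.dropna(subset=["requirements"]).copy()
    kept["requirements"] = kept["requirements"].astype(str).str.strip()
    return kept[kept["requirements"] != ""].reset_index(drop=True)


def build_corpus(jobs: pd.DataFrame):
    """Build a scattertext corpus with `title` as the category column."""
    import scattertext as st  # third-party: pip install scattertext

    return st.CorpusFromPandas(
        jobs,
        category_col="title",
        text_col="requirements",
        nlp=st.whitespace_nlp_with_sentences,
    ).build()


# Made-up rows: only the first one has usable requirements.
jobs = pd.DataFrame(
    {
        "title": ["data scientist", "data engineer", "data engineer"],
        "requirements": ["statistics and python", None, "   "],
    }
)
prepared = prepare_requirements(jobs)
# corpus = build_corpus(prepared)  # run this on the full dataset
```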

Next, we get the scaled F-score of each term in each category.
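A sketch of the scoring step, assuming `corpus` is a scattertext corpus whose categories are the labels "data scientist" and "data engineer". The score column names and the helper functions are hypothetical, and the tiny DataFrame at the end uses made-up scores to show the ranking helper on its own.

```python
import pandas as pd


def add_scaled_f_scores(corpus, categories) -> pd.DataFrame:
    """Return the term-frequency table with one scaled F-score column per category."""
    term_freq_df = corpus.get_term_freq_df()
    for category in categories:
        term_freq_df[f"{category} score"] = corpus.get_scaled_f_scores(category)
    return term_freq_df


def top_terms(term_freq_df: pd.DataFrame, category: str, n: int = 30) -> list:
    """List the n terms with the highest score for one category."""
    ranked = term_freq_df.sort_values(f"{category} score", ascending=False)
    return ranked.index[:n].tolist()


# With a real corpus:
# term_freq_df = add_scaled_f_scores(corpus, ["data scientist", "data engineer"])
# print(top_terms(term_freq_df, "data scientist"))

# Made-up scores, only to demonstrate the ranking helper.
demo = pd.DataFrame(
    {"data scientist score": [0.9, 0.1, 0.5]},
    index=["statistics", "kafka", "python"],
)
print(top_terms(demo, "data scientist", n=2))
```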

Get the terms with the highest data scientist F-scores (first list) and the terms with the highest data engineer F-scores (second list):

['quantitative', 'statistics', 'r', 'machine', 'analysis', 'field', 'science', 'ability', 'computer', 'mathematics', 'analytics', 'work', 'techniques', 'degree', 'research', 'engineering', 'business', 'mining', 'environment', 'time', 'knowledge', 'python', 'experience', 'problems', 'skills', 'years', 'math', 'team', 'languages', 'bachelor']
['apache', 'kafka', 'self', 'implement', 'spark', 'dimensional', 'design', 'scala', 'storm', 'excellent', 'stream', 'building', 'java', 'g', 'operating', 'management', 'flink', 'growth', 'processing', 'hands', 'others', 'mapreduce', 'aws', 'pipelines', 'sets', 'datasets', 'e', 'perfect', 'enemy', 'ambiguity']

Aha! The terms that are most associated with the data scientist title and the data engineer title look right.

Looking at the text alone is boring. Let’s visualize the difference in terms between these two job titles using a scatter plot.
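A minimal sketch of how the interactive plot can be produced with scattertext's `produce_scattertext_explorer`. It assumes `corpus` is a scattertext corpus built from the requirements column and that the category label for data scientist jobs is "data scientist". The plot width is also an assumption.

```python
OUTPUT_FILE = "data_science_vs_data_engineer_requirements_terms.html"


def write_explorer(corpus, output_file: str = OUTPUT_FILE) -> str:
    """Render the scattertext explorer to an HTML file and return its path."""
    import scattertext as st  # third-party: pip install scattertext

    html = st.produce_scattertext_explorer(
        corpus,
        category="data scientist",  # label used in the corpus (assumed)
        category_name="Data Scientist",
        not_category_name="Data Engineer",
        width_in_pixels=1000,
    )
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(html)
    return output_file


# write_explorer(corpus)  # run this once the corpus has been built
```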

After running the code above, open the file data_science_vs_data_engineer_requirements_terms.html, and you should see something like the plot below:

Image by Author

Cool! Click here to explore the plot above yourself.

Explanations for the plot above:

  • The more blue a dot is, the more it is associated with the title Data Scientist
  • The more red a dot is, the more it is associated with the title Data Engineer
  • Terms in the bottom-right corner are high in data engineer frequency and low in data scientist frequency
  • Terms in the upper-left corner are high in data scientist frequency and low in data engineer frequency
  • Characteristic terms are the terms most associated with the whole set of job requirements (data scientist and data engineer together) compared with general English text

Did you see the general patterns?

  • Requirements for data scientists seem to focus heavily on math, statistics, quantitative science, visualization, research, master and PhD degrees, etc.
  • Requirements for data engineers seem to focus heavily on data engineering tools such as Apache Kafka, Scala, Spark, NoSQL, Hive, and Hadoop.
  • Python is common among both data scientists and data engineers.

To know which documents contain a particular term, click that term in the scatter plot.

GIF by Author

Compare Data Scientist and Data Engineer Tasks

We can apply the same method to the tasks column to compare the tasks of data scientists and data engineers.

Below is the result:

Image by Author

You can explore the plot yourself here. Do you see the difference in terms between the requirements and tasks of these two professions?

Conclusion

Congratulations! You have just learned how to scrape job descriptions related to data engineers and data scientists using Diffbot and how to create a scatter plot that compares these two jobs using scattertext.

Getting a job that fits your skills and interests requires you to understand its expectations and tasks. Since it doesn’t take much effort to get insights like the ones above with Diffbot and Python, why not give it a try?

Feel free to fork and play with the code for this article in this repo here:
