I Analyzed 2k Data Scientist and Data Engineer Jobs and This is What I Found
Leverage your Python Skills to Understand the Skills and Requirements of your Dream Job
Motivation
Have you ever wondered how the job requirements for data scientists and data engineers differ? Instead of reading through many job postings to figure that out, why not use a tool to get the descriptions of all data scientist and data engineer jobs at once?
That is when Diffbot comes in handy. In this article, you will learn how to extract 2k jobs related to data scientists and data engineers in one click, and visualize the difference in keywords between these 2 jobs using a scatter plot.
What is Diffbot?
Diffbot is a tool that lets you search a Knowledge Graph of a trillion connected facts extracted from across the web, with results returned in one click, usually in less than a second.
To start with Diffbot, sign up for a 14-day free trial. This free trial will give you access to search the world’s largest Knowledge Graph and allow you to extract data from any webpage without creating rules.
Then click the Search option to search the Diffbot Knowledge Graph.
Once you see the search page, start with selecting an entity. Since we want to search for jobs, we will choose the entity Job:
Next, choose the jobs whose title contains the words data scientist and click Search.
And you should see something like below in less than 1s!
Better yet, we can download the CSV data of the search results by clicking the CSV button in the top right of the screen:
Review the data then click Export to export the CSV file.
Repeat the step above to get the search results of the title data engineer.
You can also access the data I downloaded from Diffbot using gdown:
pip install gdown
Analyze and Process the Data
Start with reading the data:
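The loading code is not reproduced here, so the following is a minimal sketch with pandas. The helper name load_jobs and the file names in the comments are assumptions for illustration; use whatever names you gave your exports.

```python
import pandas as pd

# Columns the article says each Diffbot export contains.
EXPECTED_COLUMNS = ["name", "pageUrl", "requirements", "summary", "tasks", "text", "title"]


def load_jobs(csv_path):
    """Read one exported CSV and warn about any expected column that is missing."""
    jobs = pd.read_csv(csv_path)
    missing = [col for col in EXPECTED_COLUMNS if col not in jobs.columns]
    if missing:
        print(f"Warning: {csv_path} is missing columns: {missing}")
    return jobs


# Hypothetical file names; replace them with the paths of your own exports.
# data_scientist = load_jobs("data_scientist.csv")
# data_engineer = load_jobs("data_engineer.csv")
```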
Your data_scientist DataFrame should look like below:
Messy data from different websites is organized into the columns name, pageUrl, requirements, summary, tasks, text, and title. How cool is that?
Next, we add an additional column called title that specifies whether the job is looking for a data scientist or a data engineer, then merge the two DataFrames:
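A minimal sketch of the labeling and merging step, using two tiny stand-in DataFrames in place of the real exports. The label values "Data Scientist" and "Data Engineer" are assumptions chosen to match the titles used later in the plot; note that if the export already has a title column, this assignment replaces its contents.

```python
import pandas as pd

# Tiny stand-in frames; in the article these come from the two CSV exports.
data_scientist = pd.DataFrame({
    "name": ["Senior Data Scientist", "Data Scientist II"],
    "requirements": ["statistics and python", "machine learning research"],
})
data_engineer = pd.DataFrame({
    "name": ["Data Engineer"],
    "requirements": ["kafka and spark pipelines"],
})

# Label each frame with the role it was searched for.
# This overwrites any existing "title" column.
data_scientist["title"] = "Data Scientist"
data_engineer["title"] = "Data Engineer"

# Stack the two frames on top of each other and renumber the rows.
df = pd.concat([data_scientist, data_engineer], ignore_index=True)
```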
Next, filter out the jobs without any text:
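One way to do this filtering, sketched on a small stand-in DataFrame. Treating whitespace-only strings as empty is an extra assumption; dropping only missing values would also match the description.

```python
import pandas as pd

# Stand-in for the merged DataFrame of job postings.
df = pd.DataFrame({
    "title": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Engineer"],
    "text": ["Build models", None, "   ", "Maintain pipelines"],
})

# Keep rows whose text is present and not just whitespace.
has_text = df["text"].notna() & (df["text"].fillna("").str.strip() != "")
df = df[has_text].reset_index(drop=True)
```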
Top Page Sources
What are the most frequent page sources? Start with extracting page sources using yarl:
pip install yarl
Then visualize the top 20 pages:
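The extraction and counting code is not reproduced here. The sketch below counts page sources on stand-in URLs and, so that it runs without extra installs, swaps in the standard library's urllib.parse for yarl. The source column name and the page_source helper are assumptions.

```python
from urllib.parse import urlsplit

import pandas as pd


def page_source(page_url):
    """Return the host part of a URL, e.g. 'jobs.example.com'."""
    # With yarl, the equivalent would be URL(page_url).host.
    return urlsplit(page_url).hostname


# Stand-in URLs; the real values come from the pageUrl column of the export.
df = pd.DataFrame({
    "pageUrl": [
        "https://jobs.example.com/view/1",
        "https://jobs.example.com/view/2",
        "https://careers.example.org/posting/9",
    ]
})

# Extract the host of each posting, then count postings per host.
df["source"] = df["pageUrl"].apply(page_source)
top_sources = df["source"].value_counts().head(20)
print(top_sources)
```

If matplotlib is installed, top_sources.plot.barh() is one way to turn these counts into a bar chart.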
Visualize Text
To get an understanding of common words in jobs related to data scientists and data engineers, we will visualize the phrases in the text column. Start with cleaning the text using texthero:
pip install texthero
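To make the cleaning step concrete, here is a rough plain-Python stand-in. It is not texthero itself: it only approximates the kind of default cleaning texthero's clean function applies (lowercasing, removing digits, punctuation, stopwords, and extra whitespace). The stopword list and the clean_text column name are illustrative assumptions.

```python
import re
import string

import pandas as pd

# Small illustrative stopword list; a real pipeline would use a much longer one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for", "is", "are"}

# Translation table that turns every punctuation character into a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, then drop digits, punctuation, stopwords, and extra whitespace."""
    if not isinstance(text, str):
        return ""  # treat missing values as empty text
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCTUATION_TO_SPACE)
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


df = pd.DataFrame({"text": ["5+ years of experience with Python, SQL and Spark.", None]})
df["clean_text"] = df["text"].apply(clean_text)
```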
Then visualize all text using wordcloud:
pip install wordcloud
Cool! As we would expect from job descriptions related to data engineers and data scientists, phrases such as ‘data engineer’, ‘data scientist’, ‘big data’, ‘data analysis’, and ‘data driven’ are very common.
What is the Difference Between Data Scientist and Data Engineer’s Requirements?
What we are really interested in is the difference between the job requirements of data scientists and data engineers. We can find that easily with scattertext.
Start with installing scattertext:
pip install scattertext
Then process the requirements column:
Build a corpus:
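Next, we get the scaled F-score of each term in each category. A term scores highly for a category when it is both frequent in that category and appears mostly in that category.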
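To build some intuition for the score, here is a simplified, one-sided sketch in plain Python: it takes the harmonic mean of a term's precision and frequency after scaling each with a normal CDF. This is only an illustration of the idea. The function name is an assumption, and scattertext's own implementation may differ in details such as how the scores of the two categories are combined.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev


def scaled_f_scores(category_docs, other_docs):
    """Simplified one-sided scaled F-score for every term, from the category's view."""
    cat_counts = Counter(word for doc in category_docs for word in doc.lower().split())
    other_counts = Counter(word for doc in other_docs for word in doc.lower().split())
    terms = sorted(set(cat_counts) | set(other_counts))
    total_cat = sum(cat_counts.values())
    if total_cat == 0:
        raise ValueError("category_docs must contain at least one word")

    # Precision: share of a term's occurrences that fall in the category.
    precision = {t: cat_counts[t] / (cat_counts[t] + other_counts[t]) for t in terms}
    # Frequency: share of the category's words taken up by the term.
    frequency = {t: cat_counts[t] / total_cat for t in terms}

    def normal_cdf_scale(values):
        """Map raw values to (0, 1) with a normal CDF fitted to those values."""
        numbers = list(values.values())
        spread = pstdev(numbers)
        if spread == 0:
            return {t: 0.5 for t in values}  # all values equal: no term stands out
        dist = NormalDist(mean(numbers), spread)
        return {t: dist.cdf(v) for t, v in values.items()}

    p, f = normal_cdf_scale(precision), normal_cdf_scale(frequency)
    # Harmonic mean: high only when both scaled precision and frequency are high.
    return {t: 2 * p[t] * f[t] / (p[t] + f[t]) if p[t] + f[t] > 0 else 0.0 for t in terms}


scores = scaled_f_scores(
    ["statistics python statistics research"],  # toy data scientist requirements
    ["kafka python spark kafka"],               # toy data engineer requirements
)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy example, statistics ranks first because it is both frequent and exclusive to the first category, while python ranks lower because it appears in both.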
Get terms with the highest data scientist F-scores:
['quantitative', 'statistics', 'r', 'machine', 'analysis', 'field', 'science', 'ability', 'computer', 'mathematics', 'analytics', 'work', 'techniques', 'degree', 'research', 'engineering', 'business', 'mining', 'environment', 'time', 'knowledge', 'python', 'experience', 'problems', 'skills', 'years', 'math', 'team', 'languages', 'bachelor']
Get terms with the highest data engineer F-scores:
['apache', 'kafka', 'self', 'implement', 'spark', 'dimensional', 'design', 'scala', 'storm', 'excellent', 'stream', 'building', 'java', 'g', 'operating', 'management', 'flink', 'growth', 'processing', 'hands', 'others', 'mapreduce', 'aws', 'pipelines', 'sets', 'datasets', 'e', 'perfect', 'enemy', 'ambiguity']
Aha! The terms that are most associated with the data scientist title and the data engineer title look right.
Looking at the text alone is boring. Let’s visualize the difference in terms between these two job titles using a scatter plot.
After running the code above, open the file data_science_vs_data_engineer_requirements_terms.html, and you should see something like below:
Cool! Click here to explore the plot above yourself.
Explanations for the plot above:
- The more blue a dot is, the more it is associated with the title Data Scientist
- The more red a dot is, the more it is associated with the title Data Engineer
- Terms in the bottom right corner are high in data engineer frequency and low in data scientist frequency
- Terms in the upper left corner are high in data scientist frequency and low in data engineer frequency
- Characteristic terms are the terms most associated with the postings as a whole (data scientist and data engineer combined) compared with general text
Did you see the general patterns?
- Requirements for data scientists seem to focus heavily on math, statistics, quantitative science, visualization, research, master and PhD degrees, etc.
- Requirements for data engineers seem to focus heavily on data engineer tools such as Apache Kafka, Scala, Spark, NoSQL, Hive, Hadoop.
- Python is common among both data scientists and data engineers.
To know which documents contain a particular term, click that term in the scatter plot.
Compare Between Data Scientist and Data Engineer’s Tasks
We can apply similar methods to the tasks column to compare the difference in tasks between data scientists and data engineers.
Below is the result:
You can explore the plot yourself here. Do you see the difference in terms between the requirements and tasks of these two professions?
Conclusion
Congratulations! You have just learned how to extract job descriptions related to data engineers and data scientists using Diffbot and how to create a scatter plot that compares these 2 jobs using scattertext.
Getting a job that fits your skills and interests requires you to understand its expectations and tasks. Since it doesn’t take much effort to get insights like those above with Diffbot and Python, why not give it a try?
Feel free to fork and play with the code for this article in this repo here:
I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.