Data Science, Tutorial

A Complete Guide to SQL for Data Science

Querying Techniques Every Data Scientist Should Know

Published in

Towards AI

13 min readFeb 19, 2021

Structured Query Language (SQL) is a language designed for storing, managing, and retrieving data from a relational database. All Relational DB Systems like MySQL, MS Access, SQL Server, Oracle, and Postgres use SQL as their standard database language for performing CRUD (Create, Read, Update & Delete) operations.
Filtering and performing certain manipulations or computations at the Database level can significantly improve the performance of the application that utilizes the data. SQL is a language used by Data Scientists almost on a day-to-day basis but not spoken about as much as R or Python. Working knowledge of SQL and databases is imperative for Data Science.

A relational database has data points that are related to one another, organized into tables. Most of the RDBMS use similar syntax, with some notable differences. Some functions which are available in one could be named differently or not be available in the other.

MySQL has consistently been the most popular RDBMS according to the Stack Overflow trends. We’ll be using MySQL Workbench for demonstration purposes in this article.

*When using some of the commands explained below in another RDBMS, the results might vary.*

Creating a Schema

A Schema is a logical collection of database objects. It refers to the organization of data as a blueprint of how the database is constructed.²

A schema will be created only if no schema exists with the same name.

Syntax: CREATE {DATABASE | SCHEMA} [IF NOT EXISTS] db_name ;CREATE SCHEMA company_info;
use company_info;

Creating a Table

Tables are database objects that contain data, logically organized in rows and columns, similar to a spreadsheet. Each row represents a record, and each column represents a field in the record.²

You can find the various data types and how to use them effectively in designing databases in the MySQL tutorial documentation. When creating tables we can apply certain constraints to the fields in the table. You can find the different MySQL constraints on the javatpoint page.

Syntax: CREATE TABLE table_name (column_name column_type);# Sample query for creating industry_data table
CREATE TABLE industry_data (
cmp_id INT UNSIGNED PRIMARY KEY,
companyname VARCHAR(50) NOT NULL,
industry_type VARCHAR(50) NOT NULL,
industry_code VARCHAR(50) NOT NULL,
start_yr INT
);

Viewing the Tables in the Schema

show tables;

Viewing Table Description

desc cmp_details;

ALTER TABLE

The ALTER TABLE the statement provides the option to add, delete or modify columns of an existing table.

Syntax:
# Addition of column & Modification of column datatype
ALTER TABLE table_name ADD|MODIFY column_name column_definition;
# Change column_name & column_definition
ALTER TABLE table_name CHANGE COLUMN old_name new_name column_definition; 
# Drop column 
ALTER TABLE table_name DROP COLUMN column_name;
desc dummy;

dummy table description

# Adding new column comments
ALTER TABLE dummy
ADD comments_a varchar(50),
ADD comments_b varchar(50);

# Modify datatype of column address
ALTER TABLE dummy
MODIFY address TEXT;

# Dropping column
ALTER TABLE dummy 
DROP comments_b;

# Changing column name & datatype
ALTER TABLE dummy 
CHANGE COLUMN comments_a address_line text;

DROP TABLE

Syntax: DROP TABLE table_name;DROP TABLE dummy;

Loading Data from a CSV

*Follow the steps (in the box) below with caution. Do not try these in your Database connection at the workplace.*

show global variables like 'local_infile';
# If local_infile is not 'ON' allow local file to be loaded
set global local_infile=true;
# Edit the Connection settings - 'Advanced' Tab
OPT_LOCAL_INFILE=1

Prerequisite: A database table matching the structure of the file into which its content would be imported.

# Code to load the csv contents into a table
# Sample query for loading industry_data from csv
LOAD DATA LOCAL INFILE 'C:/SQL/industry.csv' 
INTO TABLE industry_data 
FIELDS TERMINATED BY ',' 
ENCLOSED BY '“'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

Reading Records

SELECT is used to retrieve data selected from one or more tables and can include UNION statements and subqueries.
The LIMIT command restricts the number of records retrieved by the SELECT statement.

SELECT * FROM company_info.cmp_details LIMIT 5;

SELECT * FROM company_info.cmp_revenue LIMIT 5;

SELECT * FROM company_info.industry_data LIMIT 5;

SELECT * FROM company_info.service_info LIMIT 10;

SELECT service_station_id,active_status FROM company_info.service_info;

Inserting a Record

INSERT INTO statement is used to insert new records in a table. If you are entering values for all the columns of the table, you need not specify the column names in the query. However, the order of the values should be the same as the columns in the table.⁴

Syntax: INSERT INTO table_name ( field1, field2,...fieldN )
        VALUES ( value1, value2,...valueN );INSERT INTO company_info.cmp_revenue (id, address, founder, revenue)
VALUES (1055,'130 Dummy Junction','Diana','$8.43B');
INSERT INTO company_info.cmp_revenue (id, address, founder, revenue)
VALUES (1060,'150 Dummy Junction','Fiona','');SELECT * FROM cmp_revenue where id in ('1055','1060');

Newly inserted record

For updation/deletion of records, it is recommended we specify a unique column or primary key of the table in the where clause to prevent unintended records from getting modified.

Updating Records

The UPDATE statement is used to update the data of an existing table in the database. We can update single columns as well as multiple columns using an UPDATE statement.⁷

Syntax: UPDATE table_name SET field1 = new_value1, field2 = new_value2 [WHERE clause];UPDATE company_info.cmp_revenue  
SET revenue = NULL
WHERE id in ('1055','1060');SELECT * FROM cmp_revenue where id in ('1055','1060');

Updated records

Deleting Records

The DELETE statement is used to delete existing records from a table. We can delete a single record or multiple records depending on the condition we specify in the WHERE clause.⁷

Syntax: DELETE FROM table_name WHERE condition;DELETE FROM cmp_revenue where id in ('1055','1060');SELECT * FROM cmp_revenue where id in ('1055','1060');

Records deleted

Filtering Records

The WHERE clause in MySQL is a keyword used to specify the criteria for retrieving records from the table. It always follows a SELECT, INSERT, UPDATE or DELETE statement to filter the records and perform various operations on the data.¹

Let us try these operations with SELECT statements using the tables we’ve created.

Syntax: 
SELECT select_list 
FROM table_name 
WHERE condition;SELECT * FROM service_info WHERE cmp_id = '1002';

SELECT * FROM service_info WHERE cmp_id in ('1001','1002');

SELECT * FROM cmp_details WHERE headquarters not in ('Hungary','Berlin','Banglore','California','New York','Berlin') LIMIT 10;

SELECT * FROM industry_data WHERE start_yr > 2010;

SELECT * FROM industry_data WHERE start_yr < 1980;

SELECT * FROM industry_data WHERE start_yr <= 1985;

SELECT cmp_id,companyname,industry_type FROM industry_data WHERE start_yr >= 2003;

Both the operators below will perform the same operation. Checking for values where the parameter is not equal to the value in the condition.

# Not equal operation
SELECT DISTINCT start_yr FROM industry_data WHERE start_yr!= '1979';
SELECT DISTINCT start_yr FROM industry_data WHERE start_yr<>'1979';

operator <>

SELECT * FROM industry_data 
WHERE industry_type = 'Electronic Components' 
OR cmp_id in ('1017','1021');

SELECT * FROM industry_data 
WHERE industry_type = 'Electronic Components' 
AND start_yr = '1979';

AND operator

The BETWEEN operator uses the greater than or equal and less than or equal operators.

SELECT cmp_id,companyname,industry_type,start_yr 
FROM industry_data
WHERE start_yr BETWEEN 1990 AND 2000;

SELECT * FROM industry_data
WHERE industry_type LIKE 'oil%'
OR industry_type LIKE '%pharma%'
OR industry_code LIKE '%712';

SELECT * FROM cmp_revenue 
WHERE revenue IS NULL;

IS NULL operator

SELECT * FROM cmp_revenue 
WHERE revenue IS NOT NULL
LIMIT 10;

Arranging Data

We can use the ORDER BY statement to arrange the records in ascending or descending order. By default, the records are arranged in ascending order if the preferred order is not mentioned.

Syntax :
SELECT select_list FROM table_name 
ORDER BY column1 [ASC|DESC], column2 [ASC|DESC], ...;SELECT * FROM cmp_revenue ORDER BY founder ASC;

SELECT id,address,founder FROM cmp_revenue ORDER BY id DESC;

JOIN Syntax:
SELECT column_list FROM table_1 
[LEFT|RIGHT|INNER] JOIN table_2
ON join_condition;

LEFT JOIN

The LEFT JOIN selects all the records from the left table regardless of whether matching rows exist in the right table. In case there are no matching rows from the right table, the columns of the right table in the final result set are filled with NULL for those records.³

We will now merge all the tables using the LEFT JOIN operation on the field cmp_id.

SELECT * FROM (
SELECT A.cmp_id,company_url,id,founder, 
revenue,companyname,industry_type, 
service_station_id,active_status,service_station_owner 
FROM cmp_details A
LEFT JOIN  cmp_revenue B
ON A.cmp_id=B.id
LEFT JOIN industry_data C
ON A.cmp_id=C.cmp_id
LEFT JOIN service_info D
ON A.cmp_id=D.cmp_id) T
WHERE cmp_id in (1017,1018,1019);

If you noticed, I have mentioned the table alias alongside the ‘cmp_id’ field in the SELECT query but haven’t specified it for the other columns, because there are 3 tables(cmp_details, industry_data, service_info) with the ‘cmp_id’ field. Not mentioning the table alias would cause ambiguity as to which table the ‘cmp_id’ should be derived from.

INNER JOIN

The INNER JOIN selects the records which are common to both the tables. In case there are no matching records from either one of the tables, the columns of both the tables in the final result set will be dropped for those records.⁵

SELECT * FROM industry_data A
INNER JOIN cmp_revenue B
ON A.cmp_id=B.id
WHERE A.cmp_id BETWEEN 1015 AND 1020;

We can see the record for id ‘1018’ dropped since it is available in one table but not in the other.

RIGHT JOIN

The RIGHT JOIN returns all records from the RIGHT-side table specified in the ON condition and only those rows from the other table where the join condition matches.¹

SELECT * FROM industry_data A
RIGHT JOIN service_info B
ON A.cmp_id=B.cmp_id
WHERE B.cmp_id BETWEEN 1050 AND 1052;

Here we can observe how the records on the table from the right will not be dropped even if a corresponding record is not found on the left table.

Aggregate Functions

SELECT COUNT(*) as total_companies, COUNT(DISTINCT start_yr) as unique_yrs,MIN(start_yr) as min_year, MAX(start_yr) as max_year, AVG(start_yr) as avg_start_yr
FROM industry_data;

count, distinct, min, max, avg

GROUP BY

The GROUP BY clause is used to collect data from multiple records and group the result by one or more columns. You can also use aggregate functions like COUNT, SUM, MIN, MAX, AVG, etc. on the grouped column.³

Syntax:
SELECT c1, c2,..., cn, aggregate_function(ci) 
FROM table 
WHERE conditions 
GROUP BY c1 , c2,...,cn;

Let us now find the average starting year for a set of industry types.

GROUP BY & AVG

SELECT industry_type, AVG(start_yr) FROM industry_data
WHERE industry_type in ('Natural Gas Distribution','Hotels/Resorts','Major Banks','EDP Services')
GROUP BY industry_type;

Finding the count of service stations that are active and inactive.

GROUP BY & COUNT

SELECT count(*),active_status FROM service_info
GROUP BY active_status;

HAVING

The HAVING clause is used in the SELECT statement to specify filter conditions for a group of rows or aggregates. The HAVING clause is often used with the GROUP BY clause to filter groups based on a specific condition. If the GROUP BY clause is omitted, the HAVING clause behaves like the WHERE clause.¹

Syntax:
SELECT select_list FROM table_name 
WHERE search_condition 
GROUP BY group_by_expression 
HAVING group_condition;

Let’s say we need to find the company ids that have more than 3 active service stations.

SELECT cmp_id,count(cmp_id),active_status FROM service_info
GROUP BY cmp_id,active_status
HAVING count(cmp_id) > 3 and active_status = 1;

String Functions

You can find below the table containing all the essential string functions in MySQL. For examples, please refer to the w3resource link.

UNION

The UNION operator allows us to combine two or more result sets of queries into a single result set. By default, the UNION operator removes duplicate rows even if the DISTINCT operator is not mentioned explicitly.¹

Syntax:
SELECT column_list 
UNION [DISTINCT | ALL] 
SELECT column_list 
UNION [DISTINCT | ALL] 
SELECT column_list ...SELECT * FROM cmp_details WHERE cmp_id in (1011,1013)
UNION
SELECT * FROM cmp_details WHERE cmp_id BETWEEN 1012 AND 1017
ORDER BY headquarters ASC;

SELECT * FROM cmp_details WHERE cmp_id in (1011,1013)
UNION ALL
SELECT * FROM cmp_details WHERE cmp_id BETWEEN 1012 AND 1017
ORDER BY cmp_id ASC;

CASE

The CASE expression is a control flow structure that allows us to add if-else logic to a query.¹

#Simple CASE
Syntax:
CASE value    
WHEN value1 THEN result1    
WHEN value2 THEN result2    
…    
[ELSE else_result] 
END#Searched CASE expression
Syntax:
CASE    
WHEN expression1 THEN result1    
WHEN expression2 THEN result2    
…    
[ELSE else_result] 
END

Based on the start year range, set a time_period value based on conditions.

SELECT DISTINCT start_yr,
CASE 
 WHEN start_yr <= 2000 THEN 'OLD'
 WHEN start_yr > 2000 AND start_yr < 2010 THEN 'MEDIUM'
 WHEN start_yr > 2010 THEN 'NEW'
 end as time_period
FROM
 industry_data;

Subquery

A subquery is a query nested within another query. Also, a subquery can be nested inside another subquery.¹

Display company revenue details for companies that started after 2005.

SELECT * FROM cmp_revenue
WHERE id in (
SELECT cmp_id FROM industry_data
WHERE start_yr > 2005
);

WITH - Common Table Expression

WITH clause is used to define common table expressions. A common table expression is a named temporary result set.

Syntax:
WITH cte_name (column_list) AS (
query
)
SELECT * FROM cte_name;

Unlike a derived table, a CTE can be self-referencing (a recursive CTE) or can be referenced multiple times in the same query. A CTE provides better readability and performance in comparison with a derived table.¹

WITH companies AS (
SELECT companyname,industry_type,
CASE 
 WHEN industry_type='Natural Gas Distribution' THEN 1
    WHEN industry_type='Major Pharmaceuticals' THEN 2
    ELSE 0
END AS indus_category
FROM industry_data
WHERE start_yr > 2005
) SELECT companyname 
FROM companies
WHERE indus_category > 0;

Summary

Concepts we’ve covered in this article:

Creating Schema and Tables
Alter, Drop Table — (ALTER TABLE,DROP TABLE)
Loading Data from a CSV
CRUD Operations — (INSERT INTO,SELECT,UPDATE,DELETE)
Filtering Records — (WHERE)
Arranging Data — (ORDER BY)
Joins — (INNER,LEFT,RIGHT)
Aggregate Functions — (SUM,DISTINCT,COUNT,AVG,MIN,MAX,GROUP BY,HAVING)
String Functions
UNION, Subquery, Common Table Expression (WITH)

SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing. - Philip Greenspun

Wrapping up

Thanks to anyone who has made it this far. I hope you found this article helpful. Please share your feedback/queries in the comments. Now, it’s time for you to put your SQL skills to use. Type less, think hard! Wish you luck!

If you found this article interesting and are passionate about Data Science/Machine Learning, hit follow and feel free to add me on LinkedIn.

REFERENCES:
1. www.mysqltutorial.org
2. https://dev.mysql.com/doc/refman/8.0/en/
3. https://www.javatpoint.com/
4. https://www.w3schools.com/sql/
5. https://www.tutorialspoint.com/mysql/
6. https://www.w3resource.com/mysql/mysql-tutorials.php
7. https://www.geeksforgeeks.org/sql-tutorial/

Suggestions

Essential SQL Skills for Data Scientists in 2021 by Leihua Ye, Ph.D.

4 Tricky SQL Questions for Data Scientists in 2021 by Leihua Ye, Ph.D.

Data Science, Tutorial

A Complete Guide to SQL for Data Science

Querying Techniques Every Data Scientist Should Know

Written by Mahadev Easwar