Brief introdution to SQL

Published by Sohhom on

Structured Query Language (SQL) is an essential skill for data scientists. It allows them to interact with and manipulate large datasets stored in relational databases. Proficiency in SQL can help data scientists to clean, transform, and analyze data, enabling them to draw insights from the data they work with effectively. In this article, we will examine SQL’s essential features and provide examples of how they can be used for data manipulation in a data science context.

Basic SQL Queries for Data Manipulation

In this section, we will explore some of the basic SQL queries that are used for data manipulation:

SELECT

The SELECT statement is used to select columns from a table that meet specific criteria. For example, you may want to retrieve all rows from the “employees” table where the “job_title” column contains “Data Scientist”:

SELECT *
FROM employees
WHERE job_title = 'Data Scientist';

WHERE

The WHERE clause allows you to filter the rows returned by a SELECT statement by specifying one or more conditions that must be met. You can combine multiple conditions using AND and OR operators. For example, you can retrieve only the rows for data scientists working in California and earning more than $100,000:

SELECT *
FROM employees
WHERE job_title = 'Data Scientist' AND location = 'California' AND salary > 100000;

ORDER BY

The ORDER BY clause is used to sort the rows returned by a SELECT statement by the values in one or more columns. You can specify the sorting order as either ASC (ascending) or DESC (descending). For example:

SELECT *
FROM employees
WHERE job_title = 'Data Scientist' AND location = 'California'
ORDER BY salary DESC;

LIMIT

The LIMIT clause allows you to limit the number of rows returned by a SELECT statement to a specific number. For example, to find the top 10 highest paid data scientists in California, you would use the following query:

SELECT *
FROM employees
WHERE job_title = 'Data Scientist' AND location = 'California'
ORDER BY salary DESC
LIMIT 10;

Intermediate SQL Queries for Data Manipulation

In this section, we will explore some intermediate-level SQL queries that are used for data manipulation:

JOIN

The JOIN clause allows you to combine multiple tables based on common columns. Here is an example that shows how to retrieve a list of employee names and their project names:

SELECT employees.name, projects.project_name
FROM employees
JOIN projects ON employees.employee_id = projects.employee_id;

GROUP BY

The GROUP BY clause is used to group rows that have the same values in one or more specified columns. This is often used with aggregate functions like COUNT, SUM, AVG, MAX, or MIN to perform calculations on each group of rows. For example, to find the total salary paid to data scientists in each department:

SELECT department, SUM(salary) as total_salary
FROM employees
WHERE job_title = 'Data Scientist'
GROUP BY department;

HAVING

The HAVING clause is used to filter the results of a GROUP BY query based on the result of an aggregate function. For example, to find the departments where the total salary paid to data scientists is greater than $1,000,000:

SELECT department, SUM(salary) as total_salary
FROM employees
WHERE job_title = 'Data Scientist'
GROUP BY department
HAVING total_salary > 1000000;

UNION

The UNION operator combines the results of two or more SELECT statements into a single set of rows. To use UNION, the SELECT statements must have the same number of columns and the corresponding columns must have compatible data types. For example, to retrieve a list of job positions from two different tables:

SELECT job_title
FROM employees
UNION
SELECT job_position
FROM contractors;

Advanced SQL Queries for Data Manipulation

In this section, we will explore some advanced-level SQL queries that are used for data manipulation:

Subqueries

A subquery is a SELECT statement that is embedded within another SELECT, INSERT, UPDATE, DELETE or CREATE statement. Subqueries are used to retrieve data from one table based on data from another table. For example, to find the average salary of data scientists in a department:

SELECT department, AVG(salary) as average_salary
FROM employees
WHERE employee_id IN (SELECT employee_id FROM employees WHERE job_title = 'Data Scientist')
GROUP BY department;

Common Table Expressions (CTE)

A Common Table Expression (CTE) is a temporary result set that you can use within another SELECT, INSERT, UPDATE, DELETE, or CREATE statement. CTE can make your SQL queries more readable and modular. Here is an example to find the highest paid data scientist in each department:

WITH ds_salaries AS (
  SELECT department, MAX(salary) as top_salary
  FROM employees
  WHERE job_title = 'Data Scientist'
  GROUP BY department
)
SELECT employees.name, employees.department, ds_salaries.top_salary
FROM employees
JOIN ds_salaries ON employees.department = ds_salaries.department AND employees.salary = ds_salaries.top_salary;

CASE Statements

The CASE statement allows you to perform conditional logic in SQL queries. It can be used to format or transform data based on specific conditions. For example, to categorize employees as junior or senior data scientists based on their salary:

SELECT name, job_title,
       CASE
         WHEN salary < 100000 THEN 'Junior Data Scientist'
         ELSE 'Senior Data Scientist'
       END as category
FROM employees
WHERE job_title = 'Data Scientist';

Conclusion

SQL is a powerful language for data manipulation in data science. By mastering essential SQL queries, data scientists can work more effectively with large datasets stored in relational databases and derive insightful information. To become proficient in SQL, it is important to practice using different query constructs to solve a variety of data manipulation challenges.

Categories: Core Concepts