Dinakar
Published on

SQL elegance: Crafting beautiful queries over real-world datasets

Table of Contents

This blog will be about how I have written SQL queries on a real-world dataset. Before delving deeper into nitty-gritty details, the one thing I think one should comprehend before writing SQL queries is this:

"Imagine trying to navigate a foreign city without a map. Similarly, attempting to retrieve valuable information from your data without understanding its underlying structure is like trying to find a needle in a haystack. Understanding the data and how it is stored across different tables is the key to unlocking the true power of your data. Without it, you'll wander aimlessly and waste valuable time and resources writing meaningless SQL queries”

So, In short, Understanding the underlying data is the ultimate to being able to write useful SQL queries.

Without any further ado, let’s get started!

What is SQL programming is all about?

Credits: Generated by Midjourney

Dataset I’ve dealt with

The dataset I’ve exercised on is 120 years of Olympic history: athletes and results. Basically, this is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016 and scraped by rgriffin.

In a nutshell, it consists of two .csv files. One named “athlete_events.csv” incorporated info about each row that corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The file contains 271116 rows and 15 columns. 

Excel Sheet representing each invdividual participation in olympics

Figure: Excel Sheet representing each invdividual participation in olympics

Another is named “‘noc_regions.csv” including info about noc and corresponding regions along with notes. The file contains 231 rows and 3 columns.

A Note on NOC:

A National Olympic Committee (NOC) is an organization that represents a country in the International Olympic Committee (IOC). The NOC is responsible for organizing, promoting, and developing the Olympic Movement in their respective countries, including selecting and training teams to participate in the Olympic Games. Each country has one NOC, which is recognized by the IOC. NOCs work closely with their national government and other sports organizations to promote the Olympic values and ideals within their country.

Excel Sheet representing the olympic committees

Figure: Excel Sheet representing the olympic committees

Perplexing Problems that Persisted

Problem 1: Please configure the PostgreSQL Binary Path in the Preferences dialog.

Fixed by setting proper PostgreSQL Binary path (By the way I’ve used PostgreSQL to execute SQL queries). With the help of the files present in the bin SQL queries gets executed.

Problem 2: CSV file Header error

Solved by enabling the headers option. Since the CSV file has headers as its first row.

Problem 3 :Format error in the CSV files in the field “name”

Fixed by choosing double quotes (“) as an escape character

Writing SQL Queries

Checking the Database we’re currently resided in

SELECT current_database()

Dropping (if exists) and Creating two tables to load in the csv files

DROP TABLE IF EXISTS OLYMPICS_HISTORY;
CREATE TABLE IF NOT EXISTS OLYMPICS_HISTORY
(
	id INT,
	name VARCHAR,
	sex VARCHAR,
	age VARCHAR,
	height VARCHAR,
	weight VARCHAR,
	team VARCHAR,
	noc VARCHAR,
	games VARCHAR,
	year INT,
	season VARCHAR,
	city VARCHAR,
	sport VARCHAR,
	event VARCHAR,
	medal VARCHAR
);


DROP TABLE IF EXISTS OLYMPIC_COMMITTEES;
CREATE TABLE IF NOT EXISTS OLYMPIC_COMMITTEES
(
	noc VARCHAR,
	region VARCHAR
);

Query 1: How many olympics games have been held?

Info to understand: Games happen utmost twice in an year (Winter and Summer).

SELECT COUNT(DISTINCT games) AS olympic_games FROM olympics_history;

Explanation:

Here is what each part of the query does:

  • SELECT: This keyword is used to select the columns from the table you want to retrieve data from
  • COUNT(DISTINCT games): This function counts the number of unique values in the "games" column. The DISTINCT keyword is used to make sure that the same value is not counted multiple times.
  • AS olympic_games: This is used to give an alias name to the output column resulting from the count(distinct games) function.
  • FROM olympics_history: This keyword specifies the table from which you want to retrieve data.

The query will return one row with one column, named "olympic_games", that contains the number of unique values in the "games" column of the "olympics_history" table.

It's worth noting that, the COUNT() function does not count NULL values, so if there are any NULL values in the "games" column, they will not be included in the count.

Result:

Query 2: List down all Olympics games held so far

Info: Sorting out uniquely the rows with unique values of both year and season and then sort the results by the year will yield our desired results. Note that sorting by unique city values will mislead since a city might host olympic games more than once.

select distinct year, season, city from olympics_history order by year;

Here's what each part of the query does:

  • SELECT DISTINCT: This keyword is used to select only the unique values in the "year" and "season" columns.
  • year, season, city: These are the columns that you want to retrieve data from.
  • FROM table_name: This keyword specifies the table from which you want to retrieve data.
  • ORDER BY year: This clause sorts the result set by the "year" column in ascending order.

Here’s another alternative SQL query

select year, season, city from olympics_history group by year, season, city order by year;

Miscellaneous: How GROUP BY statement works

Image about how "group by" statement works

Query 3: Mention the total no of nations who participated in each Olympics game.

select games, count(distinct noc) from olympics_history group by games;

Here’s the alternative query along with explanation:

    with all_countries as
        (select games, nr.region
        from olympics_history oh
        join olympic_committees nr ON nr.noc = oh.noc
        group by games, nr.region)
    select games, count(1) as total_countries
    from all_countries
    group by games

    order by games;

Explanation:

  • With all_countries as (...): This subquery creates a table with two columns “games” from “olympics_history” and “region” from “olympics_committees” tables respectively. Then the corresponding rows in the resultant table (“games” and “regions”) are filtered based on “noc” columns in the both tables. Next, rows are again filtered based on unique combinations of “games” and “region”
  • Select games, count(1) as total_countries from all_countries group by games;: This query first selects “games” column and “count(1)” returns the count of number of rows. The rows of the resultant table are further filtered out using unique “games” rows and corresponding count(1) changes too. And then on the result is sorted by “games” column.

Discussion: count(1) function

Query 4: Which year saw the highest and lowest no of countries participating in olympics

with all_countries as (
   select games, nr.region
	from olympics_history oh join olympic_committees nr ON oh.noc = nr.noc
	group by games, nr.region
),
total_countries_in_each_game as (
 select games, count(1) as total_countries
	from all_countries
	group by games
	order by games
)
select distinct
concat(first_value(games) over(order by total_countries_in_each_game), ' - ',
	   first_value(total_countries) over(order by total_countries_in_each_game)
	  ) as lowest_countries
,
concat(first_value(games) over(order by total_countries_in_each_game desc), ' - ',
	   first_value(total_countries) over(order by total_countries_in_each_game desc)
	  ) as highest_countries
from total_countries_in_each_game
;

Query 5: Which nation has participated in all of the Olympic games

-- total_games - subquery that creates a 1x1 table with number of games held
with total_games as (
 select count(distinct games) as total from olympics_history
),
-- The below subquery creates a table of country and its corresponding participation
-- in each olympic season
countries_and_season_they_participated as (
 select oh.games, oc.region
	from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
	group by oh.games, oc.region
),
-- countries_participated creates a table with region and their corresponding count of
-- participation in olympic games
countries_participated as (
 select region, count(games) as total_participated_games
	from countries_and_season_they_participated
	group by region
)

select cp.* from countries_participated cp join total_games tg on cp.total_participated_games = tg.total;

Query 6: Identify the sport which was played in all summer Olympics

-- Count total number of games which were held in summer
with total_summer_games as (
 select count(distinct games) as total from olympics_history oh where oh.season = 'Summer'
),
-- Sport conducted in summer over the years
sports_conducted as (
 select sport, games, season from olympics_history oh where oh.season = 'Summer' group by sport, games, season
),
-- Count of sports conducted in summer over the years
sports_count as (
select sport, count(1) as no_of_games from sports_conducted group by sport
)

select * from sports_count sc join total_summer_games tsg on sc.no_of_games = tsg.total

Query 7: Which Sports were just played only once in the Olympics

-- Sport conducted over the years
with sports_conducted as (
 select sport, games from olympics_history oh group by sport, games
),
-- Count of sports conducted over the years
sports_count as (
select sport, count(1) as no_of_games from sports_conducted group by sport
)

select sc.*, sct.games
from sports_count sc join sports_conducted sct on sc.sport = sct.sport where sc.no_of_games=1
order by 1;

Query 8: Fetch the total no of sports played in each olympic games

-- Sport conducted over the years
with sports_conducted as (
 select sport, games from olympics_history oh group by sport, games
)

select games, count(1) as no_of_sports
from sports_conducted group by games order by no_of_sports desc;

Query 9: Fetch the oldest athletes to win a gold medal

-- Filter out rows which have age as 'NA'
with age_filtered as (
select name, sex, cast(case when age = 'NA' then '0' else age end as int) as age, team, games, city, sport, event, medal
from olympics_history
),
-- Rows are re-arranged and adding rank column (among gold medalists) according to the value of the age
ranked_ages as (
 select *, rank() over(order by age desc) as rnk
	from age_filtered
	where medal='Gold'
)

select * from ranked_ages where rnk=1;

Query 10: Find the Ratio of male and female athletes who participated in all Olympic games.

SELECT
    (SELECT COUNT(*) FROM olympics_history WHERE sex = 'M') as male_count,
    (SELECT COUNT(*) FROM olympics_history WHERE sex = 'F') as female_count,
    ROUND((SELECT COUNT(*) FROM olympics_history WHERE sex = 'M') / CAST((SELECT COUNT(*) FROM olympics_history WHERE sex = 'F') AS DECIMAL(10, 2)), 2) as male_to_female_ratio

Query 11: Fetch the top 5 athletes who have won the most gold medals.

select name, team, count(*) as gold_medals
from olympics_history
where medal='Gold'
group by name, team
order by gold_medals desc
limit 5;

Query 12: Fetch the top 5 athletes who have won the most medals (gold/silver/bronze).

-- select distinct medal from olympics_history;
select name, team, count(*) as total_medals
from olympics_history
where medal != 'NA'
group by name, team
order by total_medals desc
limit 5;

Query 13: Fetch the top 5 most successful countries in the Olympics. Success is defined by no of medals won.

select oc.region, count(*) as total_medals
from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
where oh.medal != 'NA'
group by region
order by total_medals desc
limit 5;

Query 14: List down total gold, silver, and bronze medals won by each country.

select oc.region as country,
	SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) as gold_count,
    SUM(CASE WHEN medal = 'Silver' THEN 1 ELSE 0 END) as silver_count,
    SUM(CASE WHEN medal = 'Bronze' THEN 1 ELSE 0 END) as bronze_count
from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
where oh.medal != 'NA'
group by country
-- How to sort by the sum of gold_count + silver_count + bronze_count
order by gold_count desc;

Query 15: List down total gold, silver, and bronze medals won by each country corresponding to each Olympic games.

select oc.region as country, games,
	SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) as gold_count,
    SUM(CASE WHEN medal = 'Silver' THEN 1 ELSE 0 END) as silver_count,
    SUM(CASE WHEN medal = 'Bronze' THEN 1 ELSE 0 END) as bronze_count
from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
group by country, games
order by games asc;
-- Rewrite using advanced SQL

Query 16: Identify which country won the most gold, most silver, and most bronze medals in each Olympic games

with countries_medals_count as (
select oc.region as country, games,
	SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) as gold,
    SUM(CASE WHEN medal = 'Silver' THEN 1 ELSE 0 END) as silver,
    SUM(CASE WHEN medal = 'Bronze' THEN 1 ELSE 0 END) as bronze
from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
group by country, games
order by games asc
)

 select distinct games
    	, concat(first_value(country) over(partition by games order by gold desc)
    			, ' - '
    			, first_value(gold) over(partition by games order by gold desc)) as Max_Gold
    	, concat(first_value(country) over(partition by games order by silver desc)
    			, ' - '
    			, first_value(silver) over(partition by games order by silver desc)) as Max_Silver
    	, concat(first_value(country) over(partition by games order by bronze desc)
    			, ' - '
    			, first_value(bronze) over(partition by games order by bronze desc)) as Max_Bronze
    from countries_medals_count
    order by games;

Query 17: Identify which country won the most gold, most silver, most bronze medals and the most medals in each olympic games.

with countries_medals_count as (
select oc.region as country, games,
	SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) as gold,
    SUM(CASE WHEN medal = 'Silver' THEN 1 ELSE 0 END) as silver,
    SUM(CASE WHEN medal = 'Bronze' THEN 1 ELSE 0 END) as bronze,
	SUM(CASE WHEN medal != 'NA' THEN 1 ELSE 0 END) as total_medals
from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
group by country, games
order by games asc
)

 select distinct games
    	, concat(first_value(country) over(partition by games order by gold desc)
    			, ' - '
    			, first_value(gold) over(partition by games order by gold desc)) as Max_Gold
    	, concat(first_value(country) over(partition by games order by silver desc)
    			, ' - '
    			, first_value(silver) over(partition by games order by silver desc)) as Max_Silver
    	, concat(first_value(country) over(partition by games order by bronze desc)
    			, ' - '
    			, first_value(bronze) over(partition by games order by bronze desc)) as Max_Bronze,
		 concat(first_value(country) over(partition by games order by total_medals desc)
    			, ' - '
    			, first_value(total_medals) over(partition by games order by total_medals desc)) as Max_medals
    from countries_medals_count
    order by games;

Query 18: Which countries have never won a gold medal but have won silver/bronze medals?

with countries_medals_count as (
select oc.region as country,
	SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) as gold,
    SUM(CASE WHEN medal = 'Silver' THEN 1 ELSE 0 END) as silver,
    SUM(CASE WHEN medal = 'Bronze' THEN 1 ELSE 0 END) as bronze
from olympics_history oh join olympic_committees oc on oh.noc = oc.noc
group by country
order by gold asc
)

select * from countries_medals_count where gold = 0 and (silver > 0 or bronze > 0) order by bronze desc;

Query 19: In which Sport/event, India has won highest medals.

with india_medals_by_sport as (
select sport, team, count(1) as total_medals
from olympics_history
where team = 'India' and medal <> 'NA'
group by sport, team
order by total_medals desc
),
ranking as (
 select *, rank() over(order by total_medals desc) as rnk
	from india_medals_by_sport
)

select sport, total_medals from ranking where rnk=1;

Query 20:  Break down all Olympic games where India won a medal for Hockey and how many medals in each Olympic games

select team, sport, games, count(*) as total_medals
from olympics_history
where team = 'India' and sport ='Hockey' and medal <> 'NA'
group by team, sport, games
order by total_medals desc;

Note:

  • The use of the WITH clause should always be at the start of the query, before the main SELECT statement.
  • Using two SELECT keywords simultaneously will result in a syntax error. Instead, you can use subqueries or CTEs to achieve the same results.
  • In the ORDER BY clause, you should use the name of the column from the current table. In the WHERE clause, you should use the name of the column from the previously created table or subquery.
  • Comparison of values between rows in a table is usually done through the use of aggregate functions and GROUP BY clauses, rather than comparing each row to every other row.

Credits

Practice writing SQL queries with a real dataset