There are some SQL patterns that, once you know them, you start seeing them everywhere. The solutions to the puzzles I'll present here are actually quite simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries you write on a day-to-day basis.
These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query that I had to build. I also encourage you to try them yourself, so that you can challenge yourself first, which will improve your learning!
All queries to generate the datasets will be provided in a PostgreSQL- and DuckDB-friendly syntax, so that you can easily copy them and play with them. At the end I will also share a link to a GitHub repo containing all the code, as well as the answer to the bonus challenge I'll leave for you!
I organized these puzzles in order of increasing difficulty, so, if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won't have seen before.
Okay, let's get started.
I like this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.
The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the "from" and "to" stages of the move. The stages present are New, Doing, Review, and Done.
Some things you need to know (edge cases):
- Tickets can move backwards, meaning tickets can return to the Doing stage.
- You should not include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there.
- Tickets are not always created in the New stage.
CREATE TABLE ticket_moves (
ticket_id INT NOT NULL,
create_date DATE NOT NULL,
move_date DATE NOT NULL,
from_stage TEXT NOT NULL,
to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
VALUES
-- Ticket 1: Created in "New", then moves to Doing, Review, Done.
(1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
(1, '2024-09-01', '2024-09-07', 'Doing', 'Review'),
(1, '2024-09-01', '2024-09-10', 'Review', 'Done'),
-- Ticket 2: Created in "New", then moves: New → Doing → Review → Doing again → Review.
(2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
(2, '2024-09-05', '2024-09-12', 'Doing', 'Review'),
(2, '2024-09-05', '2024-09-15', 'Review', 'Doing'),
(2, '2024-09-05', '2024-09-20', 'Doing', 'Review'),
-- Ticket 3: Created in "New", then moves to Doing. (Edge case: no subsequent move from Doing.)
(3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
-- Ticket 4: Created already in "Doing", then moves to Review.
(4, '2024-09-15', '2024-09-22', 'Doing', 'Review');

A summary of the data:
- Ticket 1: Created in the New stage, moves normally to Doing, then Review, and then Done.
- Ticket 2: Created in New, then moves: New → Doing → Review → Doing again → Review.
- Ticket 3: Created in New, moves to Doing, but is still stuck there.
- Ticket 4: Created in the Doing stage, moves to Review afterward.
It might be a good idea to stop for a bit and think about how you would deal with this. Can you find out how long a ticket stays in a single stage?
Honestly, this sounds intimidating at first, and it looks like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I'll explain what is happening.
WITH stage_intervals AS (
SELECT
ticket_id,
from_stage,
move_date
- COALESCE(
LAG(move_date) OVER (
PARTITION BY ticket_id
ORDER BY move_date
),
create_date
) AS days_in_stage
FROM
ticket_moves
)
SELECT
SUM(days_in_stage) / COUNT(DISTINCT ticket_id) as avg_days_in_doing
FROM
stage_intervals
WHERE
from_stage = 'Doing';

The first CTE uses the LAG function to find the previous move of the ticket, which will be the time the ticket entered that stage. Calculating the duration is as simple as subtracting the previous date from the move date.
What you should notice is the use of COALESCE on the previous move date. What it does is: if a ticket doesn't have a previous move, it uses the creation date of the ticket instead. This takes care of the cases of tickets being created directly in the Doing stage, as the query will still correctly calculate the time it took to leave the stage.
This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries, as it visited the Doing stage on two separate occasions.
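Recomputed by hand from the sample data above (so treat it as illustrative), the CTE output should look like this:

ticket_id | from_stage | days_in_stage
1         | New        | 2
1         | Doing      | 4
1         | Review     | 3
2         | New        | 3
2         | Doing      | 4
2         | Review     | 3
2         | Doing      | 5
3         | New        | 6
4         | Doing      | 7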

With this done, it's just a matter of getting the average as the SUM of total days spent in Doing, divided by the distinct number of tickets that ever left the stage. Doing it this way, instead of simply using AVG, makes sure that the two rows for Ticket 2 get properly accounted for as a single ticket.
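To see the difference, here is a minimal sketch of what the per-visit average would look like, reusing the stage_intervals CTE from above. This is my addition, not part of the original solution; a plain AVG treats Ticket 2's two stays in Doing as two separate observations:
--
-- Previous CTE (stage_intervals)
--
SELECT
-- Averages per visit, so a ticket that re-enters Doing is counted twice
AVG(days_in_stage) AS avg_days_per_visit
FROM
stage_intervals
WHERE
from_stage = 'Doing';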
Not so bad, right?
The goal of this second challenge is to find the most recent contract sequence of every employee. A break of sequence happens when two contracts have a gap of more than one day between them.
In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap or ends one day before the new one starts.
CREATE TABLE contracts (
contract_id integer PRIMARY KEY,
employee_id integer NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL
);
INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES
-- Employee 1: Two continuous contracts
(1, 1, '2024-01-01', '2024-03-31'),
(2, 1, '2024-04-01', '2024-06-30'),
-- Employee 2: One contract, then a gap of three days, then two contracts
(3, 2, '2024-01-01', '2024-02-15'),
(4, 2, '2024-02-19', '2024-04-30'),
(5, 2, '2024-05-01', '2024-07-31'),
-- Employee 3: One contract
(6, 3, '2024-03-01', '2024-08-31');

As a summary of the data:
- Employee 1: Has two continuous contracts.
- Employee 2: One contract, then a gap of three days, then two contracts.
- Employee 3: One contract.
The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one that has a gap before it.
Before explaining the logic behind the solution, I would like you to think about which operation can be used to join the contracts that belong to the same sequence. Focusing only on the second row of data: what information do you need in order to tell whether this contract was a break or not?
I hope it's clear that this is, once again, the perfect scenario for window functions. They are incredibly useful for solving problems like this, and knowing when to use them helps a lot in finding clean solutions.
The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. Having done that, it's simple to compare both dates and check whether there was a break in the sequence.
WITH ordered_contracts AS (
SELECT
*,
LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
FROM
contracts
),
gapped_contracts AS (
SELECT
*,
-- Deals with the case of the first contract, which won't have
-- a previous end date. In this case, it's still the start of a new
-- sequence.
CASE WHEN previous_end_date IS NULL
OR previous_end_date < start_date - INTERVAL '1 day' THEN
1
ELSE
0
END AS is_new_sequence
FROM
ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee who has no gaps will always be on their first sequence, while an employee who had five breaks in contracts will be on their fifth sequence. Funnily enough, this is done with yet another window function.
--
-- Previous CTEs
--
sequences AS (
SELECT
*,
SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
FROM
gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;
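The intermediate result isn't reproduced here, but recomputing it by hand from the sample contracts, the sequences CTE should look roughly like this (a few columns omitted for readability):

contract_id | employee_id | is_new_sequence | sequence_id
1           | 1           | 1               | 1
2           | 1           | 0               | 1
3           | 2           | 1               | 1
4           | 2           | 1               | 2
5           | 2           | 0               | 2
6           | 3           | 1               | 1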

Notice how, for Employee 2, sequence #2 starts right after the first gapped value. To finish this query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences to keep only the most recent one.
--
-- Previous CTEs
--
max_sequence AS (
SELECT
employee_id,
MAX(sequence_id) AS max_sequence_id
FROM
sequences
GROUP BY
employee_id
),
latest_contract_sequence AS (
SELECT
c.contract_id,
c.employee_id,
c.start_date,
c.end_date
FROM
sequences c
JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
AND c.employee_id = m.employee_id
ORDER BY
c.employee_id,
c.start_date
)
SELECT
*
FROM
latest_contract_sequence;

As expected, our final result is basically the data we started with, just with the first contract of Employee 2 missing!
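As a side note, the same result can be reached without the extra GROUP BY and JOIN by adding one more window function. This is just an alternative sketch of the ending, not the original solution:
--
-- Previous CTEs (ordered_contracts, gapped_contracts, sequences)
--
ranked_sequences AS (
SELECT
*,
-- Highest sequence number per employee, attached to every row
MAX(sequence_id) OVER (PARTITION BY employee_id) AS max_sequence_id
FROM
sequences
)
SELECT
contract_id,
employee_id,
start_date,
end_date
FROM
ranked_sequences
WHERE
sequence_id = max_sequence_id
ORDER BY
employee_id,
start_date;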
Finally, the last puzzle. I'm glad you made it this far.
For me, this is the most mind-blowing one, as when I first encountered this problem I thought of a completely different solution that would have been a mess to implement in SQL.
For this puzzle, I've changed the context from what I had to deal with at my job, as I think it will make it easier to explain.
Imagine you're a data analyst at an event venue, and you're analyzing the talks scheduled for an upcoming event. You want to find the time of day when the highest number of talks will be happening at the same time.
This is what you should know about the schedules:
- Rooms are booked in increments of 30 minutes, e.g. from 9h to 10h30.
- The data is clean; there are no overbookings of meeting rooms.
- There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data).
CREATE TABLE meetings (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings (room, start_time, end_time) VALUES
-- Room A meetings
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B meetings
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C meetings
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way to solve this is by using what is called a Sweep Line Algorithm, also known as an event-based solution. That last name actually helps you understand what will be done, as the idea is that instead of dealing with intervals, which is what we have in the original data, we deal with events instead.
To do this, we need to transform every row into two separate events. The first event will be the start of the meeting, and the second event will be the end of the meeting.
WITH events AS (
-- Create an event for the start of each meeting (+1)
SELECT
start_time AS event_time,
1 AS delta
FROM meetings
UNION ALL
-- Create an event for the end of each meeting (-1)
SELECT
-- Small trick to deal with back-to-back meetings (explained later)
end_time - interval '1 minute' as end_time,
-1 AS delta
FROM meetings
)
SELECT * FROM events;

Take the time to understand what is happening here. To create two events from a single row of data, we're simply unioning the dataset with itself; the first half uses the start time as the timestamp, and the second half uses the end time.
You might already notice the delta column and see where this is going. When an event starts, we count it as +1; when it ends, we count it as -1. You might even already be thinking of another window function to solve this, and you're right!
But before that, let me just explain the trick I used on the end dates. As I don't want back-to-back meetings to count as two concurrent meetings, I'm subtracting a single minute from every end date. This way, if one meeting ends and another begins at 10h30, it won't be assumed that two meetings are happening concurrently at 10h30.
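If shifting the timestamps feels too hacky, a common alternative is to keep the real end times and break ties so that end events are processed before start events at the same timestamp. Here's a rough sketch of how the two CTEs would look with that approach; this is my own variation, not part of the original solution:
WITH events AS (
SELECT start_time AS event_time, 1 AS delta FROM meetings
UNION ALL
-- Keep the real end time; ties are handled in the ORDER BY below
SELECT end_time AS event_time, -1 AS delta FROM meetings
),
ordered_events AS (
SELECT
event_time,
delta,
-- Ordering by delta ascending puts the -1 (end) events before the +1 (start)
-- events when they share a timestamp, so back-to-back meetings don't overlap
SUM(delta) OVER (ORDER BY event_time, delta) AS concurrent_meetings
FROM events
)
SELECT * FROM ordered_events;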
Okay, back to the query and yet another window function. This time, though, the function of choice is a rolling SUM.
--
-- Previous CTEs
--
ordered_events AS (
SELECT
event_time,
delta,
SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
FROM events
)
SELECT * FROM ordered_events ORDER BY event_time DESC;

The rolling SUM over the delta column essentially walks down every record and finds how many events are active at that moment. For example, at 9 am sharp, it sees two events starting, so it marks the number of concurrent meetings as two!
When the third meeting starts, the count goes up to three. But when it gets to 9h59 (10 am), two meetings end, bringing the counter back to 1. With this data, the only thing missing is to find when the highest number of concurrent meetings happens.
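For reference, walking through the sample data by hand, the running count per distinct timestamp should look roughly like this: 09:00 → 2, 09:30 → 3, 09:59 → 1, 10:00 → 2, 10:59 → 1, 11:00 → 2, 11:29 → 1, 11:30 → 2, 11:59 → 0.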
--
-- Previous CTEs
--
max_events AS (
-- Find the maximum number of concurrent meetings
SELECT
event_time,
concurrent_meetings,
RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That's it! The interval of 9h30–10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!
This solution looks incredibly simple in my opinion, and it works in so many situations. Every time you're dealing with intervals now, you should ask whether the query wouldn't be easier if you thought about it from the perspective of events.
But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!
Bonus challenge
The context for this one is still the same as the last puzzle, but now, instead of looking for the interval with the most concurrent meetings, the objective is to find bad scheduling. It seems that there are overlaps in the meeting rooms, and these need to be listed so they can be fixed ASAP.
How would you find out whether the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:
- It's still the same algorithm.
- This means you'll still do the UNION, but it will look slightly different.
- You should think from the perspective of each meeting room.
You can use this data for the challenge:
CREATE TABLE meetings_overlap (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
-- Room A meetings
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B meetings
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C meetings
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
-- Overlaps with the previous meeting.
('Room C', '2024-10-01 09:30', '2024-10-01 12:00');
If you're interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.
The main takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable with them, I feel that my queries have become much simpler and easier to read, and I hope the same happens to you.
If you're interested in learning more about them, you would probably enjoy this other blog post I've written, where I go over how to understand and use them effectively.
The second takeaway is that the patterns used in these challenges really do show up in many other places. You might need to find sequences of subscriptions or customer retention, or you might need to find overlaps of tasks. There are many situations where you will need to use window functions in a very similar fashion to what was done in these puzzles.
The third thing I want you to remember is the approach of working with events instead of intervals. I've looked back at some problems I solved a long time ago that I could have used this pattern on to make my life easier, and unfortunately, I didn't know about it at the time.
I really do hope you enjoyed this post and gave the puzzles a shot yourself. I'm sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions!
Thank you so much for reading. If you have questions or just want to get in touch with me, don't hesitate to contact me at mtrentz.com.
All images are by the author unless stated otherwise.