Top 20 SQL Interview Questions for Data Science (2026) – With Examples

A lot of people go into data science interviews expecting Python, machine learning, statistics, and maybe a few case studies.

Then SQL shows up in the first round and suddenly becomes the thing that decides whether they move forward.

That surprises many candidates, but honestly, it makes sense. Before you build dashboards, train models, or run experiments, you need data. And in most companies, that data is sitting inside relational databases. If you cannot query it properly, everything else becomes harder.

That is why SQL keeps showing up in interviews for data analysts, data scientists, machine learning engineers, and even product analytics roles.

This post covers some of the most common SQL interview questions for data science, explained in simple words with examples you can actually understand and say out loud in an interview.

Let’s start with the questions that come up again and again.


1. What is the difference between WHERE and HAVING in SQL?

This is one of the most common SQL questions and still catches people off guard.

The easiest way to remember it is this:

  • WHERE filters rows before grouping
  • HAVING filters grouped results after GROUP BY
SELECT department, COUNT(*) AS emp_count
FROM employees
WHERE salary > 50000
GROUP BY department;

Here, WHERE removes rows before the grouping happens.

SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

Here, HAVING filters the grouped output.

A good interview line is:

WHERE works on rows, HAVING works on groups.


2. Explain different types of JOINs in SQL

JOINs are everywhere in SQL interviews. If you cannot explain them clearly, it becomes obvious very quickly.

Here is the simple version:

  • INNER JOIN returns only matching rows from both tables
  • LEFT JOIN returns all rows from the left table and matching rows from the right table
  • RIGHT JOIN returns all rows from the right table and matching rows from the left table
  • FULL OUTER JOIN returns all rows from both tables, whether they match or not
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.id;

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.id;<br><br>

In data science and analytics work, LEFT JOIN is very common because you usually want to keep all rows from your main table and enrich them with extra information.


3. What are window functions in SQL?

Window functions are one of those topics that help you stand out in interviews.

A window function performs a calculation across related rows without collapsing them into one row like GROUP BY does.

SELECT
    employee_name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM employees;

The OVER() clause is what makes it a window function.

Some important window functions to know:

  • ROW_NUMBER()
  • RANK()
  • DENSE_RANK()
  • LAG()
  • LEAD()
  • SUM() OVER()
  • AVG() OVER()

A very practical interview answer is:

Window functions let you calculate across a group of rows while still keeping each row visible.


4. What is the difference between ROW_NUMBER(), RANK(), and DENSE_RANK()?

This question often comes right after window functions.

ROW_NUMBER() assigns a unique number to every row, even if values are identical. No ties — every row gets a different number.

RANK() assigns the same rank to tied values but skips the next rank. So if two rows tie for rank 2, the next rank is 4, not 3.

DENSE_RANK() also assigns the same rank to ties but does not skip. If two rows tie for rank 2, the next rank is 3.

SELECT
    name,
    score,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
    RANK() OVER (ORDER BY score DESC) AS rank_val,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank_val
FROM scores;

If scores are 100, 100, 90:

  • ROW_NUMBER() → 1, 2, 3
  • RANK() → 1, 1, 3
  • DENSE_RANK() → 1, 1, 2

Also Read: Top 20 Python Interview Questions With Answers


5. How do you find duplicate records in a table?

This is one of the most useful real-world SQL questions.

SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

That gives duplicate values.

If you want the full duplicate rows:

SELECT *
FROM users
WHERE email IN (
    SELECT email
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
);

In interviews, this is usually enough unless they ask how to remove duplicates. If they do, you can mention ROW_NUMBER() with a CTE.


6. What is a CTE in SQL and when would you use it?

CTE stands for Common Table Expression.

It is a temporary named result set created using the WITH keyword. It makes complex queries easier to read and maintain.

WITH high_earners AS (
    SELECT employee_id, name, salary, department
    FROM employees
    WHERE salary > 80000
),
dept_counts AS (
    SELECT department, COUNT(*) AS high_earner_count
    FROM high_earners
    GROUP BY department
)
SELECT *
FROM dept_counts
ORDER BY high_earner_count DESC;

Use CTEs when:

  • a query has multiple steps
  • you want cleaner, readable SQL
  • you want to reuse intermediate logic
  • you are writing recursive queries

7. What is the difference between UNION and UNION ALL?

This is simple but commonly asked.

  • UNION combines results and removes duplicates
  • UNION ALL combines results and keeps duplicates
SELECT city FROM customers
UNION
SELECT city FROM suppliers;

SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;<br><br>

UNION ALL is faster because it skips the deduplication step.

So in interviews, a good line is:

Use UNION ALL when you want all rows and do not need duplicate removal.


8. How do you calculate a running total in SQL?

This is a very practical analytics question.

SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM daily_sales;

This creates a cumulative sum from the first row up to the current row.

Running totals are common in revenue reporting, user growth tracking, and time-series analysis.


9. What is the difference between a subquery and a JOIN?

Both can solve similar problems, but they are not always equally efficient.

A subquery runs separately and passes its result to the outer query.
A JOIN combines rows from multiple tables based on a condition.

SELECT name, salary
FROM employees
WHERE dept_id IN (
    SELECT id
    FROM departments
    WHERE location = 'New York'
);

SELECT e.name, e.salary
FROM employees e
JOIN departments d
ON e.dept_id = d.id
WHERE d.location = 'New York';<br><br>

In many real cases, JOINs perform better and are easier for query optimizers to handle.


10. How do you handle NULL values in SQL?

NULL means missing or unknown value. It is not zero, and it is not an empty string.

That is why it behaves differently.

SELECT *
FROM employees
WHERE manager_id IS NULL;

To replace NULL values:

SELECT name, COALESCE(bonus, 0) AS bonus
FROM employees;

Important rule:

Never use = NULL. Use IS NULL or IS NOT NULL.

That one line alone can save you from silly mistakes in interviews.


11. What is indexing and why does it matter?

Indexes help the database find rows faster without scanning the full table.

Think of it like the index page at the back of a book. Without it, you check every page. With it, you jump much faster to the right place.

CREATE INDEX idx_employee_dept
ON employees(department_id);

Indexes are useful on columns often used in:

  • WHERE
  • JOIN
  • ORDER BY

One thing to mention in interviews:

Indexes improve read performance, but too many indexes can slow down inserts and updates.


12. How do you find the second highest salary in SQL?

This is one of the oldest SQL interview questions, and yes, companies still ask it.

One clean method is:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

A more flexible way is using DENSE_RANK():

SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;

This version is nice because you can easily find the 3rd or 4th highest too.


13. What is the difference between DELETE, TRUNCATE, and DROP?

This is a very standard SQL theory question.

DELETE removes specific rows based on a WHERE condition. It’s logged and can be rolled back in a transaction. It’s slow on large tables.

TRUNCATE removes all rows from a table instantly. It’s much faster than DELETE but cannot be used with a WHERE clause. In most databases it can be rolled back.

DROP removes the entire table including its structure, indexes, and constraints. This cannot be undone.

DELETE FROM employees WHERE department = 'Sales';
TRUNCATE TABLE temp_data;
DROP TABLE old_backup;

The easiest way to explain it:

  • DELETE keeps the table
  • TRUNCATE empties the table
  • DROP removes the table completely

14. How do you pivot data in SQL?

Pivoting means turning row values into columns.

A portable way to do that is using CASE WHEN.

SELECT
    department,
    SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) AS Jan,
    SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) AS Feb,
    SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) AS Mar
FROM monthly_revenue
GROUP BY department;

This is commonly used in analytics for reporting and summary tables.


15. What is a correlated subquery?

A correlated subquery refers to a value from the outer query, so it runs once for each row of the outer query.

SELECT e1.name, e1.salary, e1.department
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department = e1.department
);

This finds employees whose salary is above their department average.

Correlated subqueries are powerful, but on large tables they can be slower than JOINs or window functions.


16. How do you calculate year-over-year growth in SQL?

This is a very business-oriented SQL question and extremely relevant for data roles.

WITH yearly_revenue AS (
    SELECT
        YEAR(order_date) AS year,
        SUM(revenue) AS total_revenue
    FROM orders
    GROUP BY YEAR(order_date)
)
SELECT
    year,
    total_revenue,
    LAG(total_revenue) OVER (ORDER BY year) AS prev_year_revenue,
    ROUND(
        (total_revenue - LAG(total_revenue) OVER (ORDER BY year))
        / LAG(total_revenue) OVER (ORDER BY year) * 100,
        2
    ) AS yoy_growth_pct
FROM yearly_revenue;

LAG() makes this easy because it lets you access the previous row’s value.

This same idea also works for month-over-month and week-over-week growth.


17. What is normalization and why does it matter?

Normalization is the process of organizing database tables to reduce redundancy and improve data integrity.

The common normal forms interviewers may expect you to know are:

  • 1NF – atomic values, no repeating groups
  • 2NF – every non-key column depends on the full primary key
  • 3NF – no non-key column depends on another non-key column

In day-to-day data science work, you often query denormalized analytics tables. But understanding normalization helps you understand where that data originally came from and why tables are designed a certain way.


18. How do you find customers who purchased in one year but not the next year?

This is the kind of question interviewers like because it feels close to real business work.

SELECT DISTINCT customer_id
FROM orders
WHERE YEAR(order_date) = 2025
AND customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE YEAR(order_date) = 2026
);

A safer large-table version uses LEFT JOIN:

SELECT DISTINCT o2025.customer_id
FROM orders o2025
LEFT JOIN orders o2026
    ON o2025.customer_id = o2026.customer_id
    AND YEAR(o2026.order_date) = 2026
WHERE YEAR(o2025.order_date) = 2025
AND o2026.customer_id IS NULL;

This pattern is often used for churn and retention analysis.


19. What is the difference between OLTP and OLAP?

This is a conceptual question that comes up surprisingly often.

  • OLTP systems are built for day-to-day transactions
  • OLAP systems are built for analysis over large volumes of historical data

Examples:

  • OLTP → PostgreSQL, MySQL handling app transactions
  • OLAP → Snowflake, BigQuery, Redshift for analytics

A good interview summary is:

OLTP is for running the business. OLAP is for analyzing the business.


20. How would you optimize a slow SQL query?

This is one of the best open-ended interview questions because it shows how you think.

The first step should always be:

Check the execution plan using EXPLAIN or EXPLAIN ANALYZE.

EXPLAIN ANALYZE
SELECT e.name, d.department_name, SUM(s.amount)
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN sales s ON e.id = s.employee_id
WHERE s.sale_date >= '2026-01-01'
GROUP BY e.name, d.department_name;

After that, common improvements include:

  • adding indexes on join and filter columns
  • avoiding SELECT *
  • filtering early
  • reducing unnecessary subqueries
  • replacing slow correlated subqueries where needed
  • checking whether joins are exploding row counts

The main thing interviewers want is not just random tuning ideas. They want to hear that you measure first, then optimize.


Quick Tips Before Your SQL Interview

A few small habits help a lot:

  • practice writing queries without autocomplete
  • revise joins, group by, window functions, and CTEs
  • explain your thinking while solving
  • be careful with NULL, duplicates, and ties
  • always think about performance when the table is large

A lot of SQL interview questions are not hard because of syntax. They are hard because people panic and stop thinking clearly.


Final Thoughts

SQL is still one of the most important skills in data science interviews because it sits right at the foundation of real work. Before models, before dashboards, before fancy machine learning pipelines, somebody has to fetch the data correctly.

That is usually where SQL comes in.

So if you are preparing for a data science interview, do not treat SQL like a side topic. Treat it like a core skill.

If you can explain joins, window functions, grouping, filtering, null handling, and query optimization in simple words, you are already in a much better position than most candidates.


FAQ Section

Is SQL important for data science interviews?

Yes. SQL is one of the most asked topics in data science interviews because most business data lives in relational databases, and companies expect candidates to query and analyze it confidently.

Are window functions important for SQL interviews?

Yes. Window functions like ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), and LEAD() are commonly asked in analyst and data science interviews.

What SQL topics should I prepare first for a data science interview?

Start with SELECT, WHERE, GROUP BY, HAVING, joins, subqueries, CTEs, window functions, NULL handling, and basic query optimization.

Do data scientists need to know query optimization?

Yes, at least at a practical level. You may not act like a database administrator, but knowing how to read slow queries, use indexes wisely, and avoid inefficient SQL is very useful.

Leave a Reply

Scroll to Top