Master SQL interview questions covering Joins, Queries, Normalization, Indexes, Stored Procedures, and Database concepts from beginner to advanced.
SQL (Structured Query Language) is the standard language for managing and querying relational databases. It is divided into four sub-languages:
DDL (Data Definition Language): CREATE, ALTER, DROP, TRUNCATE — define schema.
DML (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE — manipulate data.
DCL (Data Control Language): GRANT, REVOKE — control permissions.
TCL (Transaction Control Language): COMMIT, ROLLBACK, SAVEPOINT — manage transactions.
SQL clauses are written in one order but executed in a different order:
-- Written order:
SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT
-- Execution order:
1. FROM / JOIN (identify tables, apply joins)
2. WHERE (filter rows before grouping)
3. GROUP BY (group remaining rows)
4. HAVING (filter groups)
5. SELECT (evaluate expressions, apply DISTINCT)
6. ORDER BY (sort result)
7. LIMIT / OFFSET (return subset)
Understanding this prevents common mistakes like referencing a SELECT alias inside a WHERE clause (it hasn't been defined yet at that stage).
WHERE filters individual rows before grouping and cannot contain aggregate functions. HAVING filters groups after GROUP BY and can use aggregate functions.
-- WRONG: aggregate in WHERE
SELECT dept, COUNT(*) FROM employees
WHERE COUNT(*) > 5 GROUP BY dept; -- Error!
-- CORRECT: use HAVING for aggregates
SELECT dept, COUNT(*) AS cnt
FROM employees
GROUP BY dept
HAVING COUNT(*) > 5;
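The WHERE-vs-HAVING rule can be verified with a quick runnable sketch using Python's built-in sqlite3; the table name and sample rows here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(f"e{i}", "Sales") for i in range(6)] + [("x1", "HR")])

# Aggregate in WHERE is rejected by the engine
try:
    conn.execute("SELECT dept FROM employees WHERE COUNT(*) > 5 GROUP BY dept")
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)  # misuse of aggregate

# The same filter expressed with HAVING works
rows = conn.execute("""
    SELECT dept, COUNT(*) AS cnt
    FROM employees
    GROUP BY dept
    HAVING COUNT(*) > 5
""").fetchall()
print(where_error, rows)
```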
DELETE removes rows one at a time and supports a WHERE clause; changes are logged row-by-row, can be rolled back, and fire triggers.
TRUNCATE removes all rows at once with no WHERE clause; it uses minimal logging, resets identity/auto-increment counters, usually cannot be rolled back, and does not fire row-level triggers.
DELETE FROM orders WHERE order_date < '2020-01-01';
TRUNCATE TABLE temp_staging;
DROP TABLE old_archive;
NULL represents an unknown or missing value. SQL
uses three-valued logic (TRUE / FALSE / UNKNOWN). Any comparison
with NULL yields UNKNOWN, not TRUE or FALSE.
-- WRONG: these always return 0 rows
SELECT * FROM t WHERE col = NULL;
SELECT * FROM t WHERE col != NULL;
-- CORRECT
SELECT * FROM t WHERE col IS NULL;
SELECT * FROM t WHERE col IS NOT NULL;
-- COALESCE replaces NULL with a default
SELECT COALESCE(salary, 0) FROM employees;
-- NULL in aggregates: COUNT(*) counts all rows,
-- COUNT(col) ignores NULLs in that column
SELECT COUNT(*), COUNT(salary) FROM employees;
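Three-valued logic is easy to see in action. A minimal sketch using Python's sqlite3, with an invented employees table where one salary is NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 100), ("b", None), ("c", 200)])

# '= NULL' yields UNKNOWN for every row, so nothing matches
eq_null = conn.execute(
    "SELECT name FROM employees WHERE salary = NULL").fetchall()
# IS NULL is the correct test
is_null = conn.execute(
    "SELECT name FROM employees WHERE salary IS NULL").fetchall()
# COUNT(*) counts rows; COUNT(col) skips NULLs
counts = conn.execute(
    "SELECT COUNT(*), COUNT(salary) FROM employees").fetchone()
print(eq_null, is_null, counts)
```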
CHAR(n) stores fixed-length strings padded to n characters, VARCHAR(n) stores variable-length strings up to n characters, and TEXT stores large free-form text.
CREATE TABLE example (
code CHAR(3), -- 'US ', 'GBR'
username VARCHAR(50), -- variable
bio TEXT -- large free-form text
);
DISTINCT removes duplicate rows from the result
set. GROUP BY groups rows and allows aggregate functions; it also
effectively de-duplicates, but its primary purpose is aggregation.
-- DISTINCT: unique department names
SELECT DISTINCT dept FROM employees;
-- GROUP BY: unique depts + count of each
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
-- Both produce same unique depts, but GROUP BY is
-- more powerful when you need aggregates.
LIKE performs pattern matching on strings using two wildcards:
% : matches zero or more characters.
_ : matches exactly one character.
-- Names starting with 'A'
SELECT * FROM users WHERE name LIKE 'A%';
-- Names ending with 'son'
SELECT * FROM users WHERE name LIKE '%son';
-- 5-character names starting with 'J'
SELECT * FROM users WHERE name LIKE 'J____';
-- Case-insensitive: use ILIKE (PostgreSQL) or LOWER()
SELECT * FROM users WHERE LOWER(name) LIKE 'a%';
UNION combines the results of two queries and removes duplicate rows; UNION ALL keeps all rows (including duplicates) and is faster because no de-duplication pass is needed.
-- Same number of columns and compatible types required
SELECT name FROM customers
UNION
SELECT name FROM suppliers; -- removes duplicates
SELECT name FROM customers
UNION ALL
SELECT name FROM suppliers; -- keeps all, faster
Constraints enforce rules on data at the database level:
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL REFERENCES customers(id),
status VARCHAR(20) DEFAULT 'pending',
amount DECIMAL(10,2) CHECK (amount > 0),
email VARCHAR(100) UNIQUE
);
A table can have only one PRIMARY KEY (which forbids NULLs) but many UNIQUE constraints (which generally allow NULLs):
CREATE TABLE users (
user_id INT PRIMARY KEY, -- one PK, no NULLs
email VARCHAR(100) UNIQUE, -- NULLs allowed
ssn CHAR(11) UNIQUE -- multiple UNIQUE keys OK
);
A composite key is a primary (or unique) key made up of two or more columns. The combination must be unique, even if individual columns are not.
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (order_id, product_id) -- composite PK
);
-- order_id=1, product_id=2 → valid
-- order_id=1, product_id=2 → duplicate! violates PK
-- order_id=1, product_id=3 → valid (different combination)
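The uniqueness of the combination can be demonstrated with sqlite3 (sample values invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_items (
        order_id INTEGER, product_id INTEGER, quantity INTEGER,
        PRIMARY KEY (order_id, product_id))
""")
conn.execute("INSERT INTO order_items VALUES (1, 2, 10)")
conn.execute("INSERT INTO order_items VALUES (1, 3, 5)")  # new combination: OK
try:
    conn.execute("INSERT INTO order_items VALUES (1, 2, 99)")  # repeat combination
    violated = False
except sqlite3.IntegrityError:
    violated = True  # composite PK rejected the duplicate pair
n_rows = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
```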
IN checks if a value matches any value in a list
or subquery. It is equivalent to multiple OR conditions but cleaner and
often optimized better by the query planner.
-- Equivalent queries:
SELECT * FROM products
WHERE category IN ('Electronics', 'Books', 'Toys');
SELECT * FROM products
WHERE category = 'Electronics'
OR category = 'Books'
OR category = 'Toys';
-- IN with subquery:
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'IN');
-- NOT IN pitfall: if subquery returns any NULL,
-- NOT IN returns no rows (UNKNOWN logic). Use NOT EXISTS instead.
BETWEEN low AND high is a shorthand for
>= low AND <= high. Both endpoints are
inclusive. Works on numbers, dates, and strings.
-- Numeric range
SELECT * FROM products WHERE price BETWEEN 10 AND 100;
-- Date range (inclusive)
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
-- String range (alphabetical)
SELECT * FROM employees WHERE last_name BETWEEN 'A' AND 'M';
A schema is a logical namespace that groups related tables within a database (e.g. a sales schema inside ecommerce_db).
-- PostgreSQL hierarchy:
-- Server → Database → Schema → Table
CREATE SCHEMA sales;
CREATE TABLE sales.orders (...);
CREATE TABLE sales.returns (...);
INNER JOIN returns rows that have matching values in both tables.
-- Employees with their department names
SELECT e.employee_id, e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id;
-- Only employees who belong to an existing department are returned.
-- Employees with no department (NULL dept_id) are excluded.
Use a LEFT JOIN when you want all records from the left table regardless of whether there is a matching record on the right. Unmatched right-side columns appear as NULL.
-- All customers, even those with no orders
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;
-- Find customers with NO orders (anti-join pattern)
SELECT c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
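Both LEFT JOIN patterns can be checked end-to-end with sqlite3; table names and rows below are invented ("Bob" is the customer with no orders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")
# All customers with their order counts (LEFT JOIN keeps Bob with 0)
counts = conn.execute("""
    SELECT c.name, COUNT(o.order_id)
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
# Anti-join: customers with no orders at all
no_orders = conn.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").fetchall()
```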
FULL OUTER JOIN returns all rows from both tables. Where there is no match, NULLs fill the columns of the side that has no matching row.
SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d
ON e.department_id = d.department_id;
-- Rows with employees but no dept: dept_name = NULL
-- Rows with depts but no employee: e.name = NULL
-- MySQL does NOT support FULL OUTER JOIN natively.
-- Emulate with UNION:
SELECT e.name, d.department_name FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id
UNION
SELECT e.name, d.department_name FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
A CROSS JOIN produces the Cartesian product — every row from the left table is combined with every row from the right table. If left has 5 rows and right has 3, result has 15 rows. Use with care; large tables cause huge results.
-- Generate all size-color combinations
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;
-- sizes(S,M,L) × colors(Red,Blue) = 6 rows
-- Implicit CROSS JOIN (old syntax, avoid)
SELECT * FROM a, b; -- same as CROSS JOIN
A self join joins a table to itself using aliases. Common for hierarchical data (employee-manager relationships) or comparing rows within the same table.
-- employees table: (id, name, manager_id)
-- manager_id references employees.id
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
-- Find employees earning more than their manager
SELECT e.name, e.salary, m.salary AS mgr_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
-- ON: explicit, flexible
SELECT * FROM employees e
JOIN departments d ON e.department_id = d.department_id;
-- USING: shorthand (both tables have 'department_id')
SELECT * FROM employees
JOIN departments USING (department_id);
-- USING merges the column — no prefix needed
-- ON keeps both columns — must qualify e.dept_id / d.dept_id
Common approaches to finding rows in one table with no match in another — the anti-join pattern:
-- 1. LEFT JOIN + IS NULL (most portable)
SELECT a.id FROM table_a a
LEFT JOIN table_b b ON a.id = b.id
WHERE b.id IS NULL;
-- 2. NOT EXISTS (often best optimizer plan)
SELECT id FROM table_a a
WHERE NOT EXISTS (
SELECT 1 FROM table_b b WHERE b.id = a.id
);
-- 3. NOT IN (AVOID if subquery can return NULLs)
SELECT id FROM table_a
WHERE id NOT IN (SELECT id FROM table_b WHERE id IS NOT NULL);
-- EXCEPT / MINUS (set operation, some databases)
SELECT id FROM table_a
EXCEPT
SELECT id FROM table_b;
A Cartesian product occurs when either the join condition is missing or the relationship is one-to-many/many-to-many and you don't aggregate properly.
-- Missing ON clause → implicit CROSS JOIN
SELECT * FROM employees, departments; -- 100 × 10 = 1000 rows!
-- Many-to-many without aggregation: duplicates
-- orders (1 customer, many orders) joined with
-- order_items (many items per order):
SELECT c.name, SUM(oi.amount)
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id
GROUP BY c.name; -- aggregate at the order_items grain to avoid double counting
-- Always verify JOIN result row count matches expectations.
A NATURAL JOIN automatically joins tables on all columns with the same name. It is convenient but dangerous in production — adding or renaming columns can silently change join behavior.
-- Automatically joins on 'department_id' (shared column)
SELECT * FROM employees NATURAL JOIN departments;
-- Equivalent to:
SELECT * FROM employees
JOIN departments USING (department_id);
-- Avoid in production code; be explicit with ON or USING.
Chain multiple JOIN clauses. The database optimizer determines the optimal join order; the written order is just for readability.
SELECT
c.name AS customer,
o.order_id,
p.product_name,
oi.quantity
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.order_date >= '2024-01-01'
ORDER BY o.order_id;
-- Method 1: Subquery
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
-- Method 2: DENSE_RANK window function (most robust)
SELECT salary FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
) ranked
WHERE rnk = 2
LIMIT 1;
-- Method 3: OFFSET (simple but fragile with ties)
SELECT DISTINCT salary FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
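All three methods should agree on the same data. A quick sqlite3 sketch with an invented salary list (two employees tied at the top, so ties are exercised):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 90000), ("b", 90000), ("c", 85000), ("d", 70000)])

# Method 1: subquery
m1 = conn.execute("""
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]
# Method 2: DENSE_RANK
m2 = conn.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees) AS ranked
    WHERE rnk = 2 LIMIT 1
""").fetchone()[0]
# Method 3: DISTINCT + OFFSET
m3 = conn.execute("""
    SELECT DISTINCT salary FROM employees
    ORDER BY salary DESC LIMIT 1 OFFSET 1
""").fetchone()[0]
```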
A non-equi join uses a condition other than equality
(=) in the ON clause, such as <, >,
BETWEEN.
-- Assign salary grade based on range table
CREATE TABLE salary_grades (
grade CHAR(1), low INT, high INT
);
SELECT e.name, e.salary, sg.grade
FROM employees e
JOIN salary_grades sg
ON e.salary BETWEEN sg.low AND sg.high;
-- Find all employees whose salary is above the avg
-- of anyone hired before them (historical comparison)
SELECT e1.name, e1.salary
FROM employees e1
JOIN employees e2 ON e2.hire_date < e1.hire_date
GROUP BY e1.id, e1.name, e1.salary
HAVING e1.salary > AVG(e2.salary);
-- Customers who are also suppliers
SELECT name FROM customers
INTERSECT
SELECT name FROM suppliers;
-- Customers who are NOT suppliers
SELECT name FROM customers
EXCEPT
SELECT name FROM suppliers;
-- Both require same number of columns and compatible types.
-- Find duplicate emails in users table
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Show all rows that are duplicates (with row details)
SELECT *
FROM users
WHERE email IN (
SELECT email FROM users
GROUP BY email
HAVING COUNT(*) > 1
)
ORDER BY email;
-- Delete duplicates, keep one (CTE + ROW_NUMBER; SQL Server
-- allows DELETE against an updatable CTE)
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
)
DELETE FROM cte WHERE rn > 1;
-- PostgreSQL/MySQL: delete by key instead, e.g.
-- DELETE FROM users WHERE id NOT IN (
--   SELECT MIN(id) FROM users GROUP BY email);
SELECT
COUNT(*) AS total_rows,
COUNT(salary) AS non_null_salaries,
SUM(salary) AS payroll,
AVG(salary) AS avg_salary,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM employees;
Every column in the SELECT list must either be part of the
GROUP BY clause or wrapped in an aggregate function. Violating this causes
an error in standard SQL (MySQL with ONLY_FULL_GROUP_BY mode disabled
is an exception but returns non-deterministic values).
-- CORRECT
SELECT dept, job_title, COUNT(*) AS cnt
FROM employees
GROUP BY dept, job_title; -- all non-aggregated cols in GROUP BY
-- ERROR in standard SQL:
SELECT dept, name, COUNT(*) -- 'name' not in GROUP BY
FROM employees
GROUP BY dept;
-- employees: 100 rows, 20 have salary = NULL
SELECT
COUNT(*) AS all_rows, -- 100
COUNT(1) AS also_all, -- 100
COUNT(salary) AS has_salary -- 80
FROM employees;
GROUP BY ROLLUP(a, b) produces subtotals and a
grand total in addition to the regular group rows. It creates a hierarchy: (a,b),
(a), ().
SELECT region, dept, SUM(sales) AS total
FROM sales_data
GROUP BY ROLLUP(region, dept);
-- Result includes:
-- (region, dept) → detail rows
-- (region, NULL) → subtotal per region
-- (NULL, NULL) → grand total
-- Use GROUPING() to distinguish NULL-subtotal from real NULL:
SELECT
GROUPING(region) AS is_region_total,
region, dept, SUM(sales)
FROM sales_data
GROUP BY ROLLUP(region, dept);
GROUP BY CUBE(a, b) generates subtotals for
all combinations of the listed columns (2^n combinations). For CUBE(a,b):
(a,b), (a), (b), ().
SELECT region, dept, SUM(sales)
FROM sales_data
GROUP BY CUBE(region, dept);
-- Produces:
-- (region, dept) → all detail combinations
-- (region, NULL) → subtotal by region
-- (NULL, dept) → subtotal by dept ← ROLLUP does NOT do this
-- (NULL, NULL) → grand total
-- ROLLUP: hierarchical (one-way drill)
-- CUBE: all cross-combinations (multidimensional analysis)
-- Using a window function (modern, recommended)
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
-- Partitioned running total (per customer)
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS customer_running_total
FROM orders;
-- Using conditional aggregation (most portable)
SELECT
dept,
SUM(CASE WHEN year = 2022 THEN sales END) AS sales_2022,
SUM(CASE WHEN year = 2023 THEN sales END) AS sales_2023,
SUM(CASE WHEN year = 2024 THEN sales END) AS sales_2024
FROM dept_sales
GROUP BY dept;
-- SQL Server native PIVOT (use a derived table so stray
-- columns don't become implicit grouping keys):
SELECT dept, [2022], [2023], [2024]
FROM (SELECT dept, year, sales FROM dept_sales) src
PIVOT (SUM(sales) FOR year IN ([2022],[2023],[2024])) AS pvt;
GROUPING SETS lets you specify exactly which
grouping combinations you want, without computing all of them like CUBE does.
-- Equivalent to UNION of separate GROUP BYs but more efficient
SELECT region, dept, SUM(sales)
FROM sales_data
GROUP BY GROUPING SETS (
(region, dept), -- detail
(region), -- subtotal by region only
() -- grand total
);
-- Does NOT compute (dept) alone — unlike CUBE
-- Method 1: ORDER BY + LIMIT
SELECT dept, AVG(salary) AS avg_sal
FROM employees
GROUP BY dept
ORDER BY avg_sal DESC
LIMIT 1;
-- Method 2: Subquery (handles ties)
SELECT dept, AVG(salary) AS avg_sal
FROM employees
GROUP BY dept
HAVING AVG(salary) = (
SELECT MAX(avg_sal)
FROM (SELECT AVG(salary) AS avg_sal FROM employees GROUP BY dept) t
);
-- Method 3: RANK() window function (handles ties cleanly)
SELECT dept, avg_sal FROM (
SELECT dept,
AVG(salary) AS avg_sal,
RANK() OVER (ORDER BY AVG(salary) DESC) AS rnk
FROM employees GROUP BY dept
) t WHERE rnk = 1;
COUNT(DISTINCT col) counts the number of unique
non-NULL values in a column. Useful for metrics like unique active users, distinct
products sold, etc.
-- How many unique customers placed orders?
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;
-- Unique products per category
SELECT category, COUNT(DISTINCT product_id) AS unique_products
FROM products
GROUP BY category;
-- Note: DISTINCT inside COUNT can be slow on large tables.
-- Approximate alternatives: HyperLogLog, APPROX_COUNT_DISTINCT (BigQuery)
A subquery (inner query / nested query) is a SELECT statement embedded inside another SQL statement. Common types: scalar (returns a single value), derived table (used in FROM), and correlated (references the outer query).
-- Scalar: employees earning above average
SELECT name FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Derived table
SELECT dept, avg_sal FROM (
SELECT dept, AVG(salary) AS avg_sal FROM employees GROUP BY dept
) dept_avg WHERE avg_sal > 50000;
A correlated subquery references a column from the outer query. It is executed once per row of the outer query, making it potentially slow. Can often be rewritten as a JOIN for better performance.
-- Find employees earning more than the avg of their OWN dept
SELECT name, salary, dept
FROM employees e_outer
WHERE salary > (
SELECT AVG(salary)
FROM employees e_inner
WHERE e_inner.dept = e_outer.dept -- correlation
);
-- Equivalent (faster) JOIN approach:
SELECT e.name, e.salary, e.dept
FROM employees e
JOIN (SELECT dept, AVG(salary) AS avg_sal
FROM employees GROUP BY dept) dept_avg
ON e.dept = dept_avg.dept
WHERE e.salary > dept_avg.avg_sal;
-- IN (subquery evaluated once, result materialised)
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE amount > 1000);
-- EXISTS (stops at first match — often faster)
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id AND o.amount > 1000
);
-- NOT IN danger: if orders.customer_id has any NULL,
-- NOT IN returns 0 rows. NOT EXISTS handles NULLs correctly.
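The NOT IN pitfall is worth seeing concretely. A sqlite3 sketch with invented tables, where orders contains one NULL customer_id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1), (NULL);   -- note the NULL
""")
# NOT IN against a list containing NULL: every test is UNKNOWN, 0 rows
not_in = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
""").fetchall()
# NOT EXISTS handles the NULL correctly and finds Bob
not_exists = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
```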
A CTE (introduced with the WITH
keyword) is a named temporary result set scoped to a single statement. It improves
readability and can be referenced multiple times within the same query.
WITH high_earners AS (
SELECT id, name, salary, dept
FROM employees
WHERE salary > 80000
),
dept_counts AS (
SELECT dept, COUNT(*) AS cnt
FROM high_earners
GROUP BY dept
)
SELECT h.name, h.salary, d.cnt AS high_earners_in_dept
FROM high_earners h
JOIN dept_counts d ON h.dept = d.dept
ORDER BY h.salary DESC;
CTEs are not inherently materialized in all databases — some
treat them as inline views. Use MATERIALIZED hint in PostgreSQL to
force caching.
A recursive CTE references itself, enabling traversal of
hierarchical or graph data (org charts, bill-of-materials, category trees).
Structure: anchor member UNION ALL recursive member.
-- Org chart: find all reports under manager_id = 1
WITH RECURSIVE org_tree AS (
-- Anchor: start with the root
SELECT id, name, manager_id, 1 AS depth
FROM employees
WHERE manager_id IS NULL -- or WHERE id = 1
UNION ALL
-- Recursive: join children to current level
SELECT e.id, e.name, e.manager_id, ot.depth + 1
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth, name;
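SQLite also supports WITH RECURSIVE, so the org-chart traversal above can be run directly; the four-person hierarchy below is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Root', NULL), (2, 'A', 1), (3, 'B', 1), (4, 'C', 2);
""")
tree = conn.execute("""
    WITH RECURSIVE org_tree AS (
        SELECT id, name, manager_id, 1 AS depth
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, e.manager_id, ot.depth + 1
        FROM employees e JOIN org_tree ot ON e.manager_id = ot.id
    )
    SELECT name, depth FROM org_tree ORDER BY depth, name
""").fetchall()
```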
A derived table is a subquery used in the
FROM clause, given an alias. It behaves like a temporary table for the
duration of the query. CTEs are often cleaner alternatives.
SELECT dept, avg_sal, RANK() OVER (ORDER BY avg_sal DESC) AS rnk
FROM (
SELECT dept, AVG(salary) AS avg_sal
FROM employees
GROUP BY dept
) AS dept_summary -- derived table alias (required in MySQL)
WHERE avg_sal > 50000;
ANY is true if the comparison holds for at least one value returned by the subquery; ALL requires it to hold for every value.
-- Salary greater than ANY salary in 'Sales' dept
SELECT name FROM employees
WHERE salary > ANY (
SELECT salary FROM employees WHERE dept = 'Sales'
);
-- equivalent to: salary > MIN(sales_salaries)
-- Salary greater than ALL salaries in 'Sales' dept
SELECT name FROM employees
WHERE salary > ALL (
SELECT salary FROM employees WHERE dept = 'Sales'
);
-- equivalent to: salary > MAX(sales_salaries)
A CTE lives only for the duration of a single statement and is defined inline with WITH; a view is a persisted, named query stored in the database and reusable across statements.
-- CTE: one-time use
WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t)
SELECT * FROM ranked WHERE rn = 1;
-- View: persisted, reusable
CREATE VIEW top_customers AS
SELECT customer_id, SUM(amount) AS total
FROM orders GROUP BY customer_id HAVING SUM(amount) > 10000;
SELECT * FROM top_customers; -- reusable any time
-- Using DENSE_RANK (handles ties, most robust)
-- Replace 3 with N
SELECT salary FROM (
SELECT salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
) t
WHERE rnk = 3
LIMIT 1;
-- Using LIMIT/OFFSET (no tie handling)
SELECT DISTINCT salary FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- OFFSET = N-1
-- Using correlated subquery (N=3)
SELECT DISTINCT salary FROM employees e1
WHERE 2 = (
SELECT COUNT(DISTINCT salary) FROM employees e2
WHERE e2.salary > e1.salary
);
-- Subquery (fine for simple one-off use):
SELECT * FROM employees
WHERE dept_id = (SELECT id FROM departments WHERE name = 'Sales');
-- CTE (better for reuse and readability):
WITH sales_dept AS (SELECT id FROM departments WHERE name = 'Sales'),
sales_emps AS (SELECT * FROM employees WHERE dept_id IN (SELECT id FROM sales_dept))
SELECT name, salary FROM sales_emps WHERE salary > 60000;
Normalization organizes data to reduce redundancy and improve integrity. The common normal forms: 1NF (atomic column values, no repeating groups), 2NF (1NF plus no partial dependency on part of a composite key), 3NF (2NF plus no transitive dependency between non-key columns).
-- 3NF violation: zip_code determines city (transitive dependency)
-- Split: address(id, street, zip) + zip_city(zip, city)
-- Now zip_code → city is in its own table, no transitive dep.
Denormalization intentionally introduces redundancy to improve read performance. Common in analytics, data warehouses, and reporting layers where JOINs are expensive.
-- Normalized (OLTP):
orders(id, customer_id) → customers(id, name, city)
-- Denormalized (OLAP / reporting):
orders(id, customer_id, customer_name, customer_city)
-- No join needed for common reports; read is faster.
-- Trade-off: updates must propagate to all copies.
-- Typical patterns: summary tables, pre-aggregated columns,
-- embedding frequently-joined dimensions.
MERGE (SQL standard) performs an UPSERT — INSERT
if a row doesn't exist, UPDATE if it does — in a single atomic statement. MySQL uses
INSERT ... ON DUPLICATE KEY UPDATE or REPLACE INTO.
-- Standard MERGE (SQL Server, PostgreSQL, Oracle)
MERGE INTO target_table t
USING source_table s ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET t.name = s.name, t.salary = s.salary
WHEN NOT MATCHED THEN
INSERT (id, name, salary) VALUES (s.id, s.name, s.salary)
WHEN NOT MATCHED BY SOURCE THEN
DELETE; -- optional (SQL Server extension): remove rows not in source
-- PostgreSQL ON CONFLICT (simpler upsert)
INSERT INTO employees (id, name, salary)
VALUES (1, 'Alice', 90000)
ON CONFLICT (id) DO UPDATE SET salary = EXCLUDED.salary;
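SQLite shares the ON CONFLICT syntax (since 3.24), so the upsert can be verified directly; ids and salaries below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("INSERT INTO employees VALUES (1, 'Alice', 80000)")

upsert = """
    INSERT INTO employees (id, name, salary) VALUES (?, ?, ?)
    ON CONFLICT (id) DO UPDATE SET salary = excluded.salary
"""
conn.execute(upsert, (1, "Alice", 90000))  # id=1 exists: UPDATE path
conn.execute(upsert, (2, "Bob", 70000))    # id=2 is new: INSERT path
rows = conn.execute("SELECT id, salary FROM employees ORDER BY id").fetchall()
```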
-- Add a column
ALTER TABLE employees ADD COLUMN phone VARCHAR(20);
-- Rename a column (PostgreSQL/MySQL 8+)
ALTER TABLE employees RENAME COLUMN phone TO mobile;
-- Modify data type (MySQL)
ALTER TABLE employees MODIFY COLUMN salary DECIMAL(12,2);
-- Change column definition (SQL Server)
ALTER TABLE employees ALTER COLUMN salary DECIMAL(12,2) NOT NULL;
-- Drop a column
ALTER TABLE employees DROP COLUMN phone;
-- Add a constraint
ALTER TABLE employees
ADD CONSTRAINT fk_dept FOREIGN KEY (dept_id)
REFERENCES departments(id);
-- Drop a constraint
ALTER TABLE employees DROP CONSTRAINT fk_dept;
-- Grant SELECT and INSERT on a table to a user
GRANT SELECT, INSERT ON employees TO 'analyst_user';
-- Grant all privileges
GRANT ALL PRIVILEGES ON DATABASE mydb TO admin_user;
-- Grant with ability to re-grant (WITH GRANT OPTION)
GRANT SELECT ON orders TO reporter WITH GRANT OPTION;
-- Revoke INSERT
REVOKE INSERT ON employees FROM 'analyst_user';
-- Roles (modern approach)
CREATE ROLE read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO read_only;
GRANT read_only TO 'analyst_user';
Inserts the results of a SELECT query into a table. Useful for copying data, archiving, ETL, and bulk loading.
-- Copy rows from one table to another
INSERT INTO archived_orders (id, customer_id, amount, order_date)
SELECT id, customer_id, amount, order_date
FROM orders
WHERE order_date < '2023-01-01';
-- Create and populate a new table (SELECT INTO / CREATE TABLE AS)
-- SQL Server:
SELECT * INTO orders_backup FROM orders;
-- PostgreSQL / MySQL:
CREATE TABLE orders_backup AS SELECT * FROM orders;
A natural key is a real-world attribute (email, SSN); a surrogate key is a system-generated identifier with no business meaning.
-- Natural key (email could change — fragile as PK)
CREATE TABLE users (email VARCHAR(100) PRIMARY KEY, name TEXT);
-- Surrogate key (stable, small, fast joins)
CREATE TABLE users (
user_id SERIAL PRIMARY KEY, -- surrogate
email VARCHAR(100) UNIQUE, -- natural, used for lookup
name TEXT
);
Referential integrity ensures a foreign key
value always refers to a valid primary key in the parent table.
ON DELETE and ON UPDATE define what happens when the
parent row changes.
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
FOREIGN KEY (customer_id) REFERENCES customers(id)
ON DELETE CASCADE -- delete orders when customer deleted
ON UPDATE CASCADE -- update FK when PK changes
);
-- Options:
-- CASCADE : propagate the change/delete
-- SET NULL : set FK to NULL
-- SET DEFAULT: set FK to its default value
-- RESTRICT : prevent parent deletion if children exist
-- NO ACTION : like RESTRICT, but the check can be deferred (standard SQL)
A star schema keeps denormalized dimension tables around a central fact table; a snowflake schema normalizes dimensions into sub-dimension tables.
-- Star schema (denormalized)
fact_sales(sale_id, date_id, product_id, customer_id, amount)
dim_product(product_id, name, category, brand) -- all in one table
-- Snowflake schema (normalized)
dim_product(product_id, name, category_id)
dim_category(category_id, category_name, brand_id)
dim_brand(brand_id, brand_name) -- normalized sub-dimension
Updating rows in one table using values from another table. Syntax varies by database.
-- PostgreSQL / SQL Server: UPDATE with FROM
UPDATE employees e
SET salary = e.salary * 1.10
FROM departments d
WHERE e.dept_id = d.id
AND d.name = 'Engineering';
-- MySQL: UPDATE with JOIN
UPDATE employees e
JOIN departments d ON e.dept_id = d.id
SET e.salary = e.salary * 1.10
WHERE d.name = 'Engineering';
-- Standard SQL (works everywhere):
UPDATE employees
SET salary = salary * 1.10
WHERE dept_id = (SELECT id FROM departments WHERE name = 'Engineering');
An index is a separate data structure (usually a B-tree) that stores a sorted subset of column values with pointers to corresponding table rows. It speeds up lookups, sorts, and joins at the cost of extra storage and slower writes.
-- Without index: full table scan O(n)
-- With index: B-tree lookup O(log n)
CREATE INDEX idx_emp_salary ON employees(salary);
CREATE INDEX idx_emp_dept_salary ON employees(dept, salary); -- composite
-- B-tree: default, best for range queries and equality
-- Hash: equality only (fast), no range support
-- GIN: full-text search, arrays (PostgreSQL)
-- GiST: geometric, range types
-- Check existing indexes
SHOW INDEX FROM employees; -- MySQL
\d employees -- PostgreSQL
A clustered index determines the physical order of the table rows (one per table); a nonclustered index is a separate structure with pointers back to the rows.
-- SQL Server syntax:
CREATE CLUSTERED INDEX idx_pk ON orders(order_id);
CREATE NONCLUSTERED INDEX idx_date ON orders(order_date);
-- A "covering index" avoids the bookmark lookup by
-- including all needed columns in the index itself:
CREATE INDEX idx_cover ON orders(customer_id)
INCLUDE (order_date, amount); -- covers the query below
SELECT order_date, amount FROM orders WHERE customer_id = 5;
A covering index contains all the columns a query needs, so the database engine can satisfy the query entirely from the index without accessing the base table rows — an "index-only scan".
-- Query: frequently run report
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 42 AND order_date > '2024-01-01';
-- Covering index: includes all three columns
CREATE INDEX idx_covering
ON orders (customer_id, order_date, amount);
-- Engine reads only the index; no row fetching needed.
-- EXPLAIN output shows "Using index" in MySQL
-- or "Index Only Scan" in PostgreSQL → covered query.
EXPLAIN shows the query execution plan without
running the query. EXPLAIN ANALYZE (PostgreSQL) actually executes the
query and shows real timing and row counts.
EXPLAIN SELECT * FROM orders WHERE customer_id = 5;
-- MySQL output columns: id, select_type, table, type,
-- possible_keys, key, rows, Extra
-- Key 'type' values (best → worst):
-- const → PK/unique lookup (1 row)
-- ref → index lookup, multiple rows
-- range → index range scan
-- index → full index scan
-- ALL → full table scan ← BAD
-- PostgreSQL:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 5;
-- Shows: Seq Scan / Index Scan, cost=, actual time=, rows=
Common cases where an index is NOT used:
A function on the indexed column: WHERE YEAR(created_at) = 2024 — rewrite as a date range instead.
A leading wildcard: WHERE name LIKE '%son' — use full-text search.
An implicit type conversion: WHERE phone = 12345 when phone is VARCHAR.
-- Index NOT used:
WHERE UPPER(name) = 'ALICE' -- function on column
WHERE salary + 1000 > 90000 -- expression on column
-- FIX: use functional/expression indexes where supported
CREATE INDEX idx_upper_name ON employees (UPPER(name));
WHERE UPPER(name) = 'ALICE' -- now uses index
A partial index (PostgreSQL, SQLite) indexes only a subset of rows that satisfy a WHERE condition. Smaller index, faster scans, less maintenance overhead.
-- Only index active orders (not archived/cancelled)
CREATE INDEX idx_active_orders
ON orders (customer_id, order_date)
WHERE status = 'active';
-- Query that benefits:
SELECT * FROM orders
WHERE customer_id = 5
AND order_date > '2024-01-01'
AND status = 'active'; -- matches partial index condition
-- Index is far smaller than a full index on all rows.
The optimizer relies on up-to-date table statistics to choose good plans; refresh them with ANALYZE (PostgreSQL/MySQL) or UPDATE STATISTICS (SQL Server).
Partitioning divides a large table into smaller physical segments, while appearing as one logical table. The query optimizer prunes irrelevant partitions — "partition pruning".
-- Range partitioning by year (PostgreSQL)
CREATE TABLE orders (
id INT, order_date DATE, amount DECIMAL
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2023
PARTITION OF orders FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024
PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
-- Query with partition pruning (scans only orders_2024):
SELECT * FROM orders WHERE order_date BETWEEN '2024-03-01' AND '2024-06-30';
-- Types: RANGE, LIST, HASH, COMPOSITE
Cardinality is the number of distinct values in an indexed column. High cardinality (e.g. user_id) means the index is very selective — highly useful. Low cardinality (e.g. a boolean or status with 2 values) means the index is not selective — the optimizer may prefer a full table scan.
-- High cardinality (good index candidate)
email VARCHAR(100) UNIQUE -- every value is different
-- Low cardinality (poor index candidate)
is_active BOOLEAN -- only 2 values: 0 or 1
-- Exception: low-cardinality columns can still help
-- in a composite index or partial index:
CREATE INDEX idx_inactive ON users(email) WHERE is_active = 0;
-- Small fraction of rows → index is selective again.
A composite index on (a, b, c) can be used by
queries that filter on the leftmost prefix: a,
(a,b), or (a,b,c). Queries filtering only on
b or c (without a) cannot use the index
efficiently.
CREATE INDEX idx_abc ON t (a, b, c);
-- Uses index: WHERE a = 1
-- Uses index: WHERE a = 1 AND b = 2
-- Uses index: WHERE a = 1 AND b = 2 AND c = 3
-- Uses index (a only): WHERE a = 1 AND c = 3 (skips b, uses only a)
-- Does NOT use: WHERE b = 2 (no leading a)
-- Does NOT use: WHERE c = 3 (no leading a)
-- Design tip: put high-cardinality and equality columns first,
-- then range columns last.
Window functions perform calculations across a set of rows
related to the current row (a "window") without collapsing the rows into groups.
Unlike GROUP BY, all original rows are retained in the result.
-- GROUP BY: collapses rows
SELECT dept, AVG(salary) FROM employees GROUP BY dept;
-- → one row per dept, individual employees gone
-- Window function: retains all rows
SELECT name, salary, dept,
AVG(salary) OVER (PARTITION BY dept) AS dept_avg
FROM employees;
-- → all rows kept; dept_avg shown alongside each employee
-- Syntax: function() OVER (PARTITION BY ... ORDER BY ... frame_clause)
SELECT name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM employees;
-- salary=90000: row_num=1, rnk=1, dense_rnk=1
-- salary=90000: row_num=2, rnk=1, dense_rnk=1 (tie!)
-- salary=85000: row_num=3, rnk=3, dense_rnk=2 ← note gap in RANK
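The tie behavior of the three ranking functions can be reproduced with sqlite3 (3.25+ has window functions); the salaries below are invented, with two rows tied at 90000:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 90000), ("b", 90000), ("c", 85000)])
rows = conn.execute("""
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM employees
    ORDER BY row_num
""").fetchall()
# Tied 90000 rows share rnk=1 and dense_rnk=1; RANK then jumps
# to 3 (gap), while DENSE_RANK continues with 2.
```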
LAG(col, n) accesses a value from a
previous row; LEAD(col, n) accesses a value from a
subsequent row within the window. Useful for period-over-period
comparisons.
SELECT
order_date,
revenue,
LAG(revenue, 1, 0) OVER (ORDER BY order_date) AS prev_month_rev,
LEAD(revenue, 1, 0) OVER (ORDER BY order_date) AS next_month_rev,
revenue - LAG(revenue, 1, 0) OVER (ORDER BY order_date) AS mom_change
FROM monthly_revenue;
-- Per-partition: compare each employee's salary to their
-- previous salary in the same department
SELECT name, dept, salary,
LAG(salary) OVER (PARTITION BY dept ORDER BY hire_date) AS prev_salary
FROM employees;
NTILE(n) divides the result set into n
roughly equal buckets and assigns each row a bucket number. Useful for percentile
grouping (quartiles, deciles).
-- Divide employees into salary quartiles
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary) AS quartile
FROM employees;
-- quartile 1 = lowest 25%, quartile 4 = top 25%
-- Deciles for scoring
SELECT customer_id, total_spend,
NTILE(10) OVER (ORDER BY total_spend DESC) AS decile
FROM customer_totals;
-- decile 1 = top 10% spenders
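NTILE's bucketing can be sketched the same way; with eight invented salaries and four buckets, each quartile holds exactly two rows.

```python
import sqlite3

# Hypothetical data: eight evenly spaced salaries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INT);
    INSERT INTO employees VALUES
        ('a',10),('b',20),('c',30),('d',40),
        ('e',50),('f',60),('g',70),('h',80);
""")
quartiles = conn.execute("""
    SELECT name, NTILE(4) OVER (ORDER BY salary) AS q
    FROM employees
    ORDER BY salary
""").fetchall()
# 8 rows / 4 buckets = 2 rows per quartile:
# 'a' and 'b' land in quartile 1, 'g' and 'h' in quartile 4.
```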
The frame clause defines the subset of rows within a partition
that the window function considers for each row. Syntax:
ROWS|RANGE BETWEEN start AND end.
-- 7-day moving average
SELECT
order_date,
revenue,
AVG(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_revenue;
-- Common frame boundaries:
-- UNBOUNDED PRECEDING → start of partition
-- N PRECEDING → N rows before current
-- CURRENT ROW → current row
-- N FOLLOWING → N rows after current
-- UNBOUNDED FOLLOWING → end of partition
-- ROWS vs RANGE:
-- ROWS: physical row offset (exact N rows)
-- RANGE: logical value range (rows with same ORDER BY value)
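A ROWS frame is easy to verify with a tiny invented revenue table — here a 2-day moving average so the expected values are obvious (SQLite supports frame clauses since 3.25).

```python
import sqlite3

# Hypothetical data: three days of revenue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (order_date TEXT, revenue REAL);
    INSERT INTO daily_revenue VALUES
        ('2024-01-01', 100), ('2024-01-02', 200), ('2024-01-03', 300);
""")
rows = conn.execute("""
    SELECT order_date,
           AVG(revenue) OVER (
               ORDER BY order_date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS ma2
    FROM daily_revenue
    ORDER BY order_date
""").fetchall()
# day 1: avg(100) = 100; day 2: avg(100,200) = 150; day 3: avg(200,300) = 250
```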
-- FIRST_VALUE: value of specified column from first row in window
-- LAST_VALUE: value from last row in window (requires explicit frame)
SELECT name, dept, salary,
FIRST_VALUE(salary) OVER (
PARTITION BY dept ORDER BY salary DESC
) AS top_salary_in_dept,
LAST_VALUE(salary) OVER (
PARTITION BY dept ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
-- IMPORTANT: LAST_VALUE needs full frame, else defaults to CURRENT ROW
) AS bottom_salary_in_dept
FROM employees;
-- Top 3 highest-paid employees per department
SELECT name, dept, salary FROM (
SELECT name, dept, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3;
-- Use RANK() instead of ROW_NUMBER() if you want ties included:
SELECT name, dept, salary FROM (
SELECT name, dept, salary,
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
FROM employees
) ranked
WHERE rnk <= 3;
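The top-N-per-group pattern can be exercised end to end with invented data — here top 2 per department so the expected result fits in one line.

```python
import sqlite3

# Hypothetical data: three Engineering salaries, two Sales salaries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INT);
    INSERT INTO employees VALUES
        ('a','Eng',100),('b','Eng',90),('c','Eng',80),
        ('d','Sales',70),('e','Sales',60);
""")
top2 = conn.execute("""
    SELECT name, dept FROM (
        SELECT name, dept,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM employees
    ) WHERE rn <= 2
    ORDER BY dept, name
""").fetchall()
# 'c' (3rd in Eng) is filtered out; everyone else survives.
```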
PERCENT_RANK() computes a row's relative rank as
(rank - 1) / (total_rows - 1): it returns 0 for the first row and 1 for
the last. CUME_DIST() returns the fraction of rows whose value is less
than or equal to the current row's.
SELECT name, salary,
PERCENT_RANK() OVER (ORDER BY salary) AS pct_rank,
CUME_DIST() OVER (ORDER BY salary) AS cume_dist
FROM employees;
-- pct_rank: 0 = lowest, 1 = highest
-- cume_dist = 0.75 means 75% of employees earn this salary or less
WITH yearly_sales AS (
SELECT YEAR(order_date) AS yr, SUM(amount) AS total
FROM orders
GROUP BY YEAR(order_date)
)
SELECT
yr,
total,
LAG(total) OVER (ORDER BY yr) AS prev_year,
ROUND(
100.0 * (total - LAG(total) OVER (ORDER BY yr))
/ NULLIF(LAG(total) OVER (ORDER BY yr), 0),
2
) AS yoy_pct_change
FROM yearly_sales
ORDER BY yr;
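The same LAG + NULLIF year-over-year calculation runs unchanged in SQLite; the yearly totals below are invented so the percentages are easy to verify by hand.

```python
import sqlite3

# Hypothetical data: three years of totals.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yearly_sales (yr INT, total REAL);
    INSERT INTO yearly_sales VALUES (2022, 100.0), (2023, 150.0), (2024, 120.0);
""")
rows = conn.execute("""
    SELECT yr,
           ROUND(100.0 * (total - LAG(total) OVER (ORDER BY yr))
                 / NULLIF(LAG(total) OVER (ORDER BY yr), 0), 2) AS yoy
    FROM yearly_sales
    ORDER BY yr
""").fetchall()
# 2022: NULL (no prior year); 2023: +50%; 2024: -20%
```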
PARTITION BY divides the rows into groups
(partitions) before the window function is applied — analogous to
GROUP BY but without collapsing rows. Each partition is calculated
independently.
-- Without PARTITION BY: one global window
SELECT name, salary,
AVG(salary) OVER () AS company_avg
FROM employees;
-- With PARTITION BY: one window per department
SELECT name, dept, salary,
AVG(salary) OVER (PARTITION BY dept) AS dept_avg,
salary - AVG(salary) OVER (PARTITION BY dept) AS diff_from_avg
FROM employees;
A view is a named, stored SELECT query that behaves like a virtual table. It does not store data (by default).
CREATE VIEW active_customers AS
SELECT id, name, email, city
FROM customers
WHERE is_active = 1;
-- Use like a table:
SELECT * FROM active_customers WHERE city = 'Mumbai';
-- Update view:
CREATE OR REPLACE VIEW active_customers AS ...;
DROP VIEW active_customers;
A materialized view actually stores the result set on disk. It acts like a cache — queries hit pre-computed data. Must be refreshed when underlying data changes.
-- PostgreSQL
CREATE MATERIALIZED VIEW dept_summary AS
SELECT dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees GROUP BY dept;
-- Query hits stored data (very fast):
SELECT * FROM dept_summary;
-- Refresh on demand:
REFRESH MATERIALIZED VIEW dept_summary;
-- Concurrent refresh (no locking):
REFRESH MATERIALIZED VIEW CONCURRENTLY dept_summary;
-- Use case: expensive aggregations, reporting tables,
-- dashboards queried frequently with infrequent data changes.
A stored procedure is invoked with CALL and cannot be used inside a SELECT; a function returns a value and can be called inside SELECT, WHERE, and other expressions.
-- Stored Procedure (MySQL)
DELIMITER //
CREATE PROCEDURE give_raise(IN dept_name VARCHAR(50), IN pct DECIMAL(5,2))
BEGIN
UPDATE employees SET salary = salary * (1 + pct/100)
WHERE dept = dept_name;
END //
DELIMITER ;
CALL give_raise('Engineering', 10.0);
-- Function
CREATE FUNCTION full_name(first VARCHAR(50), last VARCHAR(50))
RETURNS VARCHAR(100) DETERMINISTIC
RETURN CONCAT(first, ' ', last);
SELECT full_name(first_name, last_name) FROM employees;
A trigger is a stored procedure that automatically executes in response to a specified DML event (INSERT, UPDATE, DELETE) on a table, either BEFORE or AFTER the event.
-- Audit trigger: log every salary change
CREATE TABLE salary_audit (
id INT AUTO_INCREMENT PRIMARY KEY,
emp_id INT, old_salary DECIMAL(10,2), new_salary DECIMAL(10,2),
changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER trg_salary_audit
AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
IF OLD.salary <> NEW.salary THEN
INSERT INTO salary_audit(emp_id, old_salary, new_salary)
VALUES (NEW.id, OLD.salary, NEW.salary);
END IF;
END;
-- Trigger types: BEFORE INSERT, AFTER INSERT,
-- BEFORE UPDATE, AFTER UPDATE,
-- BEFORE DELETE, AFTER DELETE
A cursor allows row-by-row processing of a result set inside a stored procedure or function. Cursors are slow and should be replaced by set-based operations wherever possible.
DECLARE done INT DEFAULT 0;
DECLARE emp_id INT;
DECLARE emp_name VARCHAR(100);
DECLARE cur CURSOR FOR SELECT id, name FROM employees WHERE salary < 30000;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
OPEN cur;
read_loop: LOOP
FETCH cur INTO emp_id, emp_name;
IF done THEN LEAVE read_loop; END IF;
-- process each row...
UPDATE employees SET salary = salary * 1.05 WHERE id = emp_id;
END LOOP;
CLOSE cur;
Prefer the set-based equivalent:
UPDATE employees SET salary = salary * 1.05 WHERE salary < 30000; —
one statement achieves the same result without row-by-row looping.
INSTEAD OF triggers (SQL Server, Oracle) fire in place of the triggering DML statement, allowing you to make otherwise non-updatable views (those with JOINs or aggregates) updatable by redirecting the operation to the underlying tables.
-- SQL Server example: make a JOIN view updatable
CREATE VIEW emp_dept_view AS
SELECT e.id, e.name, d.name AS dept_name
FROM employees e JOIN departments d ON e.dept_id = d.id;
-- Without INSTEAD OF, UPDATE on this view fails.
CREATE TRIGGER trg_instead_update
ON emp_dept_view
INSTEAD OF UPDATE
AS
BEGIN
UPDATE employees
SET name = i.name
FROM inserted i
WHERE employees.id = i.id;
END;
-- MySQL: AUTO_INCREMENT
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100)
);
INSERT INTO users(name) VALUES ('Alice');
SELECT LAST_INSERT_ID(); -- returns generated id
-- PostgreSQL: SERIAL / SEQUENCE
CREATE TABLE users (
id SERIAL PRIMARY KEY, -- creates a sequence automatically
name VARCHAR(100)
);
SELECT nextval('users_id_seq'); -- advance sequence and return new value
SELECT currval('users_id_seq'); -- last value nextval returned in this session
-- PostgreSQL modern (SQL standard):
CREATE TABLE users (id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY);
A view is updatable when it references a single base table,
has no DISTINCT, no aggregates, no GROUP BY, no subqueries in SELECT.
WITH CHECK OPTION prevents INSERT/UPDATE through the view that would
make the row invisible to the view.
CREATE VIEW active_users AS
SELECT id, name, email FROM users WHERE is_active = 1
WITH CHECK OPTION;
-- OK only if the resulting base-table row satisfies the view's WHERE
INSERT INTO active_users VALUES (100, 'Bob', 'bob@x.com');
-- Here the base table must default is_active to 1;
-- otherwise WITH CHECK OPTION rejects the insert.
-- REJECTED: row would not be visible through the view
UPDATE active_users SET is_active = 0 WHERE id = 1;
-- CHECK OPTION prevents making the row disappear from view.
-- Scalar UDF (PostgreSQL)
CREATE OR REPLACE FUNCTION tax_amount(price DECIMAL, rate DECIMAL)
RETURNS DECIMAL AS $$
SELECT ROUND(price * rate / 100, 2);
$$ LANGUAGE sql IMMUTABLE;
SELECT product, price, tax_amount(price, 18) FROM products;
-- Table-valued function (SQL Server)
CREATE FUNCTION dbo.GetDeptEmployees(@dept NVARCHAR(50))
RETURNS TABLE AS RETURN (
SELECT id, name, salary FROM employees WHERE dept = @dept
);
SELECT * FROM dbo.GetDeptEmployees('Sales');
A transaction groups multiple statements into a single atomic unit: either all of them take effect or none do.
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
UPDATE accounts SET balance = balance + 500 WHERE id = 2;
COMMIT; -- both succeed, or ROLLBACK on error → Atomicity
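Atomicity is observable from application code: roll back mid-transfer and the partial debit never becomes visible. A minimal sketch with sqlite3 (balances invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # take manual control of transactions
conn.executescript("""
    CREATE TABLE accounts (id INT PRIMARY KEY, balance INT);
    INSERT INTO accounts VALUES (1, 1000), (2, 1000);
""")
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")
conn.execute("ROLLBACK")  # simulate a failure before the credit step
balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
# balance is back to 1000 — the debit was undone as a unit
```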
Isolation levels control which anomalies concurrent transactions can see:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT ... ; -- reads a consistent snapshot throughout
COMMIT;
A deadlock occurs when two or more transactions each hold a lock and are each waiting for a lock held by the other, creating a cycle. The database detects this and kills one transaction (victim).
-- T1: locks row A, then tries to lock row B
-- T2: locks row B, then tries to lock row A → DEADLOCK
-- Prevention strategies:
-- 1. Always access tables/rows in the SAME ORDER in all transactions
-- 2. Keep transactions short — reduce lock hold time
-- 3. Use lower isolation levels (READ COMMITTED) where safe
-- 4. Use SELECT ... FOR UPDATE SKIP LOCKED (skip contended rows)
-- 5. Implement optimistic locking (version column check)
-- Retry logic on deadlock error code (MySQL: 1213)
-- is essential in application code.
A SAVEPOINT marks a point within a transaction to which you can roll back without rolling back the entire transaction. Useful for complex multi-step operations.
BEGIN;
INSERT INTO orders(...) VALUES (...);
SAVEPOINT after_order;
INSERT INTO order_items(...) VALUES (...);
-- Something goes wrong with items:
ROLLBACK TO SAVEPOINT after_order;
-- Order is preserved, items are rolled back
-- Try again or handle error...
INSERT INTO order_items(...) VALUES (...); -- retry
COMMIT;
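SQLite supports SAVEPOINT / ROLLBACK TO as well, so the partial-rollback behaviour can be sketched directly (table and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manual transaction control
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("BEGIN")
conn.execute("INSERT INTO t VALUES ('order')")
conn.execute("SAVEPOINT after_order")
conn.execute("INSERT INTO t VALUES ('bad item')")
conn.execute("ROLLBACK TO SAVEPOINT after_order")  # undoes only the item
conn.execute("INSERT INTO t VALUES ('good item')")  # retry
conn.execute("COMMIT")
rows = [r[0] for r in conn.execute("SELECT v FROM t ORDER BY v")]
# 'order' survived the partial rollback; 'bad item' did not
```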
SELECT ... FOR UPDATE acquires an exclusive lock
on the selected rows, preventing other transactions from updating or locking those
rows until the current transaction commits or rolls back. Used to implement
pessimistic locking.
-- Reserve a seat: lock the row before updating
BEGIN;
SELECT * FROM seats
WHERE seat_id = 42 AND status = 'available'
FOR UPDATE; -- acquires row lock
-- If another transaction tries the same query, it waits.
UPDATE seats SET status = 'booked', user_id = 99
WHERE seat_id = 42;
COMMIT;
-- SKIP LOCKED: skip already-locked rows (task queue pattern)
SELECT id, task FROM job_queue
WHERE status = 'pending'
LIMIT 1
FOR UPDATE SKIP LOCKED;
SQL injection is an attack where malicious SQL is injected into user input that gets executed by the database. Prevention:
-- VULNERABLE: string concatenation
query = "SELECT * FROM users WHERE email = '" + user_input + "'";
-- Input: ' OR '1'='1 → returns ALL users!
-- SAFE: parameterized queries / prepared statements
-- Python (psycopg2)
cursor.execute("SELECT * FROM users WHERE email = %s", (user_email,))
-- Java (JDBC)
PreparedStatement ps = conn.prepareStatement(
"SELECT * FROM users WHERE email = ?");
ps.setString(1, userEmail);
-- Other defenses:
-- Stored procedures with parameters
-- Input validation and allowlisting
-- Least-privilege DB accounts
-- Web Application Firewalls (WAF)
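The difference between concatenation and parameter binding can be demonstrated in a few lines with sqlite3 (which uses `?` placeholders); the users table is invented. The concatenated query is shown only to make the leak visible — never write queries this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (email TEXT);
    INSERT INTO users VALUES ('alice@x.com'), ('bob@x.com');
""")
evil = "' OR '1'='1"

# VULNERABLE: the payload rewrites the WHERE clause → every row leaks
leaked = conn.execute(
    "SELECT * FROM users WHERE email = '" + evil + "'").fetchall()

# SAFE: the driver binds the payload as a literal string → no match
safe = conn.execute(
    "SELECT * FROM users WHERE email = ?", (evil,)).fetchall()
```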
-- PostgreSQL: JSON / JSONB (binary, indexed)
CREATE TABLE events (id SERIAL, data JSONB);
INSERT INTO events(data) VALUES
('{"user_id": 1, "action": "click", "tags": ["web", "mobile"]}');
-- Access operators
SELECT data->>'user_id' FROM events; -- text
SELECT data->'tags'->>0 FROM events; -- first tag
SELECT data @> '{"action":"click"}' FROM events; -- contains
-- Index JSONB
CREATE INDEX idx_gin ON events USING GIN (data);
-- MySQL JSON
SELECT JSON_EXTRACT(data, '$.user_id') FROM events;
SELECT data->>'$.action' FROM events; -- shorthand
UPDATE events SET data = JSON_SET(data, '$.status', 'done');
Full-text search enables searching natural language in text
columns, with features like stemming, stop words, and relevance ranking — much more
powerful than LIKE '%word%'.
-- MySQL Full-Text Search
CREATE TABLE articles (
id INT PRIMARY KEY, title TEXT, body TEXT,
FULLTEXT (title, body)
);
-- Natural language mode
SELECT *, MATCH(title, body) AGAINST('database indexing') AS score
FROM articles
WHERE MATCH(title, body) AGAINST('database indexing')
ORDER BY score DESC;
-- Boolean mode
WHERE MATCH(title,body) AGAINST('+SQL -NoSQL' IN BOOLEAN MODE);
-- PostgreSQL: tsvector / tsquery
SELECT title FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('database & index');
CREATE INDEX idx_fts ON articles USING GIN(to_tsvector('english', body));
Gaps-and-islands problems involve finding consecutive sequences ("islands") and breaks ("gaps") in ordered data — e.g., consecutive login days, contiguous date ranges of availability.
-- Find consecutive login day ranges per user
-- Classic solution: subtract row_number from date to get group_id
WITH ranked AS (
SELECT user_id, login_date,
login_date - ROW_NUMBER() OVER (PARTITION BY user_id
ORDER BY login_date) * INTERVAL '1 day'
AS grp
FROM (SELECT DISTINCT user_id, login_date FROM logins) t
)
SELECT user_id,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS streak_length
FROM ranked
GROUP BY user_id, grp
ORDER BY user_id, streak_start;
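The same islands trick works in SQLite if `julianday()` stands in for date arithmetic (the login dates below are invented): consecutive days minus an increasing row number collapse to one constant group key per streak.

```python
import sqlite3

# Hypothetical data: one user with a 3-day streak and a 2-day streak.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logins (user_id INT, login_date TEXT);
    INSERT INTO logins VALUES
        (1,'2024-01-01'),(1,'2024-01-02'),(1,'2024-01-03'),
        (1,'2024-01-06'),(1,'2024-01-07');
""")
streaks = conn.execute("""
    WITH ranked AS (
        SELECT user_id, login_date,
               julianday(login_date)
                 - ROW_NUMBER() OVER (PARTITION BY user_id
                                      ORDER BY login_date) AS grp
        FROM (SELECT DISTINCT user_id, login_date FROM logins)
    )
    SELECT MIN(login_date), MAX(login_date), COUNT(*)
    FROM ranked
    GROUP BY user_id, grp
    ORDER BY 1
""").fetchall()
# two islands: Jan 1-3 (length 3) and Jan 6-7 (length 2)
```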
System-versioned temporal tables (SQL:2011 standard) automatically track the full history of row changes. Each row records its validity period. Useful for audit trails and point-in-time queries.
-- SQL Server
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
salary DECIMAL(10,2),
valid_from DATETIME2 GENERATED ALWAYS AS ROW START,
valid_to DATETIME2 GENERATED ALWAYS AS ROW END,
PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.employees_history));
-- Query historical state at a point in time:
SELECT * FROM employees
FOR SYSTEM_TIME AS OF '2023-06-01 12:00:00';
-- Query all changes:
SELECT * FROM employees
FOR SYSTEM_TIME ALL
WHERE id = 42 ORDER BY valid_from;
COALESCE returns the first non-NULL argument in its list; NULLIF(a, b) returns NULL when a equals b.
SELECT COALESCE(phone, mobile, 'No contact') FROM users;
-- Avoid division by zero:
SELECT total / NULLIF(count, 0) AS avg_value FROM stats;
-- NULLIF returns NULL when count=0 → result is NULL, not error
SELECT IFNULL(salary, 0) FROM employees; -- MySQL
CASE is SQL's conditional expression, similar to
if-else. Two forms: simple (compare one value) and searched (arbitrary boolean
conditions).
-- Searched CASE (most flexible)
SELECT name, salary,
CASE
WHEN salary >= 100000 THEN 'Senior'
WHEN salary >= 60000 THEN 'Mid'
ELSE 'Junior'
END AS level
FROM employees;
-- Simple CASE (equality match)
SELECT order_id,
CASE status
WHEN 'P' THEN 'Pending'
WHEN 'S' THEN 'Shipped'
WHEN 'D' THEN 'Delivered'
ELSE 'Unknown'
END AS status_label
FROM orders;
-- CASE inside aggregate (conditional count)
SELECT
COUNT(CASE WHEN gender = 'M' THEN 1 END) AS male_count,
COUNT(CASE WHEN gender = 'F' THEN 1 END) AS female_count
FROM employees;
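The conditional-count idiom relies on COUNT ignoring the NULL that CASE produces when no WHEN matches; a quick sqlite3 check with invented rows:

```python
import sqlite3

# Hypothetical data: three 'M' rows, one 'F' row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, gender TEXT);
    INSERT INTO employees VALUES ('a','M'),('b','F'),('c','M'),('d','M');
""")
m, f = conn.execute("""
    SELECT COUNT(CASE WHEN gender = 'M' THEN 1 END),
           COUNT(CASE WHEN gender = 'F' THEN 1 END)
    FROM employees
""").fetchone()
# CASE yields NULL for non-matching rows, and COUNT skips NULLs
```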
-- MySQL
SELECT NOW(), CURDATE(), CURTIME();
SELECT YEAR(order_date), MONTH(order_date), DAY(order_date)
FROM orders;
SELECT DATEDIFF('2024-12-31', '2024-01-01'); -- 365
SELECT DATE_ADD(order_date, INTERVAL 30 DAY);
SELECT DATE_FORMAT(order_date, '%Y-%m'); -- '2024-03'
-- PostgreSQL
SELECT NOW(), CURRENT_DATE, CURRENT_TIME;
SELECT DATE_PART('year', order_date) AS yr;
SELECT order_date + INTERVAL '30 days';
SELECT TO_CHAR(order_date, 'YYYY-MM');
SELECT AGE(CURRENT_DATE, birth_date); -- interval
-- SQL Server
SELECT GETDATE(), YEAR(order_date), DATEADD(day, 30, order_date);
SELECT DATEDIFF(day, start_date, end_date);
SELECT FORMAT(order_date, 'yyyy-MM');
STRING_AGG (PostgreSQL, SQL Server) and GROUP_CONCAT (MySQL) aggregate string values from multiple rows into a single concatenated string within a group.
-- PostgreSQL / SQL Server: STRING_AGG
SELECT dept,
STRING_AGG(name, ', ' ORDER BY name) AS employees_list
FROM employees
GROUP BY dept;
-- Engineering: 'Alice, Bob, Charlie'
-- Sales: 'David, Eve'
-- MySQL: GROUP_CONCAT
SELECT dept,
GROUP_CONCAT(name ORDER BY name SEPARATOR ', ') AS employees_list
FROM employees
GROUP BY dept;
-- Useful for: tag lists, comma-separated IDs,
-- report-style aggregated columns.
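SQLite also ships a GROUP_CONCAT aggregate (two-argument form for a custom separator), so the pattern can be sketched with invented rows; note that without an explicit ORDER BY support (added only in recent SQLite versions) the concatenation order is not guaranteed.

```python
import sqlite3

# Hypothetical data: two Engineering employees, one in Sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (dept TEXT, name TEXT);
    INSERT INTO employees VALUES
        ('Eng','Bob'),('Eng','Alice'),('Sales','Eve');
""")
rows = conn.execute("""
    SELECT dept, GROUP_CONCAT(name, ', ')
    FROM employees
    GROUP BY dept
    ORDER BY dept
""").fetchall()
# one concatenated string per department, e.g. 'Bob, Alice' for Eng
```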
OLTP workloads are many short transactional queries (single-row lookups, writes); OLAP workloads are large analytical scans and aggregations, typically backed by columnar storage.
-- OLTP query: single row lookup
SELECT * FROM orders WHERE order_id = 100042; -- ms latency
-- OLAP query: full scan aggregation
SELECT region, product_category, SUM(revenue)
FROM fact_sales JOIN dim_date USING(date_id)
WHERE year = 2024
GROUP BY ROLLUP(region, product_category);
-- May scan billions of rows — columnar storage is key.