Indexing¶

An index is a data structure (typically B-tree or B+ tree) that speeds up data retrieval at the cost of extra storage and slower writes. Clustered index defines physical row order (one per table). Non-clustered index has a separate structure pointing to rows. Know when to index (high-cardinality columns in WHERE/JOIN/ORDER BY) and when NOT to (small tables, frequently updated columns, low-cardinality columns).

Key Concepts¶

Deep Dive: How Indexes Work (B+ Tree)

Without index: Full table scan — O(n). With index: B+ tree lookup — O(log n).

B+ Tree Index on 'email' column:

          [M]
       /       \
    [D, H]    [P, T]
   / | \     / | \
[A-C][D-G][H-L][M-O][P-S][T-Z]  ← Leaf nodes (sorted, linked)
                                   Each points to actual rows

Key properties: - Leaf nodes contain actual data pointers and are linked for range queries - All leaves at the same depth → consistent O(log n) lookups - Each node can hold many keys → fewer disk reads

Deep Dive: Types of Indexes

Clustered Index: - Defines the physical order of data in the table - Only one per table (usually the primary key) - Table IS the index

Non-Clustered Index: - Separate structure with pointers to rows - Multiple per table - Extra lookup to get actual data (unless covering index)

Composite Index:

CREATE INDEX idx_name_dept ON employees (department, name);
-- Useful for: WHERE department = 'Engineering' AND name = 'John'
-- Useful for: WHERE department = 'Engineering' (leftmost prefix)
-- NOT useful for: WHERE name = 'John' alone (not leftmost)

Leftmost prefix rule: A composite index (A, B, C) can serve queries on (A), (A, B), or (A, B, C), but NOT (B) or (C) alone.

Unique Index:

CREATE UNIQUE INDEX idx_email ON users (email);
-- Enforces uniqueness + speeds up lookups

Covering Index:

-- Index includes all columns needed by query → no table lookup
CREATE INDEX idx_covering ON orders (user_id, total, order_date);
-- This query is "covered":
SELECT total, order_date FROM orders WHERE user_id = 1;

Partial (Filtered) Index:

CREATE INDEX idx_active ON users (email) WHERE status = 'ACTIVE';
-- Smaller index, only for active users

Deep Dive: When to Index (and When Not To)

Index these: - Primary key (auto-indexed) - Foreign keys (used in JOINs) - Columns in WHERE clauses - Columns in ORDER BY / GROUP BY - High-cardinality columns (many distinct values)

Don't index: - Small tables (full scan is faster) - Columns rarely used in queries - Low-cardinality columns (boolean, status with 2-3 values) - Frequently updated columns (index rebuild cost) - Tables with heavy INSERT/UPDATE/DELETE workload

Costs of indexes: - Extra storage (indexes can be 10-20% of table size) - Slower writes (must update indexes on INSERT/UPDATE/DELETE) - Maintenance overhead

Deep Dive: EXPLAIN / Query Plan

Use EXPLAIN to see how the database executes a query:

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'john@test.com';

Key things to look for: - Seq Scan — full table scan (bad for large tables) - Index Scan — using an index (good) - Index Only Scan — covering index (best) - Bitmap Index Scan — multiple index results combined - Nested Loop / Hash Join / Merge Join — join strategies

Red flags: - Sequential scan on a large table - High estimated rows vs actual rows - Sort operations on unindexed columns

Common Interview Questions

What is an index? How does it improve query performance?
What is a B+ tree? Why is it used for indexes?
What is the difference between clustered and non-clustered indexes?
What is a composite index? Explain the leftmost prefix rule.
What is a covering index?
When should you NOT add an index?
How do indexes affect INSERT/UPDATE/DELETE performance?
How do you identify slow queries?
What does EXPLAIN do?
What is index selectivity / cardinality?