Indexing¶
An index is a data structure (typically B-tree or B+ tree) that speeds up data retrieval at the cost of extra storage and slower writes. Clustered index defines physical row order (one per table). Non-clustered index has a separate structure pointing to rows. Know when to index (high-cardinality columns in WHERE/JOIN/ORDER BY) and when NOT to (small tables, frequently updated columns, low-cardinality columns).
Key Concepts¶
Deep Dive: How Indexes Work (B+ Tree)
Without index: Full table scan — O(n). With index: B+ tree lookup — O(log n).
B+ Tree Index on 'email' column:
[M]
/ \
[D, H] [P, T]
/ | \ / | \
[A-C][D-G][H-L][M-O][P-S][T-Z] ← Leaf nodes (sorted, linked)
Each points to actual rows
Key properties: - Leaf nodes contain actual data pointers and are linked for range queries - All leaves at the same depth → consistent O(log n) lookups - Each node can hold many keys → fewer disk reads
Deep Dive: Types of Indexes
Clustered Index: - Defines the physical order of data in the table - Only one per table (usually the primary key) - Table IS the index
Non-Clustered Index: - Separate structure with pointers to rows - Multiple per table - Extra lookup to get actual data (unless covering index)
Composite Index:
CREATE INDEX idx_name_dept ON employees (department, name);
-- Useful for: WHERE department = 'Engineering' AND name = 'John'
-- Useful for: WHERE department = 'Engineering' (leftmost prefix)
-- NOT useful for: WHERE name = 'John' alone (not leftmost)
(A, B, C) can serve queries on (A), (A, B), or (A, B, C), but NOT (B) or (C) alone.
Unique Index:
Covering Index:
-- Index includes all columns needed by query → no table lookup
CREATE INDEX idx_covering ON orders (user_id, total, order_date);
-- This query is "covered":
SELECT total, order_date FROM orders WHERE user_id = 1;
Partial (Filtered) Index:
Deep Dive: When to Index (and When Not To)
Index these:
- Primary key (auto-indexed)
- Foreign keys (used in JOINs)
- Columns in WHERE clauses
- Columns in ORDER BY / GROUP BY
- High-cardinality columns (many distinct values)
Don't index: - Small tables (full scan is faster) - Columns rarely used in queries - Low-cardinality columns (boolean, status with 2-3 values) - Frequently updated columns (index rebuild cost) - Tables with heavy INSERT/UPDATE/DELETE workload
Costs of indexes: - Extra storage (indexes can be 10-20% of table size) - Slower writes (must update indexes on INSERT/UPDATE/DELETE) - Maintenance overhead
Deep Dive: EXPLAIN / Query Plan
Use EXPLAIN to see how the database executes a query:
Key things to look for: - Seq Scan — full table scan (bad for large tables) - Index Scan — using an index (good) - Index Only Scan — covering index (best) - Bitmap Index Scan — multiple index results combined - Nested Loop / Hash Join / Merge Join — join strategies
Red flags: - Sequential scan on a large table - High estimated rows vs actual rows - Sort operations on unindexed columns
Common Interview Questions
- What is an index? How does it improve query performance?
- What is a B+ tree? Why is it used for indexes?
- What is the difference between clustered and non-clustered indexes?
- What is a composite index? Explain the leftmost prefix rule.
- What is a covering index?
- When should you NOT add an index?
- How do indexes affect INSERT/UPDATE/DELETE performance?
- How do you identify slow queries?
- What does EXPLAIN do?
- What is index selectivity / cardinality?