Other Key Data Structures
This chapter delves into two fundamental data structures: hash tables and graph representations. A thorough understanding of their principles, implementation details, and complexity analysis is critical for designing efficient algorithms and is a recurring focus in CMI examinations.
---
Chapter Contents
| # | Topic |
|---|-------|
| 1 | Hash Tables |
| 2 | Graph Representations |

---
We begin with Hash Tables.
Part 1: Hash Tables
Hash tables are fundamental data structures that store key-value pairs, providing efficient average-case performance for insertion, deletion, and search operations. We use hash functions to map keys to indices within an array, enabling direct access.
---
Core Concepts
1. Hash Functions
A hash function $h$ maps a key $k$ to an index $h(k)$ in a hash table. A good hash function is deterministic, efficiently computable, and distributes keys uniformly across the table to minimize collisions.
Worked Example: Division Method
Consider a hash table of size $m = 7$. We want to insert keys $10, 22, 31, 4, 15$ using $h(k) = k \bmod 7$.
Step 1: Calculate hash for key $10$.
> $h(10) = 10 \bmod 7 = 3$
Step 2: Calculate hash for key $22$.
> $h(22) = 22 \bmod 7 = 1$
Step 3: Calculate hash for key $31$.
> $h(31) = 31 \bmod 7 = 3$
Step 4: Calculate hash for key $4$.
> $h(4) = 4 \bmod 7 = 4$
Step 5: Calculate hash for key $15$.
> $h(15) = 15 \bmod 7 = 1$
Answer: The keys $10, 22, 31, 4, 15$ are mapped to indices $3, 1, 3, 4, 1$ respectively. Note that keys $10$ and $31$ share index $3$: a collision, which motivates the resolution strategies below.
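The division method can be sketched in a couple of lines of Python. The function name and the sample keys/table size here are illustrative choices for this sketch, not part of the method itself:

```python
def division_hash(key: int, m: int) -> int:
    """Map an integer key to a bucket index using the division method."""
    return key % m

# Indices for a handful of sample keys with a table of size 7.
example_indices = [division_hash(k, 7) for k in (10, 22, 31, 4, 15)]
```

Note that any two keys that differ by a multiple of $m$ land in the same bucket, which is why the choice of $m$ matters.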
:::question type="MCQ" question="Given a hash table of size $m = 7$, which of the following keys will collide with key $50$ using the division method hash function $h(k) = k \bmod 7$?" options=["22","23","46","59"] answer="22" hint="Calculate $k \bmod 7$ for each option and compare with $h(50)$." solution="Step 1: Calculate the hash for the given key $50$.
> $h(50) = 50 \bmod 7 = 1$
Step 2: Calculate the hash for each option.
> $h(22) = 22 \bmod 7 = 1$
> $h(23) = 23 \bmod 7 = 2$
> $h(46) = 46 \bmod 7 = 4$
> $h(59) = 59 \bmod 7 = 3$
Step 3: Identify the key that hashes to the same index as $50$. Key $22$ hashes to index $1$, which is the same as $h(50) = 1$, so $22$ collides with $50$.
"
:::
Worked Example: Multiplication Method
Consider a hash table of size $m = 10$. We want to insert key $k = 45$. Use $A = 0.618$ (close to $(\sqrt{5} - 1)/2$).
Step 1: Calculate $kA$.
> $45 \times 0.618 = 27.81$
Step 2: Calculate $kA \bmod 1$. This is the fractional part.
> $27.81 \bmod 1 = 0.81$
Step 3: Multiply by $m$ and take the floor.
> $\lfloor 10 \times 0.81 \rfloor = \lfloor 8.1 \rfloor = 8$
Answer: The key $45$ is mapped to index $8$.
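The steps above translate directly to code. A minimal Python sketch (function name ours; `A = 0.618` is the conventional choice near $(\sqrt{5}-1)/2$):

```python
import math

def multiplication_hash(key: int, m: int, A: float = 0.618) -> int:
    """Multiplication method: h(k) = floor(m * frac(k * A))."""
    frac = (key * A) % 1.0   # fractional part of k * A
    return math.floor(m * frac)
```

A nice property of this method is that $m$ need not be prime; the fractional part scrambles the key regardless of the table size.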
:::question type="NAT" question="Using the multiplication method with table size $m = 100$ and constant $A = 0.618$, what is the hash index for the key $k = 2024$?" answer="83" hint="Follow the formula $h(k) = \lfloor m \cdot (kA \bmod 1) \rfloor$. Calculate $kA$, then its fractional part, then multiply by $m$ and take the floor." solution="Step 1: Calculate $kA$.
> $2024 \times 0.618 = 1250.832$
Step 2: Calculate $kA \bmod 1$. This is the fractional part of $1250.832$.
> $1250.832 \bmod 1 = 0.832$
Step 3: Multiply by $m$ and take the floor.
> $\lfloor 100 \times 0.832 \rfloor = \lfloor 83.2 \rfloor = 83$
"
:::
---
2. Collision Resolution
When two different keys hash to the same index, a collision occurs. Collision resolution strategies determine how to handle these situations.
2.1. Separate Chaining
Each slot in the hash table points to a linked list (or other dynamic data structure) that contains all keys hashing to that slot.
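A minimal separate-chaining table can be sketched in Python using plain lists as the chains. The class name and the restriction to integer keys are simplifying assumptions for this sketch:

```python
class ChainedHashTable:
    """Minimal hash table with separate chaining; each slot is a Python list."""

    def __init__(self, size: int = 7):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key: int) -> int:
        return key % self.size           # division-method hash

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._index(key)]
        if key not in bucket:            # keys are unique; skip duplicates
            bucket.append(key)

    def contains(self, key: int) -> bool:
        # Search only the one chain the key can live in.
        return key in self.buckets[self._index(key)]
```

Search cost is proportional to the length of one chain, which is why a uniform hash function keeps operations close to constant time on average.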
Worked Example: Separate Chaining
Insert keys $16, 12, 23, 10, 4, 30$ into a hash table of size $m = 7$ using the division method $h(k) = k \bmod 7$.
Step 1: Insert $16$. $h(16) = 16 \bmod 7 = 2$. Place $16$ at index $2$.
Step 2: Insert $12$. $h(12) = 12 \bmod 7 = 5$. Place $12$ at index $5$.
Step 3: Insert $23$. $h(23) = 23 \bmod 7 = 2$. Collision at index $2$. Append $23$ to the linked list at index $2$.
Step 4: Insert $10$. $h(10) = 10 \bmod 7 = 3$. Place $10$ at index $3$.
Step 5: Insert $4$. $h(4) = 4 \bmod 7 = 4$. Place $4$ at index $4$.
Step 6: Insert $30$. $h(30) = 30 \bmod 7 = 2$. Collision at index $2$. Append $30$ to the linked list.
Final table:
Index 0: (empty)
Index 1: (empty)
Index 2: 16 → 23 → 30
Index 3: 10
Index 4: 4
Index 5: 12
Index 6: (empty)
Answer: The final state of the hash table with separate chaining is shown above; all keys colliding at index $2$ are chained in a single list.
:::question type="MCQ" question="A hash table of size $m = 4$ uses separate chaining and the hash function $h(k) = k \bmod 4$. If keys $2, 6, 10, 14, 18$ are inserted in that order, which bucket will contain the most elements?" options=["Bucket 0","Bucket 1","Bucket 2","Bucket 3"] answer="Bucket 2" hint="Calculate $h(k)$ for each key and observe which index accumulates the most." solution="Step 1: Calculate hash values for each key:
> $h(2) = 2 \bmod 4 = 2$
> $h(6) = 6 \bmod 4 = 2$
> $h(10) = 10 \bmod 4 = 2$
> $h(14) = 14 \bmod 4 = 2$
> $h(18) = 18 \bmod 4 = 2$
Step 2: All keys hash to Bucket 2.
Step 3: Therefore, Bucket 2 will contain all 5 elements, which is the most.
"
:::
---
2.2. Open Addressing
All elements are stored directly in the hash table itself. When a collision occurs, we systematically probe for an empty slot. The probe sequence is defined by a probe function $h(k, i)$, where $i$ is the probe number (starting from $0$).
2.2.1. Linear Probing
The probe sequence is linear: $h(k, i) = (h'(k) + i) \bmod m$, where $h'$ is an auxiliary hash function. This leads to primary clustering.
Worked Example: Linear Probing
Insert keys $19, 27, 37, 45, 17$ into a hash table of size $m = 10$ using $h'(k) = k \bmod 10$ and linear probing.
Step 1: Insert $19$. $h(19, 0) = 9$. Place at index $9$.
Step 2: Insert $27$. $h(27, 0) = 7$. Place at index $7$.
Step 3: Insert $37$. $h(37, 0) = 7$. Collision at index $7$.
Try $i = 1$: $h(37, 1) = 8$. Slot $8$ is empty. Place at index $8$.
Step 4: Insert $45$. $h(45, 0) = 5$. Slot $5$ is empty. Place at index $5$.
Step 5: Insert $17$. $h(17, 0) = 7$. Collision at index $7$.
Try $i = 1$: $h(17, 1) = 8$. Collision at index $8$.
Try $i = 2$: $h(17, 2) = 9$. Collision at index $9$.
Try $i = 3$: $h(17, 3) = 0$. Slot $0$ is empty. Place at index $0$.
Final table: index $0$: $17$, index $5$: $45$, index $7$: $27$, index $8$: $37$, index $9$: $19$.
Answer: The final state of the hash table using linear probing.
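The probing loop above is easy to mistranslate, so here is a short Python sketch (function name ours; `None` marks an empty slot, and the table is assumed not to be full on entry):

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert key into an open-addressing table using linear probing.

    Returns the index where the key was placed."""
    m = len(table)
    home = key % m                       # h'(k)
    for i in range(m):                   # probe h(k, i) = (h'(k) + i) mod m
        idx = (home + i) % m
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("hash table is full")
```

The wrap-around `% m` in the probe is the step most often forgotten; without it, keys hashing near the end of the table cannot be placed.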
:::question type="MCQ" question="A hash table of size $m = 10$ uses linear probing and the hash function $h'(k) = k \bmod 10$. If keys $13, 23, 33, 43$ are inserted in that order, what is the final position of key $43$?" options=["3","4","5","6"] answer="6" hint="Trace each insertion step. When a collision occurs, increment the probe counter $i$ and re-calculate the index using $h(k, i) = (h'(k) + i) \bmod 10$." solution="Step 1: Initialize an empty table of size 10.
Step 2: Insert $13$. $h(13, 0) = 3$. Position $3$ is empty.
Step 3: Insert $23$. $h(23, 0) = 3$. Position $3$ is occupied.
Probe $i = 1$: $h(23, 1) = 4$. Position $4$ is empty.
Step 4: Insert $33$. $h(33, 0) = 3$. Position $3$ is occupied.
Probe $i = 1$: $h(33, 1) = 4$. Position $4$ is occupied.
Probe $i = 2$: $h(33, 2) = 5$. Position $5$ is empty.
Step 5: Insert $43$. $h(43, 0) = 3$. Position $3$ is occupied.
Probe $i = 1$: $h(43, 1) = 4$. Position $4$ is occupied.
Probe $i = 2$: $h(43, 2) = 5$. Position $5$ is occupied.
Probe $i = 3$: $h(43, 3) = 6$. Position $6$ is empty.
Step 6: The final position of key $43$ is $6$.
"
:::
---
2.2.2. Quadratic Probing
The probe sequence is quadratic: $h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$. This helps to reduce primary clustering but can suffer from secondary clustering.
Worked Example: Quadratic Probing
Insert keys $22, 1, 12, 7, 23$ into a hash table of size $m = 11$ using $h'(k) = k \bmod 11$ and quadratic probing with $c_1 = 0, c_2 = 1$ (i.e., $h(k, i) = (h'(k) + i^2) \bmod 11$).
Step 1: Insert $22$. $h(22, 0) = 0$. Place at index $0$.
Step 2: Insert $1$. $h(1, 0) = 1$. Place at index $1$.
Step 3: Insert $12$. $h(12, 0) = 1$. Collision at index $1$.
Try $i = 1$: $h(12, 1) = (1 + 1) \bmod 11 = 2$. Slot $2$ is empty. Place at index $2$.
Step 4: Insert $7$. $h(7, 0) = 7$. Slot $7$ is empty. Place at index $7$.
Step 5: Insert $23$. $h(23, 0) = 1$. Collision at index $1$.
Try $i = 1$: $h(23, 1) = 2$. Collision at index $2$.
Try $i = 2$: $h(23, 2) = (1 + 4) \bmod 11 = 5$. Slot $5$ is empty. Place at index $5$.
Final table: index $0$: $22$, index $1$: $1$, index $2$: $12$, index $5$: $23$, index $7$: $7$.
Answer: The final state of the hash table using quadratic probing.
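Changing the probe step from $i$ to $i^2$ is a one-line difference in code. A sketch with $c_1 = 0, c_2 = 1$ (function name ours; `None` marks empty slots):

```python
def quadratic_probe_insert(table: list, key: int) -> int:
    """Insert key using quadratic probing h(k, i) = (k mod m + i^2) mod m."""
    m = len(table)
    home = key % m
    for i in range(m):
        idx = (home + i * i) % m         # quadratic step
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free slot found in probe sequence")
```

Unlike linear probing, this loop can fail to find an empty slot even when one exists; choosing a prime $m$ and keeping the load factor below $1/2$ guarantees the first $\lceil m/2 \rceil$ probes are distinct.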
:::question type="NAT" question="A hash table of size $m = 11$ (prime) uses quadratic probing with $c_1 = 0$ and $c_2 = 1$, and $h'(k) = k \bmod 11$. If keys $12, 23, 34$ are inserted in that order, what is the final position of key $34$?" answer="5" hint="Trace each insertion. For collisions, use $h(k, i) = (h'(k) + i^2) \bmod 11$." solution="Step 1: Initialize an empty table of size 11.
Step 2: Insert $12$. $h(12, 0) = 1$. Position $1$ is empty.
Step 3: Insert $23$. $h(23, 0) = 1$. Position $1$ is occupied.
Probe $i = 1$: $h(23, 1) = 2$. Position $2$ is empty.
Step 4: Insert $34$. $h(34, 0) = 1$. Position $1$ is occupied.
Probe $i = 1$: $h(34, 1) = 2$. Position $2$ is occupied.
Probe $i = 2$: $h(34, 2) = (1 + 4) \bmod 11 = 5$. Position $5$ is empty.
Step 5: The final position of key $34$ is $5$.
"
:::
---
2.2.3. Double Hashing
Uses two auxiliary hash functions, $h_1$ and $h_2$, to compute the probe sequence $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$. This effectively eliminates both primary and secondary clustering, since the step size itself depends on the key.
Worked Example: Double Hashing
Insert keys $15, 16, 22, 31, 12$ into a hash table of size $m = 7$ using $h_1(k) = k \bmod 7$ and $h_2(k) = 5 - (k \bmod 5)$.
Step 1: Insert $15$. $h_1(15) = 1$. Place at index $1$.
Step 2: Insert $16$. $h_1(16) = 2$. Place at index $2$.
Step 3: Insert $22$. $h_1(22) = 1$. Collision at index $1$.
Calculate $h_2(22) = 5 - (22 \bmod 5) = 5 - 2 = 3$.
Try $i = 1$: $(1 + 1 \cdot 3) \bmod 7 = 4$. Slot $4$ is empty. Place at index $4$.
Step 4: Insert $31$. $h_1(31) = 3$. Slot $3$ is empty. Place at index $3$.
Step 5: Insert $12$. $h_1(12) = 5$. Slot $5$ is empty. Place at index $5$.
Final table: index $1$: $15$, index $2$: $16$, index $3$: $31$, index $4$: $22$, index $5$: $12$.
Answer: The final state of the hash table using double hashing.
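A Python sketch of double hashing, using the same pair of hash functions as the worked example ($h_1(k) = k \bmod m$, $h_2(k) = 5 - (k \bmod 5)$, which is always in $\{1, \ldots, 5\}$ and hence never zero). The function name and the `None`-for-empty convention are ours:

```python
def double_hash_insert(table: list, key: int) -> int:
    """Insert key using double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m."""
    m = len(table)
    h1 = key % m
    h2 = 5 - (key % 5)                   # step size; never 0 by construction
    for i in range(m):
        idx = (h1 + i * h2) % m
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free slot found in probe sequence")
```

The crucial design rule is that $h_2(k)$ must never be $0$ (or the probe would never move) and should be relatively prime to $m$ so the sequence visits every slot.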
:::question type="MCQ" question="A hash table of size $m = 7$ uses double hashing with $h_1(k) = k \bmod 7$ and $h_2(k) = 5 - (k \bmod 5)$. If keys $10, 17, 24$ are inserted in that order, what is the final position of key $24$?" options=["0","1","2","4"] answer="4" hint="Calculate $h_1(k)$ and $h_2(k)$ for each key. For collisions, use $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod 7$." solution="Step 1: Initialize an empty table of size 7.
Step 2: Insert $10$.
$h_1(10) = 10 \bmod 7 = 3$. Position $3$ is empty.
Step 3: Insert $17$.
$h_1(17) = 3$. Position $3$ is occupied.
$h_2(17) = 5 - (17 \bmod 5) = 3$.
Probe $i = 1$: $(3 + 1 \cdot 3) \bmod 7 = 6$. Position $6$ is empty.
Step 4: Insert $24$.
$h_1(24) = 3$. Position $3$ is occupied.
$h_2(24) = 5 - (24 \bmod 5) = 1$.
Probe $i = 1$: $(3 + 1 \cdot 1) \bmod 7 = 4$. Position $4$ is empty.
Step 5: The final position of key $24$ is $4$.
"
:::
---
3. Load Factor
The load factor $\alpha = n/m$ of a hash table is the ratio of the number of stored elements $n$ to the table size $m$. It indicates how full the hash table is and directly impacts performance.
Worked Example: Load Factor and Resizing
A hash table of size $m = 10$ currently holds $n = 7$ elements. We want to insert two more elements. What is the load factor after insertion, and should the table be resized if the threshold is $0.75$?
Step 1: Calculate initial load factor.
> $\alpha = 7 / 10 = 0.7$
Step 2: Determine number of elements after insertion.
> $n = 7 + 2 = 9$
Step 3: Calculate new load factor.
> $\alpha = 9 / 10 = 0.9$
Step 4: Compare with threshold.
Since $0.9 > 0.75$, the table should be resized.
Answer: The new load factor is $0.9$. The table should be resized.
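The resize check is a one-line computation; a Python sketch (function names and the $0.75$ default threshold are our illustrative choices):

```python
def load_factor(n: int, m: int) -> float:
    """Load factor alpha = n / m for n elements in a table of size m."""
    return n / m

def needs_resize(n: int, m: int, threshold: float = 0.75) -> bool:
    """True when the load factor exceeds the resize threshold."""
    return load_factor(n, m) > threshold
```

In practice the resize itself roughly doubles $m$ (often to the next prime) and re-inserts every element under the new size.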
:::question type="NAT" question="A hash table with separate chaining has a size of $m = 13$. After inserting $n = 20$ elements, what is the load factor? (Provide answer as a decimal rounded to two places)." answer="1.54" hint="The load factor is $\alpha = n/m$. For separate chaining, it can be greater than 1." solution="Step 1: Identify the number of elements ($n = 20$) and table size ($m = 13$).
Step 2: Calculate the load factor $\alpha = n/m$.
> $\alpha = 20 / 13 = 1.5384\ldots$
Step 3: Round to two decimal places.
> $\alpha \approx 1.54$
"
:::
---
4. Universal Hashing
A universal family of hash functions ensures that for any two distinct keys, the probability of collision is at most $1/m$, where $m$ is the table size. This guarantees good average-case performance even against an adversary choosing keys.
A family $H$ of hash functions is universal if for any two distinct keys $x, y \in U$ (the universe of keys), the number of hash functions $h \in H$ for which $h(x) = h(y)$ is at most $|H|/m$.
Worked Example: Constructing a Universal Hash Function
Consider a prime number $p$ larger than any possible key, and a hash table of size $m$. A common universal family is $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$, where $a \in \{1, \ldots, p-1\}$ and $b \in \{0, \ldots, p-1\}$.
Step 1: Choose $p = 97$ (a prime number) and $m = 10$. We want to find a hash function for key $k = 45$.
Let's arbitrarily pick $a = 3, b = 4$.
Step 2: Calculate the inner modulo operation.
> $ak + b = 3 \times 45 + 4 = 139$
> $139 \bmod 97 = 42$
Step 3: Calculate the outer modulo operation.
> $42 \bmod 10 = 2$
Answer: The hash function $h_{3,4}$ maps key $45$ to index $2$.
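The family $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ can be sketched in Python by drawing $a$ and $b$ at random at construction time. The function name is ours, and $p$ is assumed to be a prime larger than every key:

```python
import random

def make_universal_hash(p: int, m: int, a=None, b=None):
    """Return h_{a,b}(k) = ((a*k + b) mod p) mod m.

    When a, b are not supplied, draw them uniformly at random,
    which is what gives the universal collision guarantee."""
    if a is None:
        a = random.randrange(1, p)       # a in {1, ..., p-1}
    if b is None:
        b = random.randrange(0, p)       # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m
```

The random draw is the whole point: an adversary who knows the family but not the chosen $(a, b)$ cannot construct a pathological key set in advance.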
:::question type="MCQ" question="Which of the following is a key characteristic of a universal hash family?" options=["It guarantees no collisions for any set of keys.","It ensures that the worst-case lookup time is $O(1)$.","It provides a probabilistic guarantee on collision frequency, independent of the input distribution.","It requires all keys to be prime numbers."] answer="It provides a probabilistic guarantee on collision frequency, independent of the input distribution." hint="Universal hashing aims to minimize the probability of collisions on average, not eliminate them. It provides resilience against malicious inputs." solution="Step 1: Analyze the properties of universal hashing.
- Universal hashing does not guarantee no collisions; it makes collisions unlikely on average.
- It does not guarantee $O(1)$ worst-case lookup time; in the worst case, all keys could still collide, leading to $O(n)$ behavior if not handled well.
- The core strength of universal hashing is that by randomly selecting a hash function from a universal family, the probability of collision for any two distinct keys is low, regardless of which keys are presented. This makes it robust against adversarial inputs.
- It does not require keys to be prime numbers.
Option C accurately describes the probabilistic guarantee that universal hashing offers, making it independent of the input distribution.
"
:::
---
Advanced Applications
Hash tables are widely used in various applications, including symbol tables in compilers, database indexing, caching systems, and implementing sets and maps.
Worked Example: Implementing a Frequency Counter
We want to count the frequency of words in a given text using a hash table. Assume a hash table of size $m = 10$ with separate chaining and $h(w) = (\sum_{c \in w} \text{ASCII}(c)) \bmod 10$, the sum of the word's character codes modulo the table size.
Text: "apple banana apple orange banana"
Step 1: Initialize an empty hash table.
Step 2: Process "apple".
$h(\text{apple}) = 530 \bmod 10 = 0$.
Insert "apple" with count 1 into bucket 0.
Table: `[ (apple, 1) ]`
Step 3: Process "banana".
$h(\text{banana}) = 609 \bmod 10 = 9$.
Insert "banana" with count 1 into bucket 9.
Table: `[ (apple, 1), ..., (banana, 1) ]`
Step 4: Process "apple".
$h(\text{apple}) = 0$. Collision at bucket 0. Search for "apple" in the list. Found. Increment count.
Table: `[ (apple, 2) ]`
Step 5: Process "orange".
$h(\text{orange}) = 636 \bmod 10 = 6$.
Insert "orange" with count 1 into bucket 6.
Table: `[ (apple, 2), ..., (orange, 1), ..., (banana, 1) ]`
Step 6: Process "banana".
$h(\text{banana}) = 9$. Collision at bucket 9. Search for "banana" in the list. Found. Increment count.
Table: `[ (apple, 2), ..., (orange, 1), ..., (banana, 2) ]`
Answer: The final frequency counts are: apple: 2, banana: 2, orange: 1.
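The frequency counter above fits in a few lines of Python. The function names are ours; the hash is the same character-code sum modulo the table size used in the worked example:

```python
def char_sum_hash(word: str, m: int = 10) -> int:
    """Hash a word by summing its ASCII codes modulo the table size."""
    return sum(ord(c) for c in word) % m

def word_frequencies(text: str, m: int = 10) -> list:
    """Count word frequencies using a chained hash table of m buckets."""
    buckets = [[] for _ in range(m)]
    for word in text.split():
        bucket = buckets[char_sum_hash(word, m)]
        for entry in bucket:
            if entry[0] == word:         # word already in this chain
                entry[1] += 1
                break
        else:                            # word not seen before
            bucket.append([word, 1])
    return buckets
```

In production code one would simply use a built-in dictionary, but spelling out the buckets makes the collision handling visible.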
:::question type="MSQ" question="A cache system uses a hash table to store frequently accessed data. Which of the following design choices would generally improve the performance of such a cache?" options=["Using a very small table size to save memory.","Employing a hash function that distributes keys uniformly.","Implementing separate chaining over open addressing for better worst-case performance.","Increasing the load factor significantly beyond 1.0 for open addressing." ] answer="Employing a hash function that distributes keys uniformly.,Implementing separate chaining over open addressing for better worst-case performance." hint="Consider the impact of each choice on collision rates, average access time, and memory usage." solution="Step 1: Evaluate each option.
- 'Using a very small table size to save memory.': A small table size increases the likelihood of collisions, leading to longer probe sequences or linked lists, thus degrading performance (higher average access time). So, this is incorrect.
- 'Employing a hash function that distributes keys uniformly.': A uniform hash function minimizes collisions, which is crucial for good hash table performance (closer to $O(1)$ average access time). This is correct.
- 'Implementing separate chaining over open addressing for better worst-case performance.': While open addressing can offer better cache locality, separate chaining generally has better worst-case performance for search, insert, and delete operations because it gracefully handles high load factors (can exceed 1.0) and avoids situations where an open addressing table becomes completely full or suffers from severe clustering leading to $O(n)$ operations. So, this is correct.
- 'Increasing the load factor significantly beyond 1.0 for open addressing.': Open addressing requires a load factor strictly less than 1.0. A load factor greater than or equal to 1.0 means the table is full or potentially in an infinite loop for insertion, which is incorrect and leads to failure. So, this is incorrect.
The correct options are "Employing a hash function that distributes keys uniformly." and "Implementing separate chaining over open addressing for better worst-case performance."
"
:::
---
Problem-Solving Strategies
When choosing a hash function, prioritize uniformity and efficiency. For integer keys, the division method ($h(k) = k \bmod m$) is simple and effective, especially if $m$ is a prime number. For string keys, polynomial hashing (e.g., $h(s) = (\sum_i s[i] \cdot p^i) \bmod m$ for a small prime $p$ such as 31) often provides good distribution. Always consider the data distribution if known.
For small, fixed-size datasets or when memory is extremely tight, open addressing (especially quadratic or double hashing) can be efficient due to better cache performance. For dynamic datasets with unpredictable load factors or when worst-case performance guarantees are critical, separate chaining is generally more robust as it handles high load factors gracefully.
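The polynomial string hash mentioned above can be sketched as follows. The function name and the constants $p = 31$ and $m = 10^9 + 7$ are conventional illustrative choices, not fixed by the method:

```python
def poly_hash(s: str, p: int = 31, m: int = 10**9 + 7) -> int:
    """Polynomial hash: (sum of s[i] * p^i) mod m, accumulated left to right."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % m    # add s[i] * p^i
        power = (power * p) % m          # advance to p^(i+1)
    return h
```

Because each character is weighted by a different power of $p$, anagrams such as "ab" and "ba" hash differently, which a plain character-sum hash cannot achieve.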
---
Common Mistakes
❌ Using a hash function that groups many keys to a few specific indices (e.g., $h(k) = k \bmod 10$ for keys that are always multiples of 10) leads to excessive collisions and degrades performance to $O(n)$.
✅ Design hash functions that distribute keys as uniformly as possible across the entire table range. Test with representative data if feasible.
❌ Not handling a full hash table in open addressing can lead to infinite loops during insertion (if no empty slot is found).
✅ Always ensure a load factor $\alpha < 1$ for open addressing. Implement resizing (rehashing) to a larger table when the load factor approaches a predefined threshold (e.g., $0.7$ or $0.75$). For quadratic probing, if $m$ is not prime, the probe sequence might not cover all slots, so an insertion can fail even when the table isn't full.
---
Practice Questions
:::question type="NAT" question="A hash table of size $m = 10$ uses linear probing and $h'(k) = k \bmod 10$. If keys $15, 25, 35, 45, 55$ are inserted in that order, how many probes (total, including the initial probe $i = 0$) are required for the insertion of key $55$?" answer="5" hint="Trace the insertion process for each key, counting probes for $55$. Remember that $i = 0$ is the first probe." solution="Step 1: Initialize an empty table of size 10.
Step 2: Insert $15$. $h(15, 0) = 5$. Position $5$ is empty. (1 probe)
Step 3: Insert $25$. $h(25, 0) = 5$. Position $5$ is occupied.
Probe $i = 1$: $h(25, 1) = 6$. Position $6$ is empty. (2 probes)
Step 4: Insert $35$. $h(35, 0) = 5$. Position $5$ is occupied.
Probe $i = 1$: $h(35, 1) = 6$. Position $6$ is occupied.
Probe $i = 2$: $h(35, 2) = 7$. Position $7$ is empty. (3 probes)
Step 5: Insert $45$. $h(45, 0) = 5$. Position $5$ is occupied.
Probe $i = 1$: position $6$ is occupied.
Probe $i = 2$: position $7$ is occupied.
Probe $i = 3$: $h(45, 3) = 8$. Position $8$ is empty. (4 probes)
Step 6: Insert $55$. $h(55, 0) = 5$. Position $5$ is occupied.
Probe $i = 1$: position $6$ is occupied.
Probe $i = 2$: position $7$ is occupied.
Probe $i = 3$: position $8$ is occupied.
Probe $i = 4$: $h(55, 4) = 9$. Position $9$ is empty. (5 probes)
Step 7: The total number of probes for key $55$ is $5$.
"
:::
:::question type="MCQ" question="Which collision resolution technique is most effective at mitigating both primary and secondary clustering?" options=["Linear Probing","Quadratic Probing","Separate Chaining","Double Hashing"] answer="Double Hashing" hint="Consider how each technique generates its probe sequence or handles collisions for different keys." solution="Step 1: Analyze each technique:
- Linear Probing: Suffers from primary clustering, where long runs of occupied slots are created.
- Quadratic Probing: Eliminates primary clustering but can suffer from secondary clustering, where keys that hash to the same initial position follow the same probe sequence.
- Separate Chaining: Does not use probing; it stores colliding elements in linked lists (or other structures) at the hash index. It avoids clustering in the table array itself.
- Double Hashing: Uses two independent hash functions to generate a distinct probe sequence for each key. This effectively prevents both primary and secondary clustering by making probe sequences highly key-dependent.
Double hashing uses a key-dependent step size $h_2(k)$, making each key's probe sequence effectively independent of other keys that share the same initial index $h_1(k)$, provided $h_1$ and $h_2$ are well-chosen.
"
:::
:::question type="MSQ" question="A hash table is used to store records of students identified by their unique roll numbers (integers). Which of the following statements are generally true regarding the performance of this hash table?" options=["The worst-case search time can be $O(n)$ regardless of the hash function.","A higher load factor always leads to better memory utilization without performance degradation.","Resizing the table (rehashing) involves re-calculating hash values for all existing elements.","Collisions are less likely to occur if the hash table size is a prime number."] answer="The worst-case search time can be $O(n)$ regardless of the hash function.,Resizing the table (rehashing) involves re-calculating hash values for all existing elements.,Collisions are less likely to occur if the hash table size is a prime number." hint="Consider average vs. worst-case performance, load factor implications, and properties of prime table sizes." solution="Step 1: Evaluate each statement:
- 'The worst-case search time can be $O(n)$ regardless of the hash function.': This is true. In the worst case, all keys could hash to the same bucket (separate chaining) or require $n$ probes (open addressing), leading to $O(n)$ search time. Even with a well-designed hash function, an adversarial input can force this worst case.
- 'A higher load factor always leads to better memory utilization without performance degradation.': This is false. While a higher load factor means less empty space, it also increases the average number of collisions and thus the average search/insertion/deletion time. Performance degrades significantly as the load factor increases.
- 'Resizing the table (rehashing) involves re-calculating hash values for all existing elements.': This is true. When a hash table is resized (usually to a larger size), a new hash function might be used (or the same function with a new $m$), requiring all existing key-value pairs to be re-inserted into the new, larger table.
- 'Collisions are less likely to occur if the hash table size is a prime number.': This is generally true, especially for the division method $h(k) = k \bmod m$. Using a prime $m$ tends to distribute keys more uniformly than composite numbers, particularly if keys share common factors with $m$.
The correct statements are the first, third, and fourth options.
"
:::
:::question type="NAT" question="A hash table with separate chaining has $m = 10$ buckets. We insert $n = 15$ elements. What is the average number of elements per bucket?" answer="1.5" hint="Average elements per bucket is the definition of load factor." solution="Step 1: The average number of elements per bucket is equivalent to the load factor $\alpha = n/m$.
Step 2: Calculate $\alpha$.
> $\alpha = 15 / 10 = 1.5$
"
:::
:::question type="MCQ" question="Which of the following scenarios is most likely to benefit from using a hash table over a balanced binary search tree (BST)?" options=["Storing data that requires frequent range queries (e.g., all elements between $a$ and $b$).","When strict worst-case $O(\log n)$ time complexity for all operations is required.","Implementing a symbol table where $O(1)$ average lookup is critical and exact ordering is not needed.","When memory usage is extremely constrained and $n$ is very large."] answer="Implementing a symbol table where $O(1)$ average lookup is critical and exact ordering is not needed." hint="Consider the strengths and weaknesses of hash tables vs. BSTs regarding operation complexity and data ordering." solution="Step 1: Compare hash tables and balanced BSTs.
- Hash Tables: Offer $O(1)$ average time for insert, delete, and search. Worst-case can be $O(n)$. Do not inherently maintain order. Memory overhead can be higher due to empty slots or linked lists.
- Balanced BSTs (e.g., AVL, Red-Black Trees): Offer $O(\log n)$ worst-case for insert, delete, search. Maintain elements in sorted order, allowing efficient range queries.
- 'Storing data that requires frequent range queries (e.g., all elements between $a$ and $b$).': This is a strength of balanced BSTs, not hash tables.
- 'When strict worst-case $O(\log n)$ time complexity for all operations is required.': This is a strength of balanced BSTs, not hash tables (which have $O(n)$ worst-case).
- 'Implementing a symbol table where $O(1)$ average lookup is critical and exact ordering is not needed.': This is a classic application where hash tables excel. The $O(1)$ average performance beats $O(\log n)$ for most lookups, and symbol tables often don't require elements to be kept in sorted order.
- 'When memory usage is extremely constrained and $n$ is very large.': Hash tables can have significant memory overhead due to empty slots or pointers in separate chaining. Balanced BSTs might be more memory-efficient in some scenarios, especially if the keys are large. This is not a primary benefit of hash tables.
Option C directly aligns with the primary advantage of hash tables.
"
:::
---
Summary
| # | Formula/Concept | Expression |
|---|-----------------|------------|
| 1 | Division Method Hash | $h(k) = k \bmod m$ |
| 2 | Multiplication Method Hash | $h(k) = \lfloor m \cdot (kA \bmod 1) \rfloor$ |
| 3 | Linear Probing | $h(k, i) = (h'(k) + i) \bmod m$ |
| 4 | Quadratic Probing | $h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$ |
| 5 | Double Hashing | $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$ |
| 6 | Load Factor | $\alpha = n / m$ |
| 7 | Separate Chaining | Each table slot points to a linked list of colliding elements. |
| 8 | Open Addressing | Colliding elements are placed in other empty slots within the table using probing. |
| 9 | Universal Hashing | A family of hash functions offering probabilistic collision guarantees. |

---
What's Next?
This topic connects to:
- Disjoint Set Union (DSU): While not directly related in function, DSU structures often use hash tables internally to map arbitrary elements to integer indices for efficient set operations.
- Caching Algorithms: Hash tables are the underlying data structure for many cache implementations (e.g., LRU cache) to provide fast lookups for cached items.
- Bloom Filters: These probabilistic data structures use multiple hash functions to check for set membership efficiently, with a small probability of false positives.
- Cryptographic Hashing: While standard hash tables use non-cryptographic hash functions, the concept of mapping data to a fixed-size output is shared with cryptographic hashing, which requires additional properties like collision resistance.
---
Proceeding to Graph Representations.
---
Part 2: Graph Representations
Graphs are fundamental data structures used to model pairwise relationships between objects. We examine various methods for representing graphs, which significantly impact the efficiency of graph algorithms.
---
Core Concepts
1. Adjacency Matrix
An adjacency matrix represents a graph as a square matrix where rows and columns correspond to vertices. An entry $A[i][j]$ is 1 if an edge exists from vertex $i$ to vertex $j$, and 0 otherwise. For weighted graphs, $A[i][j]$ stores the weight of the edge $(i, j)$.
Worked Example:
Consider an undirected graph with $V = \{0, 1, 2, 3\}$ and $E = \{(0,1), (0,2), (1,2), (2,3)\}$. We construct its adjacency matrix.
Step 1: Initialize a $4 \times 4$ matrix with all zeros.
Step 2: For each edge $(u, v)$, set $A[u][v] = 1$ and $A[v][u] = 1$ (since it's undirected).
> $A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$
Answer: The adjacency matrix is as shown; it is symmetric because the graph is undirected. For a directed graph, we would only set $A[u][v] = 1$.
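Building the matrix from an edge list is a short loop in Python (the function name is ours; vertices are assumed to be numbered $0$ to $n-1$):

```python
def build_adjacency_matrix(n: int, edges: list, directed: bool = False):
    """Build an n x n adjacency matrix from a list of (u, v) edge pairs."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        if not directed:
            A[v][u] = 1                  # mirror the entry for undirected graphs
    return A
```

Edge lookup is then a single indexing operation, `A[u][v]`, which is the main advantage of this representation.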
:::question type="MCQ" question="What is the space complexity to represent a graph with $V$ vertices and $E$ edges using an adjacency matrix?" options=["$O(V + E)$","$O(V^2)$","$O(E)$","$O(V \cdot E)$"] answer="$O(V^2)$" hint="The matrix size depends only on the number of vertices." solution="An adjacency matrix requires a $V \times V$ grid. Each entry stores a constant amount of information (0, 1, or a weight). Therefore, the total space required is proportional to $V^2$, i.e., $O(V^2)$."
:::
---
2. Adjacency List
An adjacency list represents a graph as an array (or hash table) of lists. Each entry `Adj[u]` in the array contains a list of all vertices $v$ such that there is an edge $(u, v)$. For weighted graphs, each item in the list can be a pair $(v, w)$.
Worked Example:
Consider the same illustrative undirected graph, with V = {1, 2, 3, 4} and E = {(1, 2), (1, 3), (2, 4)}. We construct its adjacency list.
Step 1: Create an array (or hash map) where each index corresponds to a vertex.
> Adj[1] = []
> Adj[2] = []
> Adj[3] = []
> Adj[4] = []
Step 2: For each edge (u, v), add v to u's list and u to v's list (for undirected graphs).
> Adj[1] = [2, 3]
> Adj[2] = [1, 4]
> Adj[3] = [1]
> Adj[4] = [2]
Answer: The adjacency list representation is as shown. If the graph were directed, say with edges 1 → 2, 1 → 3, and 2 → 4, then Adj[1] = [2, 3], Adj[2] = [4], Adj[3] = [], Adj[4] = [].
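The same construction works in code. A minimal sketch (the helper name `adjacency_list` and the example edges are illustrative):

```python
from collections import defaultdict

def adjacency_list(edges, directed=False):
    """Map each vertex to the list of its neighbors."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)  # each undirected edge is stored twice
    # Note: vertices with no incident edges do not appear here;
    # seed them explicitly if isolated vertices matter.
    return dict(adj)

adj = adjacency_list([(1, 2), (1, 3), (2, 4)])
print(adj)  # {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
```

With `directed=True`, only the tail's list grows, matching the directed variant described above.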
:::question type="MSQ" question="Which of the following statements are true regarding adjacency lists for a simple graph with V vertices and E edges?" options=["The space complexity is O(V + E).","Checking if an edge (u, v) exists can take O(V) time in the worst case.","Iterating over all neighbors of a vertex u takes O(deg(u)) time.","It is generally preferred for dense graphs over adjacency matrices."] answer="The space complexity is O(V + E).,Checking if an edge (u, v) exists can take O(V) time in the worst case.,Iterating over all neighbors of a vertex u takes O(deg(u)) time." hint="Consider the structure and operations for adjacency lists. For edge existence, a linear scan of a list might be needed. For dense graphs, E can be close to V^2." solution="1. The space complexity is O(V + E): Correct. Each vertex stores a list head (or pointer), giving O(V), and each edge appears once in u's list (directed) or twice (undirected, once in u's list and once in v's list), giving O(E) total space for the lists. Thus, O(V + E).
2. Checking if an edge (u, v) exists can take O(V) time in the worst case: Correct. We must scan Adj[u], which may contain up to V - 1 entries.
3. Iterating over all neighbors of a vertex u takes O(deg(u)) time: Correct. The list Adj[u] contains exactly the neighbors of u.
4. It is generally preferred for dense graphs over adjacency matrices: Incorrect. For dense graphs (E close to V^2), the adjacency matrix uses comparable space and offers O(1) edge checks, so it is usually preferred."
:::
---
3. Incidence Matrix
An incidence matrix represents a graph with rows corresponding to vertices and columns corresponding to edges. An entry M[i][j] is 1 if vertex i is incident to edge e_j, and 0 otherwise. For directed graphs, M[i][j] can be 1 if i is the tail of e_j, -1 if i is the head, and 0 otherwise.
Worked Example:
Consider, for illustration, an undirected graph with V = {1, 2, 3, 4} and E = {e1 = (1, 2), e2 = (1, 3), e3 = (2, 4)}. We construct its incidence matrix.
Step 1: Initialize a 4 × 3 matrix M with all zeros, where rows are vertices and columns are edges.
> M[i][j] = 0 for all i, j
Step 2: For each edge e_j = (u, v), set M[u][j] = 1 and M[v][j] = 1.
> M = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]], with rows for vertices 1–4 and columns for e1, e2, e3
Answer: The incidence matrix is as shown.
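A small sketch makes both conventions concrete. The function name and the 0-based example graph are illustrative; the directed case uses the tail = 1, head = -1 convention described above:

```python
def incidence_matrix(num_vertices, edges, directed=False):
    """Rows correspond to vertices, columns to edges."""
    M = [[0] * len(edges) for _ in range(num_vertices)]
    for j, (u, v) in enumerate(edges):
        if directed:
            M[u][j] = 1    # u is the tail of edge j
            M[v][j] = -1   # v is the head of edge j
        else:
            M[u][j] = 1
            M[v][j] = 1
    return M

# Illustrative undirected graph on vertices 0..3 with three edges.
M = incidence_matrix(4, [(0, 1), (0, 2), (1, 3)])
for row in M:
    print(row)
```

For an undirected simple graph, every column contains exactly two 1s, one per endpoint.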
:::question type="NAT" question="For a directed graph with V = 5 vertices and E = 7 edges, what is the exact number of entries in its incidence matrix?" answer="35" hint="The incidence matrix has dimensions V × E." solution="An incidence matrix has V rows (for vertices) and E columns (for edges).
The total number of entries is V × E.
Given V = 5 and E = 7, the number of entries is 5 × 7 = 35."
:::
---
4. Comparison of Representations
The choice of graph representation depends on the specific graph properties (dense vs. sparse) and the operations most frequently performed. Adjacency matrices are suitable for dense graphs and constant-time edge lookups. Adjacency lists are better for sparse graphs and efficient traversal of neighbors. Incidence matrices are rarely used for general graph algorithms due to their higher space complexity and less direct access to neighbor information.
| Feature | Adjacency Matrix | Adjacency List | Incidence Matrix |
|---------------------|------------------|----------------|------------------|
| Space | O(V^2) | O(V + E) | O(V · E) |
| Edge check (u, v) | O(1) | O(deg(u)) | O(E) (scan columns) |
| Find neighbors of u | O(V) | O(deg(u)) | O(V · E) (scan row, then columns) |
| Add edge | O(1) | O(1) | O(V · E) (add column) |
| Remove edge (u, v) | O(1) | O(deg(u)) | O(V · E) (remove column) |
Worked Example:
A graph algorithm requires frequent checks for the existence of an edge between any two given vertices. The graph is known to be dense, meaning E = Θ(V^2). Determine the most efficient representation.
Step 1: Analyze the primary operation: checking for edge existence.
For an adjacency matrix, this is an O(1) operation.
For an adjacency list, this is O(deg(u)) in the worst case, which can be O(V) for dense graphs.
Step 2: Analyze graph density and space complexity.
For a dense graph, E = Θ(V^2).
Adjacency matrix space: O(V^2).
Adjacency list space: O(V + E) = O(V^2).
Incidence matrix space: O(V · E) = O(V^3).
Step 3: Compare efficiency for the required operation and space.
Both the adjacency matrix and the adjacency list have comparable space complexity for dense graphs. However, the adjacency matrix provides O(1) edge lookup, which is superior to the worst-case O(V) for adjacency lists. The incidence matrix is too space-inefficient.
Answer: The adjacency matrix is the most efficient representation for frequent edge existence checks in a dense graph.
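The trade-off in this worked example can be seen directly in code. The two helper functions below are illustrative sketches (names and the tiny example graph are our own):

```python
def has_edge_matrix(A, u, v):
    # Adjacency matrix: a single array access, O(1).
    return A[u][v] == 1

def has_edge_list(adj, u, v):
    # Adjacency list: linear scan of u's neighbors, O(deg(u)),
    # which approaches O(V) in a dense graph.
    return v in adj.get(u, [])

A = [[0, 1], [1, 0]]      # vertices 0 and 1 joined by one edge
adj = {0: [1], 1: [0]}    # the same graph as an adjacency list
print(has_edge_matrix(A, 0, 1), has_edge_list(adj, 0, 1))  # True True
```

Both calls give the same answer; only the cost per query differs as the graph grows denser.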
:::question type="MCQ" question="Which graph representation is generally most suitable for performing Breadth-First Search (BFS) on a sparse graph, where E = O(V)?" options=["Adjacency Matrix","Incidence Matrix","Adjacency List","Edge List"] answer="Adjacency List" hint="BFS involves exploring neighbors of a vertex efficiently. Consider the time complexity for finding neighbors for each representation." solution="BFS requires efficiently finding all neighbors of a given vertex.
For a sparse graph, E is much smaller than V^2. The time complexity of BFS using an adjacency list is O(V + E), superior to the O(V^2) (adjacency matrix) or O(V · E) (incidence matrix) required by the other representations."
:::
---
Advanced Applications
Choosing the right representation is critical for algorithm performance, especially when dealing with specific graph types or operations. For instance, some algorithms rely on matrix operations, while others benefit from direct neighbor access.
Worked Example:
We need to implement an algorithm that frequently calculates the degree of all vertices in an undirected graph. The graph can be either sparse or dense. Which representation allows for the most efficient degree calculation for all vertices?
Step 1: Consider the definition of degree. For an undirected graph, deg(v) is the number of edges incident to v.
Step 2: Evaluate degree calculation for each representation:
* Adjacency Matrix: To find deg(v), we must sum the entries in row v (or column v). This takes O(V) time for one vertex. For all vertices, it takes O(V^2) time.
* Adjacency List: To find deg(v), we simply count the elements in Adj[v]. This takes O(deg(v)) time for one vertex. For all vertices, the total time is O(V + E) (since each edge is listed twice).
* Incidence Matrix: To find deg(v), we must sum the entries in row v. This takes O(E) time for one vertex. For all vertices, it takes O(V · E) time.
Step 3: Compare efficiencies.
The adjacency list allows calculating all degrees in O(V + E) time, which is generally faster than O(V^2) or O(V · E), especially for sparse graphs. Even for dense graphs where E = Θ(V^2), O(V + E) is comparable to O(V^2) but potentially with a smaller constant factor.
Answer: The adjacency list is the most efficient representation for calculating the degrees of all vertices, achieving O(V + E) time complexity.
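The adjacency-list approach is a one-liner in practice. A minimal sketch (the function name and sample graph are illustrative); note that in Python, `len()` on a list is O(1), so the loop itself is O(V) once the adjacency list exists:

```python
def all_degrees(adj):
    # O(V) over an existing adjacency list; building the list
    # already cost O(V + E), so the total is O(V + E).
    return {u: len(neighbors) for u, neighbors in adj.items()}

degrees = all_degrees({1: [2, 3], 2: [1, 4], 3: [1], 4: [2]})
print(degrees)                # {1: 2, 2: 2, 3: 1, 4: 1}
print(sum(degrees.values()))  # 6, i.e. twice the number of edges (3)
```

The final sum illustrates the Handshaking Lemma: each undirected edge contributes to two lists.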
:::question type="NAT" question="A directed graph has 10 vertices and 15 edges. If it is represented by an adjacency matrix, how many entries will be non-zero?" answer="15" hint="Each directed edge corresponds to exactly one non-zero entry in the adjacency matrix." solution="In an adjacency matrix for a directed graph, an edge (u, v) is represented by a single non-zero entry A[u][v].
The number of non-zero entries in the adjacency matrix is equal to the number of directed edges in the graph.
Given E = 15 edges, there will be 15 non-zero entries."
:::
---
Problem-Solving Strategies
When faced with a graph problem, first identify:
- Graph Density: Is it sparse (E = O(V)) or dense (E = Θ(V^2))?
- Key Operations: What are the most frequent operations (edge lookup, neighbor traversal, adding/removing vertices/edges)?
- Space Constraints: Are there memory limitations?
For sparse graphs and neighbor-centric algorithms (BFS, DFS, Dijkstra), adjacency lists are usually preferred. For dense graphs and quick edge existence checks (Floyd-Warshall, some matrix-based algorithms), adjacency matrices are often better. Incidence matrices are rarely optimal for general graph algorithms.
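As a concrete instance of the "adjacency lists for neighbor-centric algorithms" rule, here is a minimal BFS sketch running in O(V + E) time. The function name and sample graph are illustrative assumptions:

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source; each vertex and edge is touched O(1) times."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:          # first visit fixes the BFS distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
print(bfs_distances(adj, 1))  # {1: 0, 2: 1, 3: 1, 4: 2}
```

With an adjacency matrix instead, the inner loop would scan a full row of V entries per dequeued vertex, giving O(V^2) overall.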
---
Common Mistakes
❌ Mistake: Treating directed graph representations the same as undirected. For example, setting A[u][v] = 1 and A[v][u] = 1 for a directed edge (u, v) in an adjacency matrix.
✅ Correct approach: For a directed graph, an edge (u, v) (from u to v) only sets A[u][v] = 1 in an adjacency matrix, or adds v to u's adjacency list. A[v][u] remains 0, and u is not added to v's list, unless there is also a directed edge (v, u). For undirected graphs, the matrix is symmetric, and entries are added to both lists.
❌ Mistake: Assuming adjacency list space complexity is always less than adjacency matrix space complexity without considering graph density.
✅ Correct approach: Adjacency list space is O(V + E), while adjacency matrix space is O(V^2). If the graph is dense (i.e., E = Θ(V^2)), then O(V + E) becomes O(V^2), making their space complexities asymptotically equivalent. The constant factors might differ, but the asymptotic behavior is the same.
---
Practice Questions
:::question type="MCQ" question="Consider an undirected graph with V vertices and E edges. Which of the following statements is true about its adjacency matrix A?" options=["A is always symmetric.","The sum of all entries in A equals E.","The diagonal entries represent self-loops.","If the graph is connected, all entries in A must be 1."] answer="A is always symmetric." hint="Recall the definition of an undirected graph and how edges are represented in its adjacency matrix." solution="1. A is always symmetric: Correct. For an undirected edge (u, v), we set A[u][v] = 1 and A[v][u] = 1. Thus, A[i][j] = A[j][i] for all i, j, making the matrix symmetric.
2. The sum of all entries in A equals E: Incorrect. Each undirected edge (u, v) contributes two entries (A[u][v] and A[v][u]), so the sum is 2E.
3. The diagonal entries represent self-loops: A self-loop at v would set A[v][v] = 1, but in a simple graph there are no self-loops, so the diagonal is all zeros; this is not a property that holds in general.
4. If the graph is connected, all entries in A must be 1: Incorrect. Connectivity requires a path between every pair of vertices, not a direct edge."
:::
:::question type="NAT" question="A complete undirected graph K_N has N vertices. How many elements are in its adjacency list representation (sum of lengths of all lists)?" answer="N*(N-1)" hint="In a complete graph, every vertex is connected to every other vertex. Each edge appears twice in the adjacency lists." solution="In a complete undirected graph K_N, every vertex is connected to the other N - 1 vertices.
The degree of each vertex is N - 1.
The sum of the lengths of all adjacency lists is N × (N - 1).
For an undirected graph, this sum is 2E.
The number of edges in K_N is N(N - 1)/2.
Therefore, the sum of the lengths of all lists is 2 × N(N - 1)/2 = N(N - 1)."
:::
:::question type="MSQ" question="Which of the following operations are generally more efficient using an adjacency matrix compared to an adjacency list for a dense graph?" options=["Checking if an edge exists between two specific vertices.","Adding a new vertex to the graph.","Iterating over all neighbors of a specific vertex.","Finding the transpose of a directed graph."] answer="Checking if an edge exists between two specific vertices.,Finding the transpose of a directed graph." hint="Consider the time complexity of each operation for both representations, especially for dense graphs where E = Θ(V^2)." solution="1. Checking if an edge exists between two specific vertices: Correct. Adjacency matrix: O(1). Adjacency list: O(V) in the worst case for a dense graph.
2. Adding a new vertex to the graph: Incorrect. The matrix must be reallocated to size (V + 1) × (V + 1), costing O(V^2), while an adjacency list simply adds an empty list in O(1).
3. Iterating over all neighbors of a specific vertex: Incorrect. The matrix requires scanning an entire row in O(V); the list touches only the deg(u) neighbors.
4. Finding the transpose of a directed graph: Correct. The matrix transpose is a simple swap of A[i][j] and A[j][i], doable in place in O(V^2); an adjacency list must be rebuilt into a new structure with extra allocation overhead."
:::
:::question type="MCQ" question="An algorithm needs to frequently determine if a vertex has any outgoing edges. Which representation allows for the fastest check for a directed graph?" options=["Adjacency Matrix","Adjacency List","Incidence Matrix","Edge List"] answer="Adjacency List" hint="Consider how outgoing edges are stored in each representation." solution="1. Adjacency Matrix: To check if v has any outgoing edges, we must scan row v. This takes O(V) time.
2. Adjacency List: We only check whether Adj[v] is non-empty, which takes O(1) time.
3. Incidence Matrix: We must scan row v across all E columns, taking O(E) time.
4. Edge List: We must scan all E edges, taking O(E) time.
Therefore, checking if a vertex has any outgoing edges is fastest with an adjacency list, taking O(1) time."
:::
:::question type="NAT" question="A graph has 9 edges. If it is an undirected graph with no self-loops, what is the sum of degrees of all vertices?" answer="18" hint="The sum of degrees in an undirected graph is related to the number of edges." solution="For any undirected graph, the sum of the degrees of all vertices is equal to twice the number of edges. This is known as the Handshaking Lemma.
Sum of degrees = 2E.
Given E = 9, the sum of degrees = 2 × 9 = 18."
:::
---
Summary
|   | Formula/Concept | Expression |
|---|-----------------|------------|
| 1 | Adjacency Matrix Space | O(V^2) |
| 2 | Adjacency List Space (Directed) | O(V + E) |
| 3 | Adjacency List Space (Undirected) | O(V + E), each edge stored twice |
| 4 | Incidence Matrix Space | O(V · E) |
| 5 | Edge Check (Adj. Matrix) | O(1) |
| 6 | Edge Check (Adj. List) | O(deg(u)), worst case O(V) |
| 7 | Sum of Degrees (Undirected) | Σ deg(v) = 2E |
---
What's Next?
This topic connects to:
- Graph Traversal Algorithms (BFS, DFS): These algorithms directly utilize graph representations, especially adjacency lists, for efficient neighbor exploration.
- Shortest Path Algorithms (Dijkstra, Bellman-Ford): The performance of these algorithms is heavily influenced by the chosen graph representation, particularly for accessing edge weights and neighbors.
- Minimum Spanning Tree Algorithms (Prim, Kruskal): These algorithms also rely on efficient graph representation for managing edges and connectivity.
---
Chapter Summary
- Hash Tables: Offer O(1) average time complexity for insertions, deletions, and lookups by mapping keys to array indices via a hash function.
- Hashing Functions: Critical for distributing keys uniformly across the table, minimizing collisions, and ensuring efficient performance.
- Collision Resolution: Strategies such as chaining (using linked lists) or open addressing (e.g., linear probing, quadratic probing, double hashing) are employed to handle multiple keys mapping to the same hash index.
- Load Factor: Defined as α = n/m, where n is the number of stored keys and m is the table size. A high load factor significantly degrades hash table performance, often necessitating rehashing to a larger table.
- Graph Representations: Graphs are primarily represented using either adjacency matrices or adjacency lists, each with distinct space and time complexity trade-offs.
- Adjacency Matrix: A V × V matrix where A[i][j] = 1 indicates an edge between vertex i and vertex j. Space complexity is O(V^2), making it suitable for dense graphs or when edge existence checks are paramount.
- Adjacency List: An array of lists, where Adj[u] contains all vertices adjacent to vertex u. Space complexity is O(V + E), making it generally more memory-efficient for sparse graphs.
---
Chapter Review Questions
:::question type="MCQ" question="A hash table uses chaining for collision resolution and has a size of 10. The hash function is h(k) = k mod 10. Keys are inserted in the following order: 12, 23, 34, 45, 56, 13, 24, 35, 46. Assuming a worst-case search for a key that exists, what is the maximum number of key comparisons required to find any key currently in the table?" options=["1","2","3","4"] answer="2" hint="Trace the insertions and observe the length of the longest chain. The number of comparisons for a key in a chain is its position in the chain." solution="Let's trace the insertions and the state of the hash table with chaining:
* h(12) = 2: chain at index 2 becomes [12]
* h(23) = 3: chain at index 3 becomes [23]
* h(34) = 4: chain at index 4 becomes [34]
* h(45) = 5: chain at index 5 becomes [45]
* h(56) = 6: chain at index 6 becomes [56]
* h(13) = 3: chain at index 3 becomes [23, 13]
* h(24) = 4: chain at index 4 becomes [34, 24]
* h(35) = 5: chain at index 5 becomes [45, 35]
* h(46) = 6: chain at index 6 becomes [56, 46]
The final state of the relevant chains:
* Index 2: [12] (1 comparison for 12)
* Index 3: [23, 13] (1 comparison for 23, 2 for 13)
* Index 4: [34, 24] (1 comparison for 34, 2 for 24)
* Index 5: [45, 35] (1 comparison for 45, 2 for 35)
* Index 6: [56, 46] (1 comparison for 56, 2 for 46)
The maximum number of comparisons for any existing key is 2."
:::
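The chaining trace in the question above can be checked mechanically. This short simulation (variable names are our own) rebuilds the chains and reports the worst-case comparison count:

```python
TABLE_SIZE = 10
keys = [12, 23, 34, 45, 56, 13, 24, 35, 46]

# Chaining: each slot holds a Python list acting as its chain.
table = [[] for _ in range(TABLE_SIZE)]
for k in keys:
    table[k % TABLE_SIZE].append(k)  # h(k) = k mod 10

# Searching for an existing key costs one comparison per chain position.
worst = max(chain.index(k) + 1 for chain in table for k in chain)
print(table[3], worst)  # [23, 13] 2
```

The longest chains have length 2, confirming the answer of 2 comparisons.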
:::question type="NAT" question="Consider an undirected graph with V = 100 vertices and E = 500 edges. If an adjacency matrix requires V^2 units of space and an adjacency list requires V + 2E units of space (for an undirected graph), what is the approximate ratio (rounded to the nearest integer) of the space required by the adjacency matrix to that required by the adjacency list?" answer="9" hint="Calculate the space for each representation using the given formulas and then find their ratio." solution="Space for Adjacency Matrix = V^2 = 100^2 = 10000 units.
Space for Adjacency List = V + 2E = 100 + 2 × 500 = 1100 units.
Ratio = 10000 / 1100 ≈ 9.09.
Rounded to the nearest integer, the ratio is 9."
:::
:::question type="MCQ" question="Which of the following scenarios most strongly suggests that an adjacency list representation would be more appropriate for a graph than an adjacency matrix?" options=["The graph is dense, with E approaching V^2.","Frequent checks for the existence of an edge between any two specific vertices are a critical requirement.","The graph is sparse, and memory efficiency for storing edges is a primary concern.","The number of vertices is very small, making V^2 a manageable constant." ] answer="The graph is sparse, and memory efficiency for storing edges is a primary concern." hint="Consider the space complexity of each representation relative to the graph's density and the time complexity of common operations." solution="* The graph is dense, with E approaching V^2: An adjacency matrix (O(V^2) space) is generally preferred for dense graphs because its space usage is proportional to the number of edges, and it offers O(1) edge lookup.
* Frequent checks for the existence of an edge between any two specific vertices are a critical requirement: An adjacency matrix provides O(1) edge existence checks, which is superior to an adjacency list's O(V) worst case.
* The graph is sparse, and memory efficiency for storing edges is a primary concern: An adjacency list uses O(V + E) space, which is significantly more memory-efficient than O(V^2) for sparse graphs where E is much smaller than V^2. This is the strongest reason to prefer an adjacency list.
* The number of vertices is very small, making V^2 a manageable constant: While a small V makes an adjacency matrix feasible, it doesn't necessarily make it *more* appropriate than an adjacency list, especially if the graph is sparse. The choice still depends on density and operation priorities.
Therefore, for sparse graphs where memory efficiency is key, an adjacency list is preferred."
:::
---
What's Next?
Building upon the foundational understanding of hash tables and graph representations from this chapter, your CMI preparation should next focus on the algorithms that leverage these structures. For hash tables, delve into their applications in data structures like sets and maps, and advanced hashing techniques. For graphs, explore core algorithms such as Breadth-First Search (BFS), Depth-First Search (DFS), shortest path algorithms (e.g., Dijkstra's, Bellman-Ford), and Minimum Spanning Tree (MST) algorithms (e.g., Prim's, Kruskal's). A solid grasp of these algorithms and their complexity analysis is paramount.