2026-01-26
By the end of this lecture, you should be able to:

- describe linked lists and analyze the running times of their search, insert, and delete operations,
- explain how dictionaries can be implemented with arrays, linked lists, direct-address tables, and hash tables,
- construct hash functions using the division and multiplication methods, and
- resolve collisions by chaining and by open addressing with linear probing or double hashing.
In the random-access machine model, arrays provide constant-time access to elements at any given index. However, checking whether a specific value exists in an unsorted array of size \(n\) requires scanning the entire array. This process has a worst-case running time of \(\Theta(n)\), which occurs when the value is absent.
Additionally, inserting and deleting elements in arrays is inefficient if you want to maintain as much of the previous sequence as possible. For instance, inserting an element at the beginning of an array requires shifting all subsequent elements one position to the right, resulting in a worst-case running time of \(\Theta(n)\).
In this lecture, we will explore linked lists and hash tables, which potentially accelerate these operations.
Definition
A singly linked list is a data structure where each element contains both a key and a pointer to the next element in the list. The end of the list is marked by a special pointer, nil. Additionally, the list possesses a \(\mathit{head}\) attribute, which points to the first element in the list.
A doubly linked list extends this structure by giving each element an additional pointer to the previous element; for the first element, this pointer is nil.
To search a linked list, start at the head and traverse the list by following the pointers from one element to the next. The traversal continues until either the desired key is found or the end of the list is reached. If the key is found, a pointer to the element containing it is returned; otherwise, the procedure returns nil.
\begin{algorithm}
\begin{algorithmic}
\Procedure{List-Search}{$L$, $k$}
\State $x = L.\mathit{head}$
\While{$x \neq$ \textsc{nil} and $x.\mathit{key} \neq k$}
\State $x = x.\mathit{next}$
\EndWhile
\Return $x$
\EndProcedure
\end{algorithmic}
\end{algorithm}
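The pseudocode translates almost line for line into Python. The following sketch is illustrative only, with `None` standing in for nil and class and function names chosen by me:

```python
class Node:
    """A singly linked list element holding a key and a next pointer."""
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

def list_search(head, k):
    """Return the first node whose key equals k, or None if absent."""
    x = head
    while x is not None and x.key != k:
        x = x.next
    return x

# Build the list 7 -> 10 -> 6 and search it.
head = Node(7, Node(10, Node(6)))
print(list_search(head, 10).key)  # 10
print(list_search(head, 3))       # None
```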
In the example below, List-Search\((L, 10)\) returns a pointer to the element at address 3E.
When inserting a new key into a linked list, two cases must be distinguished:
List-Prepend\((L, x)\):
The new key, pointed to by \(x\), is inserted at the beginning of the list.
List-Insert\((x, y)\):
The new key is to be inserted after an existing key. Here, \(x\) is assumed to be a pointer to the new key and \(y\) a pointer to the existing key. The list \(L\) is not a parameter of List-Insert because only the existing list element \(y\) is required as input, not the entire list.
The procedures for both cases are detailed on the following slides, where the list is assumed to be doubly-linked.
\begin{algorithm}
\begin{algorithmic}
\Procedure{List-Prepend}{$L$, $x$}
\State $x.\mathit{next} = L.\mathit{head}$
\State $x.\mathit{prev} = $ \textsc{nil}
\If{$L.\mathit{head} \neq$ \textsc{nil}}
\State $L.\mathit{head}.\mathit{prev} = x$
\EndIf
\State $L.\mathit{head} = x$
\EndProcedure
\end{algorithmic}
\end{algorithm}
The example below presents the result of calling List-Prepend\((L, x)\), where \(x\) points to the address 4D and \(x.\mathit{key} = 15\):
\begin{algorithm}
\begin{algorithmic}
\Procedure{List-Insert}{$x$, $y$}
\State $x.\mathit{next} = y.\mathit{next}$
\State $x.\mathit{prev} = y$
\If{$y.\mathit{next} \neq$ \textsc{nil}}
\State $y.\mathit{next}.\mathit{prev} = x$
\EndIf
\State $y.\mathit{next} = x$
\EndProcedure
\end{algorithmic}
\end{algorithm}
The example below illustrates the result of calling List-Insert\((x, y)\), where \(x\) points to the address 1B, \(x.\mathit{key} = 9\), and \(y\) is the element at the address 3E:
The following procedure removes the element pointed to by \(x\) from the list \(L\):
\begin{algorithm}
\begin{algorithmic}
\Procedure{List-Delete}{$L$, $x$}
\If{$x.\mathit{prev} \neq$ \textsc{nil}}
\State $x.\mathit{prev}.\mathit{next} = x.\mathit{next}$
\Else
\State $L.\mathit{head} = x.\mathit{next}$
\EndIf
\If{$x.\mathit{next} \neq$ \textsc{nil}}
\State $x.\mathit{next}.\mathit{prev} = x.\mathit{prev}$
\EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
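A minimal Python sketch of the three doubly linked list procedures above; the class and method names are my own choices, and `None` again stands in for nil:

```python
class Node:
    """A doubly linked list element."""
    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None

class DoublyLinkedList:
    def __init__(self):
        self.head = None

    def prepend(self, x):
        """Insert node x at the front of the list (List-Prepend)."""
        x.next = self.head
        x.prev = None
        if self.head is not None:
            self.head.prev = x
        self.head = x

    @staticmethod
    def insert_after(x, y):
        """Insert node x immediately after node y (List-Insert)."""
        x.next = y.next
        x.prev = y
        if y.next is not None:
            y.next.prev = x
        y.next = x

    def delete(self, x):
        """Unlink node x from the list (List-Delete)."""
        if x.prev is not None:
            x.prev.next = x.next
        else:
            self.head = x.next
        if x.next is not None:
            x.next.prev = x.prev

# Build the list 15 -> 7 -> 9, then delete 7, leaving 15 -> 9.
L = DoublyLinkedList()
seven = Node(7)
L.prepend(seven)
L.prepend(Node(15))
DoublyLinkedList.insert_after(Node(9), seven)
L.delete(seven)
```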
The example below demonstrates the result of calling List-Delete\((L, x)\), where \(x\) points to the address 2A:
| Operation | Worst-Case Running Time | Reason |
|---|---|---|
| List-Search | \(\Theta(n)\) | Must examine all elements if key is not in the list. |
| List-Prepend, List-Insert, List-Delete | \(\Theta(1)\) | Involves only pointer updates. Note that the pointer to the element to be inserted or deleted must be provided as an argument. If only the key is known, a \(\Theta(n)\) search must be performed first. |
Compared to an unsorted array, a doubly-linked list does not reduce the asymptotic growth rate of the search operation in the worst case. However, insertions and deletions are faster than the \(\Theta(n)\) time required for an array.
Both arrays and linked lists can be used to implement sets:
Definition
A set is an unordered collection of unique elements.
For instance, the arrays and linked lists shown below represent the same set, \(\{0, 6, 7, 10\}\). The order of the elements is arbitrary and does not affect the representation of the set.
Definition
A dictionary is a data structure that stores a set of elements, referred to as keys, and supports the following operations:

- Search: given a key \(k\), return a pointer to the element with key \(k\), or nil if no such element exists,
- Insert: add a new key to the set, and
- Delete: remove a given key from the set.
Because keys constitute a set, they must be unique.
Definition
Satellite data are objects associated with keys in a dictionary. These associations remain unchanged during any dictionary operation.
Examples: in an employee database, the employee ID can serve as the key, with the remaining record (name, department, salary) as satellite data; in a phone book, names are keys and phone numbers are satellite data.
In principle, an unsorted array can be used as a dictionary: searching scans the array from left to right, insertion appends the new key at the end in \(\Theta(1)\) time, and deletion overwrites the deleted key with the last element in \(\Theta(1)\) time (once its position is known).
The \(\Theta(n)\) worst-case running time of the search operation renders unsorted arrays impractical for large dictionaries.
By storing keys in a sorted array and applying binary search, the search operation can be improved to \(\Theta(\log n)\) time.
However, insertion and deletion are slower than for unsorted arrays—\(\Theta(n)\) in the worst case—because the keys must be shifted to maintain the sorted order. For example, if the key to be added or deleted is the minimum in the set, \(\Theta(n)\) shifts are required.
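This trade-off can be seen with Python's bisect module as a stand-in for a sorted-array dictionary; a quick sketch, not a full implementation:

```python
import bisect

keys = [0, 6, 7, 10]   # sorted array of keys

def search(k):
    """Binary search: Theta(log n) comparisons."""
    i = bisect.bisect_left(keys, k)
    return i if i < len(keys) and keys[i] == k else None

def insert(k):
    """Keeps the array sorted, but shifts up to n elements."""
    bisect.insort(keys, k)

insert(3)          # keys is now [0, 3, 6, 7, 10]
print(search(6))   # 2 (index of key 6)
print(search(5))   # None
```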
Linked lists exhibit the same asymptotic worst-case running times as unsorted arrays: \(\Theta(n)\) for search and \(\Theta(1)\) for insertion and deletion.
Sorting the linked list does not improve the search time because the list must still be traversed sequentially from the head to locate the key.
Assume that the keys to be searched for or inserted into a dictionary are drawn from identical probability distributions and that each stored key is equally likely to be deleted. Under these assumptions, the average running times of the dictionary operations exhibit the same growth rates as their worst-case counterparts:
| Operation | Unsorted Array | Sorted Array | Linked List |
|---|---|---|---|
| Search | \(\Theta(n)\) | \(\Theta(\log n)\) | \(\Theta(n)\) |
| Insert | \(\Theta(1)\) | \(\Theta(n)\) | \(\Theta(1)\) |
| Delete | \(\Theta(1)\) | \(\Theta(n)\) | \(\Theta(1)\) |
In summary, arrays and linked lists are suboptimal for implementing dictionaries because at least one of the three dictionary operations—search, insert, or delete—is slow, with running times growing linearly with the dictionary size.
On the following page, we introduce direct-address tables, which allow all dictionary operations to run in \(O(1)\) time in the worst case.
Definition
A direct-address table \(T\) is an array-based data structure that stores a set of keys, each of which is an integer in the range from \(0\) to \(m - 1\), where \(m\) is the table size. The set of all possible keys \(\{0, 1, \ldots, m-1\}\) is called the universe.
For each key \(k\) in the universe, the table contains a slot \(T[k]\) that can store the key and any associated satellite data. If \(k\) is not in the set to be stored, \(T[k]\) is assigned the value nil.
The illustration on the following page depicts a direct-address table with a universe of size \(m = 10\) and keys in \(\{1, 4, 5, 9\}\).
Each of the three dictionary operations can be implemented in \(O(1)\) time using direct-address tables:
\begin{algorithm}
\begin{algorithmic}
\Procedure{Direct-Address-Search}{$T$, $k$}
\Return $T[k]$
\EndProcedure
\Procedure{Direct-Address-Insert}{$T$, $x$}
\State $T[x.\mathit{key}] = x$
\EndProcedure
\Procedure{Direct-Address-Delete}{$T$, $x$}
\State $T[x.\mathit{key}] = $ \textsc{nil}
\EndProcedure
\end{algorithmic}
\end{algorithm}
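A direct Python sketch of these three procedures, with `None` standing in for nil; the `Element` class is an assumed container for a key and its satellite data:

```python
class Element:
    """A key plus (optional) satellite data."""
    def __init__(self, key, data=None):
        self.key = key
        self.data = data

class DirectAddressTable:
    """Stores elements with integer keys in {0, ..., m - 1}."""
    def __init__(self, m):
        self.slots = [None] * m   # None plays the role of nil

    def search(self, k):
        return self.slots[k]

    def insert(self, x):
        self.slots[x.key] = x

    def delete(self, x):
        self.slots[x.key] = None

T = DirectAddressTable(10)
T.insert(Element(4, "satellite data for key 4"))
print(T.search(4).data)   # satellite data for key 4
print(T.search(5))        # None
```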
Although fast, direct-address tables have significant drawbacks:

- Every key must be a nonnegative integer smaller than the table size \(m\).
- The table requires \(\Theta(|U|)\) memory for the universe \(U\), even if the set \(K\) of keys actually stored is much smaller than \(U\). If the universe is large, allocating the table may be impractical or impossible.
To address these limitations, we will now introduce hash tables, which generally reduce memory requirements to \(\Theta(|K|)\) at the expense of an \(O(|K|)\) worst-case running time for dictionary operations.
However, the average running time for hash tables is \(O(1)\) under realistic assumptions. Moreover, with well-designed hash functions, it is highly unlikely to experience the worst-case scenario.
Definition
A hash table \(T\) is an array-based data structure that stores a set of keys from a universe \(U\) by mapping them to array indices using a hash function \(h: U \to \{0, 1, \ldots, m-1\}\), where \(m\) is the array size. Specifically, the key \(k\) is stored in the slot \(T[h(k)]\).
We say that “the key \(k\) hashes to slot \(h(k)\)” and that “\(h(k)\) is the hash value of key \(k\).”
If two distinct keys \(k_i\) and \(k_j\) hash to the same slot, we encounter a collision.
We will assume that \(h\) can be computed in \(O(1)\) time. A direct-address table is a special case of a hash table where \(h\) is the identity function.
The hash function \(h\) maps the keys to the slots in the hash table. In the figure below, \(k_1\) and \(k_2\) hash to the same slot, causing a collision.
Definition
A hash function \(h\) is said to be independent and uniform if it satisfies the following properties:

- Uniformity: each key is equally likely to hash to any of the \(m\) slots.
- Independence: the slot to which a key hashes is independent of the slots to which all other keys hash.
Uniform hashing is an idealized model that helps us analyze the expected behavior and key properties of hash tables, such as collision frequency and average running times.
In practice, hash functions are not independent and uniform. Instead, they are designed to be computationally efficient and deterministic. However, with careful hash-function design, we can approximate the ideal behavior in practice.
Here are common techniques to generate hash functions:

- the division method,
- the multiplication method, and
- universal hashing.
These techniques will be discussed in the following slides.
Assume that every key is a nonnegative integer. If necessary, a surrogate key can be created by mapping each input key to a unique nonnegative integer (e.g., \(\text{A} \to 0\), \(\text{B} \to 1\), etc.).
The division method generates hash values by applying simple arithmetic to the nonnegative integer key \(k\):
\[\begin{equation*} h(k) = k \bmod m, \end{equation*}\]
where:

- \(k\) is the nonnegative integer key,
- \(m\) is the number of slots in the hash table, and
- \(k \bmod m\) is the remainder of dividing \(k\) by \(m\).
To help the division method spread keys more evenly, choose \(m\) to be a prime number. Using a prime breaks many simple patterns in the keys (for example, when lots of keys share the same last digit or are multiples of a fixed number), which otherwise can cause many keys to land in the same slots. Also avoid choosing \(m\) close to a power of 2 (2, 4, 8, 16, …), since patterns in the low-order bits can then show up directly in \(k \bmod m\).
Using the division method with \(m=11\), the hash values for the keys 56, 29, 90, 40, 82, 30, and 4 are computed as follows:
\(h(56) = 1\), \(h(29) = 7\), \(h(90) = 2\), \(h(40) = 7\), \(h(82) = 5\), \(h(30) = 8\), and \(h(4) = 4\).
Here, the keys 29 and 40 collide because they hash to the same slot: \(h(29)=h(40)=7\).
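The example can be checked in a few lines of Python:

```python
m = 11
for k in [56, 29, 90, 40, 82, 30, 4]:
    print(k, "->", k % m)
# 56 -> 1, 29 -> 7, 90 -> 2, 40 -> 7, 82 -> 5, 30 -> 8, 4 -> 4
# 29 and 40 collide in slot 7.
```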
What is the hash value of the key 100 when using the division method with the divisor \(m = 13\)?
The multiplication method applies the hash function
\[\begin{equation*} h(k) = \lfloor m (k A \bmod 1) \rfloor, \end{equation*}\]
where:

- \(m\) is the number of slots in the hash table,
- \(A\) is a real constant with \(0 < A < 1\), and
- \(kA \bmod 1 = kA - \lfloor kA \rfloor\) is the fractional part of \(kA\).
Consider a hash table of size \(m = 1000\) and a corresponding hash function \(h(k)= \lfloor m (kA \bmod 1) \rfloor\) for \(A = (\sqrt{5} - 1) / 2\). Compute the locations to which the keys 61, 62, 63, 64, and 65 are mapped.
The hash values are 700, 318, 936, 554 and 172, respectively.
Because \(m\) is not an integer power of 2, the multiplication cannot be computed with fast bit-shift operations. Therefore, this particular hash function is inefficient to evaluate.
Moreover, all hash values are even numbers, indicating that this hash function does not achieve uniform hashing. Thus, it is unsuitable for practical use.
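The following snippet reproduces the worked example and confirms that all five hash values come out even:

```python
import math

m = 1000
A = (math.sqrt(5) - 1) / 2            # golden-ratio conjugate, ~0.618

def h(k):
    frac = (k * A) % 1.0              # fractional part of k * A
    return math.floor(m * frac)

print([h(k) for k in range(61, 66)])  # [700, 318, 936, 554, 172]
```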
Definition
A hashing algorithm is called universal if it satisfies the following two conditions:

- The hash function is chosen randomly from a family of hash functions at the start of each program execution, independently of the keys that will be stored.
- For any two distinct keys \(k_1\) and \(k_2\), the probability that \(h(k_1) = h(k_2)\) is at most \(1/m\), where the probability is taken over the random choice of the hash function.
Universal hashing mitigates worst-case scenarios where many keys hash to the same slot during each execution of the program.
For examples, refer to Section 11.3.4 in Cormen et al. (2022).
Even the best hash functions cannot completely eliminate collisions between keys. There are two common approaches to resolving collisions in hash tables:

- chaining, where all keys that hash to the same slot are stored in a linked list attached to that slot, and
- open addressing, where colliding keys are stored in other, unoccupied slots of the table itself.
In the example depicted below, the keys \(k_4\) and \(k_5\) collide. Thus, they are stored together with their values in a linked list:
The following definition is useful for expressing the average running time of dictionary operations when using chaining:
Definition
The load factor \(\alpha\) is the ratio of the number of keys \(n\) stored in the hash table to the number of slots \(m\) in the table:
\[\begin{equation*} \alpha = \frac{n}{m} \end{equation*}\]
The load factor \(\alpha\) can be interpreted as the average length of a linked list associated with a randomly chosen slot.
\begin{algorithm}
\begin{algorithmic}
\Procedure{Chained-Hash-Search}{$T$, $k$}
\Return \textsc{List-Search}($T[h(k)]$, $k$)
\EndProcedure
\Procedure{Chained-Hash-Insert}{$T$, $x$}
\State \textsc{List-Prepend}($T[h(x.\mathit{key})]$, $x$)
\EndProcedure
\Procedure{Chained-Hash-Delete}{$T$, $x$}
\State \textsc{List-Delete}($T[h(x.\mathit{key})]$, $x$)
\EndProcedure
\end{algorithmic}
\end{algorithm}
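A sketch of a chained hash table in Python, reusing the division method with \(m = 11\) from the earlier example; Python lists stand in for the linked lists, so appending plays the role of List-Prepend:

```python
class ChainedHashTable:
    """Collision resolution by chaining."""
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def _h(self, k):
        return k % self.m         # division method as the hash function

    def search(self, k):
        for key in self.slots[self._h(k)]:
            if key == k:
                return key
        return None

    def insert(self, k):
        self.slots[self._h(k)].append(k)

    def delete(self, k):
        self.slots[self._h(k)].remove(k)

T = ChainedHashTable(11)
for k in [56, 29, 90, 40]:
    T.insert(k)
print(T.slots[7])   # [29, 40] -- the colliding keys share one chain
```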
| Operation | Average Running Time | Comments |
|---|---|---|
| Chained-Hash-Search | \(O(1 + \alpha)\) | Derived in Section 11.2 of Cormen et al. (2022). |
| Chained-Hash-Insert | \(O(1)\) | |
| Chained-Hash-Delete | \(O(1)\) | Assuming the list is doubly linked. |
While chaining resolves collisions by storing linked lists outside the hash table, open addressing stores all keys directly in the slots of the hash table. Each slot contains either a key or nil.
Unlike chaining, open addressing allows at most one key per slot, so the load factor \(\alpha\) can never exceed 1. If a user attempts to insert a key into a full hash table, an error is reported.
When searching for a key, the algorithm systematically examines table slots until it either finds the desired key or determines that the key is not in the table.
To perform insertion using open addressing, we probe the hash table until an empty slot is found for the key. The sequence of probes depends on the key being inserted.
To determine which slots to probe, the hash function is extended to include the probe number as a second input:
\[\begin{equation*} h: U \times \{0, 1, \ldots, m - 1\} \to \{0, 1, \ldots, m - 1\}. \end{equation*}\]
The probe sequence \(\langle h(k, 0), h(k, 1), \ldots, h(k, m - 1) \rangle\) must be a permutation of \(\langle 0, 1, \ldots, m - 1 \rangle\), ensuring that every hash-table position is eventually considered as a slot for a new key as the table fills up.
For simplicity, assume that no keys have been deleted from the hash table so far. The Hash-Insert-Without-Deleted procedure takes as input a hash table \(T\) and a key \(k\).
The procedure either returns the slot number where \(k\) is stored or flags an error if the hash table is already full:
\begin{algorithm}
\begin{algorithmic}
\Procedure{Hash-Insert-Without-Deleted}{$T$, $k$}
\State $i = 0$
\Repeat
\State $q = h(k, i)$
\If{$T[q] \texttt{==}$ \textsc{nil}}
\State $T[q] = k$
\Return $q$
\Else
\State $i = i + 1$
\EndIf
\Until{$i$ \texttt{==} $m$}
\State \textbf{error} ``hash table overflow''
\EndProcedure
\end{algorithmic}
\end{algorithm}
The algorithm for searching for a key \(k\) probes the same sequence of slots that was examined when key \(k\) was inserted.
\begin{algorithm}
\begin{algorithmic}
\Procedure{Hash-Search}{$T$, $k$}
\State $i = 0$
\Repeat
\State $q = h(k, i)$
\If{$T[q] \texttt{==} k$}
\Return $q$
\EndIf
\State $i = i + 1$
\Until{$T[q] \texttt{==} $\textsc{nil} or $i \texttt{==} m$}
\Return \textsc{nil}
\EndProcedure
\end{algorithmic}
\end{algorithm}
Deletion from an open-address hash table is challenging. When a key is deleted from slot \(q\), we cannot simply mark that slot as empty by storing nil in it. Doing so might prevent the retrieval of any key whose insertion involved probing slot \(q\) and finding it occupied.
We can solve this problem by marking the slot as deleted instead of nil.
In the figure below, the hash function is assumed to be \(h(k, i) = (k + i) \bmod 5\).
If slot 2 is marked as nil after deleting 32, Hash-Search\((T, 76)\) would return nil, incorrectly indicating that key 76 is not in the hash table.
However, if slot 2 is marked as deleted, Hash-Search\((T, 76)\) finds key 76 in slot 3.
If keys have been deleted from the hash table, the insert procedure must also check for slots marked deleted. Such slots can be reused for new keys:
\begin{algorithm}
\begin{algorithmic}
\Procedure{Hash-Insert}{$T$, $k$}
\State $i = 0$
\Repeat
\State $q = h(k, i)$
\If{$T[q] \texttt{==}$ \textsc{nil} or $T[q] \texttt{==}$ \textsc{deleted}}
\State $T[q] = k$
\Return $q$
\Else
\State $i = i + 1$
\EndIf
\Until{$i$ \texttt{==} $m$}
\State \textbf{error} ``hash table overflow''
\EndProcedure
\end{algorithmic}
\end{algorithm}
The only difference from Hash-Insert-Without-Deleted is the additional check for deleted in line 5.
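A Python sketch combining insertion, search, and deletion with a deleted marker; it uses the linear-probing hash \(h(k, i) = (k + i) \bmod 5\) from the figure's example. The key 11 is an assumption of mine, standing in for whatever key occupied slot 1 in the figure:

```python
NIL = None
DELETED = object()   # distinct sentinel marking deleted slots

class OpenAddressTable:
    """Open addressing with the probe sequence h(k, i) = (k + i) mod m."""
    def __init__(self, m):
        self.m = m
        self.slots = [NIL] * m

    def _h(self, k, i):
        return (k + i) % self.m

    def insert(self, k):
        """Mirror of Hash-Insert: reuse both nil and deleted slots."""
        for i in range(self.m):
            q = self._h(k, i)
            if self.slots[q] is NIL or self.slots[q] is DELETED:
                self.slots[q] = k
                return q
        raise RuntimeError("hash table overflow")

    def search(self, k):
        """Mirror of Hash-Search: stop at nil, but probe past deleted."""
        for i in range(self.m):
            q = self._h(k, i)
            if self.slots[q] == k:
                return q
            if self.slots[q] is NIL:
                return None
        return None

    def delete(self, k):
        """Mark the key's slot as deleted rather than nil."""
        q = self.search(k)
        if q is not None:
            self.slots[q] = DELETED

T = OpenAddressTable(5)
for k in [11, 32, 76]:   # hypothetical key 11 occupies slot 1
    T.insert(k)          # 11 -> slot 1, 32 -> slot 2, 76 -> slot 3
T.delete(32)             # slot 2 now holds DELETED, not NIL
print(T.search(76))      # 3 -- the probe passes over the deleted slot
```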
Hash functions used for open addressing should ideally perform uniform hashing, defined by two criteria:

- For each key, the probe sequence is equally likely to be any of the \(m!\) permutations of \(\langle 0, 1, \ldots, m - 1 \rangle\).
- The probe sequences of distinct keys are independent of one another.
Violating either criterion can cause clustering of keys in certain parts of the hash table, which in turn can degrade the performance of the hash table.
The implementation of true uniform hashing is challenging. Most hash functions that are used in practice do not generate all of the \(m!\) possible permutations.
In this lesson, we discuss two methods that guarantee that the probe sequence \(\langle h(k, 0), h(k, 1), \ldots, h(k, m - 1) \rangle\) is a permutation of \(\langle 0, 1, \ldots, m - 1\rangle\) for each key \(k\):

- linear probing and
- double hashing.
Given a hash function \(h': U \to \{0, 1, \ldots, m - 1\}\), which we refer to as an auxiliary hash function, the method of linear probing uses the hash function
\[\begin{equation*} h(k, i) = (h'(k) + i) \bmod m \end{equation*}\]
for \(i = 0, 1, \ldots, m - 1\).
In the example on the right, keys were inserted in the sequence \(\langle 42, 53, 14, 92, 27, 67 \rangle\), with \(h'(k) = k\) and \(m = 13\). The result exhibits clustering: long runs of occupied slots form, which increases the average search time.
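The clustering can be reproduced in Python; each printed list shows the slots probed while inserting the key, and the probe counts grow as the cluster of occupied slots 1 through 6 lengthens:

```python
m = 13

def probe_linear(table, k):
    """Insert k with h(k, i) = (k + i) mod m; return the probes used."""
    probes = []
    for i in range(m):
        q = (k + i) % m
        probes.append(q)
        if table[q] is None:
            table[q] = k
            return probes
    raise RuntimeError("hash table overflow")

table = [None] * m
for k in [42, 53, 14, 92, 27, 67]:
    print(k, "probes", probe_linear(table, k))
# 42 probes [3]
# 53 probes [1]
# 14 probes [1, 2]
# 92 probes [1, 2, 3, 4]
# 27 probes [1, 2, 3, 4, 5]
# 67 probes [2, 3, 4, 5, 6]
```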
Double hashing uses a hash function of the form
\[\begin{equation*} h(k, i) = (h_1(k) + i h_2(k)) \bmod m, \end{equation*}\]
where both \(h_1\) and \(h_2\) are auxiliary hash functions.
The value of \(h_2(k)\) must be relatively prime to \(m\) so that the entire hash table can be searched. This property can be established, for example, in either of the following two ways:

- choose \(m\) to be a power of 2 and design \(h_2\) so that it always produces an odd number, or
- choose \(m\) to be prime and design \(h_2\) so that it always returns a positive integer less than \(m\).
In the example on the right,
\[\begin{equation*} h(k, i) = (h_1(k) + i h_2(k)) \bmod m, \end{equation*}\]
with \(h_1(k) = k\) and \(h_2(k) = 1 + (k \bmod 12)\). As before, keys were inserted in the sequence \(\langle 42, 53, 14, 92, 27, 67 \rangle\). Note that we never had to probe any sequence beyond \(i = 1\), which provides evidence that double hashing is less prone to clustering than linear probing.
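The same experiment in Python confirms that no insertion probes beyond \(i = 1\):

```python
m = 13

def h2(k):
    return 1 + (k % 12)

def probe_double(table, k):
    """Insert k with h(k, i) = (k + i * h2(k)) mod m; return the probes."""
    probes = []
    for i in range(m):
        q = (k + i * h2(k)) % m
        probes.append(q)
        if table[q] is None:
            table[q] = k
            return probes
    raise RuntimeError("hash table overflow")

table = [None] * m
for k in [42, 53, 14, 92, 27, 67]:
    print(k, "probes", probe_double(table, k))
# 42 probes [3]
# 53 probes [1]
# 14 probes [1, 4]
# 92 probes [1, 10]
# 27 probes [1, 5]
# 67 probes [2]
```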
Although double hashing produces only \(m^2\) out of the \(m!\) possible probe sequences, its performance is practically as good as the ideal scheme of uniform hashing.
One can show that, assuming uniform hashing, the average time needed for searching an open-address hash table depends on the load factor \(\alpha = n / m\) as follows:

- An unsuccessful search performs at most \(\frac{1}{1 - \alpha}\) probes on average, i.e., it runs in \(O\left(\frac{1}{1 - \alpha}\right)\) time.
- A successful search performs at most \(\frac{1}{\alpha} \ln \frac{1}{1 - \alpha}\) probes on average, i.e., it runs in \(O\left(\frac{1}{\alpha} \log \frac{1}{1 - \alpha}\right)\) time.
Note that \(n\) includes the number of deleted keys because they still occupy slots in the hash table.
As the hash table fills up (i.e., as \(\alpha\) approaches 1 from below), both of the upper bounds \(O\left(\frac{1}{1 - \alpha}\right)\) and \(O\left(\frac{1}{\alpha} \log \frac{1}{1 - \alpha}\right)\) diverge. When \(\alpha\) would exceed 1, open addressing cannot be used for resolving collisions because there is insufficient space in the hash table.
However, if we can guarantee that \(\alpha\) never exceeds a constant less than 1, then searching takes only \(O(1)\) time on average.
While hash tables are efficient data structures for inserting, searching, and deleting keys, they are not suitable for all applications. For instance, they are not well-suited for finding the smallest or largest key in a set.
Next week, we will discuss binary search trees, which are data structures that enable fast retrieval of minimal and maximal keys while still allowing efficient search, insertion, and deletion of keys.