Title: Data Structure Chap10
Description: Chapter 10: Hashing



Chapter 10: Hashing

Objectives
Looking ahead – in this chapter, we’ll consider
• Hash Functions
• Collision Resolution
• Deletion
• Perfect Hash Functions
• Rehashing
• Hash Functions for Extendible Files

Data Structures and Algorithms in C++, Fourth Edition


Introduction
• In earlier chapters, the main process used by the searching
techniques was comparing keys
• In sequential search for instance, the table storing the
elements is searched in order, using key comparisons to
determine a match
• In binary searching, we successively divide the table into
halves to determine which cell to check, and again use key
comparison to determine a match
• In binary search trees, the direction to take in the tree is
determined by comparing keys in the nodes
• A different way to search can be based on calculating the
position of the key in the table, based on the key’s value


Introduction (continued)
• Since the value of the key is the only indication of position, if
the key is known, we can access the table directly
• This reduces the search time from O(n) or O(lg n) to a single
access, that is, O(1)
• No matter how many elements there are, the run time is the
same
• Unfortunately, this is just an ideal; in real applications we can
only approximate this
• The task is to develop a function, h, that can transform a key,
K, into an index for a table used to store items of the same
type as K
• The function h is called a hash function


Introduction (continued)
• If h is able to transform different key values into different hash
values, it is called a perfect hash function
• For the hash function to be perfect, the table must have as
many positions as there are items to be hashed
• However, it is not always possible to know how many
elements will be hashed in advance, so some estimating is
needed
• Consider a symbol table for a compiler, to store all the
variable names
• Given the nature of the variable names typically used, a table
with 1000 positions may be more than adequate
• However, even if we wanted to handle all possible variable
names, we still need to design an appropriate h

Introduction (continued)
• For example, we could define h to be the sum of the ASCII
values of the letters in the variable name
• If we restrict variable names to 31 letters, we will need 3782
positions, since a variable of 31 characters, all “z”, would
sum to 31 ∙ 122 (the ASCII code for “z”) = 3782
• Even then, the function will not produce unique values, for
h(“abc”) = 97 + 98 + 99 = 294, and h(“acb”) = 97 + 99 + 98 =
294
• This is called a collision, and is a measure of the usefulness of
a hash function
• Avoiding collisions can be achieved by making h more
complex, but complexity and speed must be balanced

Hash Functions
• The total possible number of hash functions for n items
assigned to m positions in a table (n < m) is m^n
• The number of perfect hash functions is equal to the number
of different placements of these items, and is m!/(m − n)!
• With 50 elements and a 100-position array, we would have a
total of 100^50 = 10^100 hash functions and about 10^94 perfect hash
functions (about 1 in a million)
• Most of the perfect hashes are impractical and cannot be
expressed in a simple formula
• However, even among those that can, a number of
possibilities exist


Hash Functions (continued)
• Division
– Hash functions must guarantee that the value they produce is a valid
index to the table
– A fairly easy way to ensure this is to use modular division, and divide
the keys by the size of the table, so h(K) = K mod TSize where TSize =
sizeof(table)
– This works best if the table size is a prime number, but if not, we can
use h(K) = (K mod p) mod TSize for a prime p > TSize
– However, nonprimes work well for the divisor provided they do not
have any prime factors less than 20
– The division method is frequently used when little is known about the
keys
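As a rough sketch of the division method in C++ (the table size 101 and the prime 103 below are arbitrary example values, not prescribed by the text):

```cpp
#include <cassert>

// Division method: map a numeric key into [0, TSize).
// TSize is best chosen prime; 101 here is an arbitrary example.
const long TSize = 101;

int hashDivision(long K) {
    return static_cast<int>(K % TSize);
}

// Variant for when TSize cannot be prime: first divide by a prime p > TSize.
int hashDivisionPrime(long K, long p) {
    return static_cast<int>((K % p) % TSize);
}
```

Both functions always yield a valid index, which is the property the slide emphasizes.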


Hash Functions (continued)
• Folding
– In folding, the keys are divided into parts which are then combined (or
“folded”) together and often transformed into the address
– Two types of folding are used, shift folding and boundary folding
– In shift folding, the parts are placed underneath each other and then
processed (for example, by adding)
– Using a Social Security number, say 123-45-6789, we can divide it into
three parts - 123, 456, and 789 – and add them to get 1368
– This can then be divided modulo TSize to get the address
– With boundary folding, the key is visualized as being written on a piece
of paper and folded on the boundaries between the parts
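Both folding variants can be sketched as follows; the function names and the choice of 3-digit parts are illustrative assumptions:

```cpp
#include <vector>

// Shift folding: split the key into fixed-size decimal parts and add them.
// For the SSN 123-45-6789 with 3-digit parts: 123 + 456 + 789 = 1368.
long shiftFold(long key, int partDigits) {
    long divisor = 1;
    for (int i = 0; i < partDigits; ++i) divisor *= 10;
    long sum = 0;
    while (key > 0) {
        sum += key % divisor;   // take the low-order part
        key /= divisor;
    }
    return sum;
}

// Helper: reverse the decimal digits of one part.
long reverseDigits(long part) {
    long r = 0;
    while (part > 0) { r = r * 10 + part % 10; part /= 10; }
    return r;
}

// Boundary folding: as above, but alternating parts are reversed, as if
// the key were written on paper and folded on the part boundaries.
// For 123-456-789 this gives 123 + 654 + 789 = 1566.
long boundaryFold(long key, int partDigits) {
    long divisor = 1;
    for (int i = 0; i < partDigits; ++i) divisor *= 10;
    std::vector<long> parts;
    while (key > 0) { parts.push_back(key % divisor); key /= divisor; }
    long sum = 0;
    // parts[] holds the rightmost part first; reverse every other part
    // counting from the left, so the middle part of 123-456-789 becomes 654
    for (int i = static_cast<int>(parts.size()) - 1, j = 0; i >= 0; --i, ++j)
        sum += (j % 2 == 0) ? parts[i] : reverseDigits(parts[i]);
    return sum;
}
```

Either sum would then be taken modulo TSize, as the slide describes.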


Hash Functions (continued)
• Folding (continued)
– The result is that alternating parts of the key are reversed, so the
Social Security number part would be 123, 654, 789, totaling 1566
– As can be seen, in both versions, the key is divided into even length
parts of some fixed size, plus any leftover digits
– Then these are added together and the result is divided modulo the
table size
– Consequently this is very fast and efficient, especially if bit strings are
used instead of numbers
– With character strings, one approach is to exclusive-or the individual
characters together and use the result
– In this way, h(“abcd”) = “a” ⊕ “b” ⊕ “c” ⊕ “d”


Hash Functions (continued)
• Folding (continued)
– However, this is limited, because it will only generate values between
0 and 127
– A better approach is to use chunks of characters, where each chunk
has as many characters as bytes in an integer
– On the IBM PC, integers are often 2 bytes long, so h(“abcd”) = “ab” ⊕
“cd”, which would then be divided modulo TSize
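A sketch of both XOR-folding variants for strings; the 2-byte chunk size and the helper names are assumptions for illustration:

```cpp
#include <cstdint>
#include <string>

// XOR-folding one character at a time: limited, since the result stays
// in the range of a single character's code (0..127 for ASCII).
int xorFoldChars(const std::string& s) {
    int h = 0;
    for (char c : s) h ^= static_cast<unsigned char>(c);
    return h;
}

// XOR-folding in chunks of as many characters as bytes in an integer;
// 2-byte chunks here, matching h("abcd") = "ab" xor "cd".
// The result is then taken modulo the table size.
int xorFoldChunks(const std::string& s, int tSize) {
    uint32_t h = 0;
    for (std::size_t i = 0; i < s.size(); i += 2) {
        uint32_t chunk = static_cast<unsigned char>(s[i]);
        if (i + 1 < s.size())
            chunk = (chunk << 8) | static_cast<unsigned char>(s[i + 1]);
        h ^= chunk;
    }
    return static_cast<int>(h % tSize);
}
```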


Hash Functions (continued)
• Mid-Square Function
– In the mid-square approach, the numeric value of the key is squared
and the middle part is extracted to serve as the address
– If the key is non-numeric, some type of preprocessing needs to be
done to create a numeric value, such as folding
– Since the entire key participates in generating the address, there is a
better chance of generating different addresses for different keys
– So if the key is 3121, 3121² = 9,740,641, and if the table has 1000
locations, h(3121) = 406, which is the middle part of 3121²
– In application, powers of two are more efficient for the table size and
the middle of the bit string of the square of the key is used
– Assuming a table size of 1024, 3121² is represented by the bit string
1001010 0101000010 1100001, whose middle 10 bits, 0101000010 = 322, give the key
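The bit extraction above can be sketched as follows, assuming the square fits in the stated number of bits; the function name and parameters are illustrative:

```cpp
#include <cstdint>

// Mid-square hashing: square the key and extract the middle bits.
// For a table of size 2^b, take b bits from the middle of the square.
// With key 3121 and TSize = 1024 (b = 10), 3121^2 = 9,740,641 is a
// 24-bit number whose middle 10 bits are 0101000010 = 322.
int midSquare(uint64_t key, int b, int totalBits) {
    uint64_t sq = key * key;
    int shift = (totalBits - b) / 2;             // drop the low-order bits
    return static_cast<int>((sq >> shift) & ((1u << b) - 1));
}
```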


Hash Functions (continued)
• Extraction
– In the extraction approach, the address is derived by using a portion of
the key
– Using the SSN 123-45-6789, we could use the first four digits, 1234,
the last four 6789, or the first two combined with the last two 1289
– Other combinations are also possible, but each time only a portion of
the key is used
– With careful choice of digits, this may be sufficient for address
generation
– For example, some universities give international students ID numbers
beginning with 999; ISBNs start with digits representing the publisher
– So these could be excluded from the address generation if the nature
of the data is appropriately limited
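The three extractions mentioned for the SSN can be sketched as follows (the function names are illustrative):

```cpp
#include <string>

// Extraction: use only a portion of the key's digits as the address.
// For the SSN "123456789": the first four digits, the last four,
// or the first two combined with the last two.
int firstFour(const std::string& key) {
    return std::stoi(key.substr(0, 4));
}
int lastFour(const std::string& key) {
    return std::stoi(key.substr(key.size() - 4));
}
int firstTwoLastTwo(const std::string& key) {
    return std::stoi(key.substr(0, 2) + key.substr(key.size() - 2));
}
```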


Hash Functions (continued)
• Radix Transformation
– With radix transformation, the key is transformed into a different base
– For instance, if K is 345 in decimal, its value in base 9 is 423
– This result is then divided modulo TSize, and the resulting value
becomes the address to which K is hashed
– The drawback to this approach is collisions cannot be avoided
– For example, if TSize is 100, then although 345 and 245 in decimal will
not collide, 345 and 264 will because 264 is 323 in base nine
– Since 345 is 423, these two values will collide when divided modulo
100
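A sketch of the base-9 transformation used in the example; the helper name is illustrative:

```cpp
// Radix transformation: read the key's base-9 digits as a decimal
// number, then take the result modulo TSize.
// 345 (decimal) is 423 in base 9; 264 (decimal) is 323 in base 9.
long toBase9AsDecimal(long k) {
    long result = 0, place = 1;
    while (k > 0) {
        result += (k % 9) * place;   // next base-9 digit
        place *= 10;
        k /= 9;
    }
    return result;
}

int radixHash(long k, int tSize) {
    return static_cast<int>(toBase9AsDecimal(k) % tSize);
}
```

With tSize = 100 this reproduces the collision from the slide: 423 mod 100 = 323 mod 100 = 23.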


Hash Functions (continued)
• Universal Hash Functions
– When little is known about the keys, a universal class of hash
functions can be used
– Functions are universal when a randomly chosen member of the class
will be expected to distribute a sample evenly, guaranteeing low
collisions
– This idea was first considered by Larry Carter and Mark Wegman in
1979
– We define H as a class of functions from a set of keys to a hash table of
size TSize
– H is called universal if, for any pair of distinct keys, the probability
that a function chosen at random from H maps them to the same
position in the table is at most 1 / TSize
– This basically means there is one chance in TSize that two distinct
keys collide when hashed with a randomly chosen function

Hash Functions (continued)
• Universal Hash Functions (continued)
– One example of such a class of functions is defined for a prime
number p > |keys| and random numbers a and b
H = {ha,b(K): ha,b(K) = ((aK+b) mod p) mod TSize and 0 ≤ a, b < p}
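One member of this class can be sketched as follows; the prime p = 104729 and the table size used in testing are arbitrary example values:

```cpp
// One member of the universal class:
// h_{a,b}(K) = ((aK + b) mod p) mod TSize,
// for a prime p larger than any key and 0 <= a, b < p.
const long P = 104729;   // an example prime > |keys|

long universalHash(long K, long a, long b, long tSize) {
    return ((a * K + b) % P) % tSize;
}

// A random member of the class would be chosen by drawing a and b
// at random from [0, P) before any hashing is done.
```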
– Another class of functions is for keys considered as sequences of
bytes, K = K0K1 … Kr-1
– For a prime p > 28 = 256 and a sequence a = a0 , a1 , …, ar-1 ,





 r 1 a K mod p modTSize and 0  a , a , , a  p
H  ha  K  : ha  K     i i 

0
1
r 1
 i 0




Collision Resolution
• The hashing we’ve looked at so far does have problems with
multiple keys hashing to the same location in the table
• For example, consider a function that places names in a table
based on hashing the ASCII code of the first letter of the name
• Using this function, all names beginning with the same letter
would hash to the same position
• If we attempt to improve the function by hashing the first two
letters, we achieve better results, but still have problems
• In fact, even if we used all the letters in the name, there is still
a possibility of collisions
• Also, while using all the letters of the name gives a better
distribution, if the table only has 26 positions there is no
improvement in using the other versions

Collision Resolution (continued)
• So in addition to using more efficient functions, we also need
to consider the size of the table being hashed into
• Even then, we cannot guarantee to eliminate collisions; we
have to consider approaches that assure a solution
• A number of methods have been developed; we will consider
a few in the following slides


Collision Resolution (continued)
• Open Addressing
– In open addressing, collisions are resolved by finding an available table
position other than the one to which the key hashed
– If the position h(K) is already occupied, positions are tried in the
probing sequence
norm(h(K) + p(1)), norm(h(K) + p(2)), …

until an open location is found, the same positions are tried again, or
the table is full
– The function p is called the probing function, i is the probe number, and
norm is a normalization function, often division modulo the table size
– The simplest realization of this is linear probing, where the search
proceeds sequentially from the point of the collision
– If the end of the table is reached before an empty cell is found, the
search continues from the beginning of the table
– It stops if it reaches the cell immediately preceding the one that
caused the collision
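A minimal sketch of open addressing with linear probing, where p(i) = i and norm is division modulo the table size; the string hash and the empty-string free-cell sentinel are simplifying assumptions:

```cpp
#include <string>
#include <vector>

// Open addressing with linear probing. An empty string marks a free cell.
struct LinearProbeTable {
    std::vector<std::string> cells;
    explicit LinearProbeTable(int n) : cells(n) {}

    int hash(const std::string& key) const {
        int h = 0;
        for (char c : key) h = (h * 31 + c) % cells.size();
        return h;
    }

    bool insert(const std::string& key) {
        int home = hash(key);
        for (std::size_t i = 0; i < cells.size(); ++i) {   // try each cell once
            std::size_t pos = (home + i) % cells.size();   // wrap past the end
            if (cells[pos].empty()) { cells[pos] = key; return true; }
        }
        return false;                                      // table full
    }

    bool search(const std::string& key) const {
        int home = hash(key);
        for (std::size_t i = 0; i < cells.size(); ++i) {
            std::size_t pos = (home + i) % cells.size();
            if (cells[pos].empty()) return false;          // end of the cluster
            if (cells[pos] == key) return true;
        }
        return false;
    }
};
```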

Collision Resolution (continued)
• Open Addressing (continued)
– The drawback to linear probing is that clusters of displaced keys tend
to form
– This is illustrated in Figure 10.1a, where three keys have been hashed
to their locations
– In Figure 10.1b, when B9 arrives, it has to be placed from the beginning
of the table, and finally, when C2 shows up, it is placed five locations
away from its home address


Collision Resolution (continued)

Fig. 10.1 Resolving collisions with the linear probing method
...
, TSize – 1
– Expressed as a sequence of probes, this is
h(K) + i², h(K) – i² for i = 1, 2, …, up to h(K) + (TSize – 1)²/4 and
h(K) – (TSize – 1)²/4
– Each of these values is divided modulo TSize
– Because, depending on the value of h(K), the probes may try only the
even or only the odd positions in the table, the size of the table
should not be an even number
– The ideal value for the table size is a prime of the form 4j + 3, where j is
an integer
– This will guarantee that all the table locations will be checked in the
probing process
– Applying this to the example gives Figure 10.2; B2 still takes two
probes, but C2 only takes four
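The probe sequence above can be generated as a list; this is a sketch, with the alternating + and – steps within each i following the slide:

```cpp
#include <vector>

// Quadratic probing sequence h(K), h(K)+1^2, h(K)-1^2, h(K)+2^2, ...,
// each taken modulo TSize. When TSize is a prime of the form 4j + 3,
// the sequence visits every table position.
std::vector<int> quadraticProbes(int home, int tSize) {
    std::vector<int> probes{home % tSize};
    for (int i = 1; i <= (tSize - 1) / 2; ++i) {
        probes.push_back((home + i * i) % tSize);
        // add tSize before the final mod so the subtracted term stays nonnegative
        probes.push_back(((home - i * i) % tSize + tSize) % tSize);
    }
    return probes;
}
```

For TSize = 7 (a prime of the form 4j + 3, with j = 1) and home = 0, the probes cover all seven positions.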


Collision Resolution (continued)

Fig. 10.2 Using quadratic probing for collision resolution


Collision Resolution (continued)
• Open Addressing (continued)
– Notice that though we obtain better results with this approach, we
don’t avoid clustering entirely
– This is because the same probe sequence is used for any collision,
creating secondary clusters
– These are less of a problem than primary clusters, however
– Another variation is to have p be a random number generator (RNG)
– This eliminates the need to have special conditions on the table size,
and does prevent secondary cluster formation
– However, it does have an issue with repeating the same probing
sequence for the same keys


Collision Resolution (continued)
• Open Addressing (continued)
– If the RNG is initialized with the first invocation, different probing
sequences will occur for the same key K
– As a result, the key is hashed more than once to the table, and there is
a chance it may not be found when searched for
– So the RNG needs to be initialized to the same seed for the same key
before the probing sequence is generated
– In C++ this can be done using the srand() function with a parameter
that is dependent on the key
– An RNG could be written that generates values between 1 and TSize -1
– The following code by Robert Morris works for tables with TSize = 2n
for integer n
int generateNumber() {
    static int r = 1;
    r = 5 * r;
    r = r & ((1 << (n + 2)) - 1);  // mask out all but the n + 2 low-order bits, where TSize = 2^n
    return r / 4;
}


Collision Resolution (continued)
• Open Addressing (continued)
– The best approach to secondary clustering is through the technique of
double hashing
– This utilizes two hashing functions, one for the primary hash and the
other to resolve collisions
– In this way the probing sequence becomes
h(K), h(K) + hp(K), h(K) + 2·hp(K), …

– Here, h is the primary hashing function and hp is the secondary hash
– The table size should be a prime number so every location is included
in the sequence, since the values above are divided modulo TSize
– Empirical evidence shows that this approach works well to eliminate
secondary clustering, since the probe sequence is based on hp
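A sketch of the double-hashing probe sequence; both h and hp below are simple illustrative choices, not functions prescribed by the text:

```cpp
#include <vector>

// Double hashing: the primary function h places the key, and a secondary
// function hp determines the step between successive probes, so keys that
// collide at the same cell follow different probe sequences.
int h(long K, int tSize)  { return static_cast<int>(K % tSize); }
int hp(long K, int tSize) { return 1 + static_cast<int>(K % (tSize - 2)); }  // never 0

std::vector<int> doubleHashProbes(long K, int tSize, int count) {
    std::vector<int> probes;
    for (int i = 0; i < count; ++i)
        probes.push_back((h(K, tSize) + i * hp(K, tSize)) % tSize);
    return probes;
}
```

With tSize = 7, keys 10 and 17 both have home position 3 but probe in different steps (1 and 3 respectively), which is exactly what breaks secondary clusters.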


Collision Resolution (continued)
• Open Addressing (continued)
– This is because the probing sequence for key K1 hashed to location j is
j, j + hp(K1), j + 2 · hp(K1), …; if K2 hashes to 2j + 1, the
sequence is 2j + 1, 4j + 3, 10j + 11, …

Fig. 10.3 Formulas approximating, for different hashing methods, the average numbers of trials for successful and
unsuccessful searches (Knuth 1998)


Collision Resolution (continued)
• Chaining
– In chaining, colliding keys are placed on the same linked list, as
shown in Figure 10.5
– This is very fast for short lists, but as they increase in size,
performance can degrade sharply
– Gains in performance can be made if the lists are ordered so
unsuccessful searches don’t traverse the entire list, or by using
self-organizing linked lists
– This approach requires additional space for the pointers, so if there
are a large number of keys involved, space requirements can be high

Collision Resolution (continued)
• Chaining (continued)

Fig. 10.5 In chaining, colliding keys are put on the same linked list

– Figure 10.6 shows an example of coalesced hashing where a key ends
up in the last position in the table

Collision Resolution (continued)
• Chaining (continued)

Fig. 10.6 Coalesced hashing puts a colliding key in the last available position of the table

– There is no collision in Figure 10...
– ...6b shows this; Figure 10.7 shows the use of a cellar

Fig. 10.7 Coalesced hashing that uses a cellar


Collision Resolution (continued)
• Chaining (continued)
– Keys that hash without colliding are stored in their home locations, as
can be seen in Figure 10.7b
– If the cellar is filled, available cells are used in the table, as seen in
Figure 10.8


Collision Resolution (continued)
• Bucket Addressing (continued)

Fig. 10.8 Collision resolution with buckets and linear probing method

– Collisions can be stored in an overflow area, in which case the bucket
includes a field to indicate if a search should consider that area or not
– If used with chaining, the field could indicate the beginning of the list
in the overflow area associated with the bucket, shown in Figure 10...
• ...Figure 10.10a, which stores keys using linear probing
• In Figure 10.10c, A2 is deleted, causing searches for B1 to stop at
position 2

Deletion (continued)

Fig. 10.10 Linear search in the situation where both insertion and deletion of keys are permitted

– A solution to this is to leave the deleted keys in the table with some
type of indicator that the keys are not valid
– This way, searches for elements won’t terminate prematurely
– When new keys are inserted, they can overwrite the marked keys

Deletion (continued)
• The drawback to this is that if the table has far more
deletions than insertions, it will become overloaded with
deleted records, slowing down search times, because open
addressing requires testing each marked element
• So the table needs to be purged periodically by moving
undeleted elements to cells occupied by deleted elements
• Those cells containing deleted elements not overwritten can
then be marked as available
• This is shown in Figure 10...
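The marking scheme can be sketched like this; the three-state cell and linear probing are illustrative assumptions:

```cpp
#include <string>
#include <vector>

// Lazy deletion for open addressing: deleted cells are marked rather
// than emptied, so searches probe past them and insertions can reuse them.
enum class CellState { Empty, Occupied, Deleted };

struct Cell {
    std::string key;
    CellState state = CellState::Empty;
};

// Search with linear probing in the presence of deletion markers.
bool searchLazy(const std::vector<Cell>& table, const std::string& key, int home) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        const Cell& c = table[(home + i) % table.size()];
        if (c.state == CellState::Empty) return false;    // true end of cluster
        if (c.state == CellState::Occupied && c.key == key) return true;
        // a Deleted cell is probed past, not treated as the end
    }
    return false;
}
```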
Perfect Hash Functions
• Cichelli’s Method
– Cichelli developed one technique for constructing minimal
perfect hash functions in 1980
– It is used to hash a fairly small number of reserved words, and has the
form
h(word) = (length(word) + g(firstletter(word)) + g(lastletter(word))) mod TSize
– In the expression, g is the function to be constructed; it assigns values
to letters so that the function h returns unique hash values
– The values that g assigns to particular letters do not have to be unique
– There are three parts to the algorithm: computation of letter
occurrences, ordering words, and searching
– The last step is critical, and uses an auxiliary function, try()
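Given a g table, h itself is a one-liner; the g values used below are placeholders for illustration, not values Cichelli's search step would actually produce:

```cpp
#include <map>
#include <string>

// Cichelli's hash form:
// h(word) = (length(word) + g(first letter) + g(last letter)) mod TSize.
// g is the letter-to-value table that the search step constructs.
int cichelliHash(const std::string& word,
                 const std::map<char, int>& g, int tSize) {
    return (static_cast<int>(word.size())
            + g.at(word.front())
            + g.at(word.back())) % tSize;
}
```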


Perfect Hash Functions (continued)
• Cichelli’s Method (continued)
– The pseudocode for the algorithm is shown on pages 563 and 564
– It is used to construct the hash function for the names of the nine
Muses: Calliope, Clio, Erato, Euterpe, Melpomene, Polyhymnia,
Terpsichore, Thalia, and Urania
– Counting the first and last letters of each name gives the distribution A
(3), C (2), E (6), M (1), O (2), P (1), T (2), and U (1)
– Based on this, we can order the names as follows: Euterpe, Calliope,
Erato, Terpsichore, Melpomene, Thalia, Clio, Polyhymnia, and Urania
– Following this, the search() routine is applied, as shown in Figure 10...
...

– The FHCD algorithm (named for its developers, Edward Fox, Lenwood
Heath, Qi Chen, and Amjad Daoud) is an extension of Sager’s work,
and was proposed in 1992
– It searches for minimal perfect hash functions of the form:
h(word) = h0(word) + g(h1(word)) + g(h2(word))
– The result is divided modulo TSize, and g is the function determined by
the algorithm
– Three tables of random numbers, T0, T1, and T2 are defined, one for
each corresponding function hi
– Each word in the formula equals a string of characters c1c2 … cm
corresponding to a triple (h0(word), h1(word), h2(word))


Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
– The elements of the triple are calculated using the formulas
h0 = (T0(c1) + · · · + T0(cm)) mod n
h1 = (T1(c1) + · · · + T1(cm)) mod r
h2 = ((T2(c1) + · · · + T2(cm)) mod r) + r
– In these formulas
• n is the number of all the words in the body of data
• r is a parameter typically < n/2
• Ti(cj) is the number generated in table Ti for character cj
– The function g is then found in three steps: mapping, ordering and
searching
– In mapping, n triples of the form (h0(word), h1(word), h2(word)) are
formed


Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
– The randomness of the hi functions typically guarantees the
uniqueness of the triples; if they are not unique, new Ti tables are created
– A dependency graph is then constructed; this is a bipartite graph
– Half the vertices corresponding to values of h1 are labeled 0 to r – 1;
the other half correspond to h2 and are labeled r to 2r – 1
– The edge of the graph between vertices h1(word) and h2(word)
corresponds to a word
– To illustrate this, the names of the nine Muses are again used in
Figure 10.12a


Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)

Fig. 10.12 Applying the FHCD algorithm to the names of the nine Muses

Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
– The corresponding dependency graph (with r = 3) is shown in Figure 10...
– ...a sequence v1, …, vt of vertices is
established; with each vi is associated the set K(vi) of all the edges
that connect vi with those vjs for which j < i
– The sequence is started with the vertex of maximum degree
– Then, for each position i of the series, a vertex vi is chosen from the
vertices having maximal degree which have at least one connection to
the vertices v1, …, vi−1
– ...Figure 10.12c
– Finally, in the searching step, the keys are assigned hash values level
by level
– The first vertex has its g value chosen randomly from among the
numbers 0, …, n – 1
– For the other vertices, we have the relation: if vi < r, then vi = h1(word) for the words in level K(vi)
– So each word in level K(vi) has the same value g(h1(word)) = g(vi)


Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
– In addition, g(h2(word)) is already defined because it equals some vj
that has been processed
– By analogy, we can apply this to the case of vi ≥ r and vi = h2, so for
each word either g(h1(word)) or g(h2(word)) is known
– For each level, the second g value is found randomly so the values
obtained from the formula for h indicate available table positions
– Since the first choice of random number will not always fit all the
words on a particular level to the table, both numbers may be needed
– In considering the hashing of the names of the nine Muses, the search
step starts by randomly choosing g(v1)
– For v1 = 2, we let g(2) = 2; the next vertex is v2= 5, so K(v2) = {Erato}


Perfect Hash Functions (continued)
• The FHCD Algorithm (continued)
– From Figure 10.12d; Figure 10...



Rehashing (continued)
• All the methods we’ve looked at can use rehashing, as they
then continue using the processes of hashing and collision
resolution that were in place before rehashing
• One method in particular for which rehashing is important is
cuckoo hashing, first described by Rasmus Pagh and
Flemming Rodler in 2001


Rehashing (continued)
• Cuckoo hashing
– Two tables, T1 and T2, and two hash functions, h1 and h2, are used in
cuckoo hashing
– Inserting a key K1 into table T1 uses hash function h1, and if T1[h1(K1)] is
open, the key is inserted there
– If the location is occupied by a key, say K2, this key is removed to allow
K1 to be placed, and K2 is placed in the second table at T2[h2(K2)]
– If this location is occupied by key K3, it is moved to make room for K2,
and an attempt is made to place K3 at T1[h1(K3)]
– So the key that is being placed in or moved to the table has priority
over a key that is already there
– There is a possibility that a sequence like this could lead to an infinite
loop if it ends up back at the first position that was tried
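The displacement chain can be sketched as follows; the two hash functions are toy choices, and maxLoop is an assumed bound on the chain length:

```cpp
#include <optional>
#include <utility>
#include <vector>

// A sketch of cuckoo hashing: an incoming key always takes its slot in
// T1; a displaced key moves to its slot in T2, possibly displacing
// another key back into T1, and so on.
struct CuckooTables {
    std::vector<std::optional<int>> T1, T2;
    explicit CuckooTables(int n) : T1(n), T2(n) {}

    // Toy hash functions for illustration only.
    int h1(int k) const { return k % static_cast<int>(T1.size()); }
    int h2(int k) const {
        return (k / static_cast<int>(T1.size())) % static_cast<int>(T2.size());
    }

    // Returns false when the displacement chain exceeds maxLoop,
    // which is the caller's cue to rehash with larger tables.
    bool insert(int key, int maxLoop = 16) {
        for (int i = 0; i < maxLoop; ++i) {
            auto& c1 = T1[h1(key)];
            if (!c1) { c1 = key; return true; }
            std::swap(key, *c1);              // evict the occupant of T1
            auto& c2 = T2[h2(key)];
            if (!c2) { c2 = key; return true; }
            std::swap(key, *c2);              // evict from T2, back to T1
        }
        return false;                          // rehashing needed
    }

    bool contains(int key) const {
        return (T1[h1(key)] && *T1[h1(key)] == key) ||
               (T2[h2(key)] && *T2[h2(key)] == key);
    }
};
```

Inserting 1 and then 6 into tables of size 5 collides in T1 (both hash to slot 1), so 1 is evicted to T2, mirroring the eviction chain the slide describes.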


Rehashing (continued)
• Cuckoo hashing (continued)
– It is also possible the search will fail because both tables are full
– To circumvent this, a limit is set on tries which if exceeded causes
rehashing to take place with two new, larger tables and new hash
functions
– Then the keys from the old tables are rehashed to the new ones
– If during this the limit on number of tries is exceeded, rehashing is
performed again with yet larger tables and new hash functions
– The pseudocode for this process is shown on the next slide
– Execution of this algorithm is illustrated in Figure 10.13
– ...Figure 10.13a
– When key 7 is inserted, h1(7) = h1(2), so 2 is moved out to make room
for 7 in T1, and 2 is placed in T2[h2(2)], shown in Figure 10.13c-d
– This process continues in Figure 10.13e-h, so the potential loop is
broken by exceeding the value of maxloop, and rehashing occurs
– Figure 10...

Hash Functions for Extendible Files
– Expandable hashing, developed by Knott in 1971
– Dynamic hashing, developed by Per-Åke Larson in 1978
– Extendible hashing, developed by Ronald Fagin and others in 1979

• All of these distribute keys among buckets in similar ways
• The structure of the directory or index is the main difference

Hash Functions for Extendible Files
(continued)
• A binary tree is used as the bucket index in expandable and
dynamic hashing; extendible hashing uses a table
• Virtual hashing, developed in 1978, is a directoryless
technique, defined as “any hashing which may dynamically
change its hashing function” by its developer, Witold Litwin
• One example of this technique is linear hashing, developed by
Litwin in 1980


Hash Functions for Extendible Files
(continued)
• Extendible Hashing
– Consider a dynamically changing file with fixed-size buckets to which a
hashing method has been applied
– The data stored in the buckets is indirectly accessed through an index
that is dynamically adjusted to reflect file changes
– The index, which is an expandable table, is the characteristic feature of
extendible hashing
– When certain keys are hashed, they indicate a position in the index,
not the file
– The values the hash function returns are called pseudokeys
– This way, the file doesn’t need reorganization as data are added or
deleted, because the changes are recorded in the index


Hash Functions for Extendible Files
(continued)
• Extendible Hashing (continued)
– Only one hash function can be used, but depending on the size of the
table, only a portion of the address h(K) is used
– An easy way to do this is to look at the address as a string of bits from
which only the leftmost i bits can be used
– This number i is called the depth of the directory; a directory of depth
two is shown in Figure 10.14, which shows the h values in the buckets,
representing the keys actually stored in the buckets

Hash Functions for Extendible Files
(continued)

Fig. 10.14 An example of extendible hashing

Hash Functions for Extendible Files
(continued)
• Extendible Hashing (continued)
– There is a local depth associated with each bucket that indicates the
number of leftmost bits in h(K); these bits are the same for all keys in
the bucket
– These local depths are shown at the top of each bucket in Figure 10.14b;
when a key with h value 11001 arrives, the first two bits send it to the
fourth directory position
– From there it goes to bucket b1, where keys whose h-values start with
1 are stored
– This causes an overflow, splitting bucket b1 into b10, which is the new
name for b1, and b11; their local depths are set to two
– Bucket b11 is now pointed at by the pointer in position 11, and b1’s keys
are split between b10 and b11
– Things are more complicated if a bucket whose local depth is equal to
the depth of the directory overflows
– Consider what happens when a key with h-value 0001 arrives at the
table in Figure 10...
– ...Figure 10.15 illustrates a series of initial splits where TSize = 3
– To begin with, the pointer split is 0


Hash Functions for Extendible Files
(continued)
• Linear hashing (continued)

Fig. 10.15 Splitting buckets in the linear hashing technique

– If the loading factor exceeds a threshold, a new bucket is made, keys
from bucket zero are distributed between buckets zero and three, and
split is incremented

Hash Functions for Extendible Files
(continued)
• Linear hashing (continued)
– The question is, how to perform this distribution?
– With only one hash function, keys from bucket zero are hashed to
bucket zero before and after splitting
– So one function isn’t adequate to perform this
– Linear hashing maintains two hash functions at each level of splitting,
h_level and h_level+1, with h_level(K) = K mod (TSize ∙ 2^level)
– The first function hashes keys to buckets on the current level that have
not yet been split; the second hashes keys to already split buckets
– The algorithm for linear hashing is shown on pages 574 and 575
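The two level functions, and the rule for choosing between them, can be sketched as follows; the comparison against the split pointer is a simplified reading of the algorithm, not its full pseudocode:

```cpp
// The pair of hash functions linear hashing maintains at each level:
// h_level(K) = K mod (TSize * 2^level).
long hLevel(long K, long tSize, int level) {
    return K % (tSize * (1L << level));
}

// A key belongs to an already split bucket when h_level(K) falls below
// the split pointer; in that case h_{level+1} gives its address.
long address(long K, long tSize, int level, long split) {
    long a = hLevel(K, tSize, level);
    return (a < split) ? hLevel(K, tSize, level + 1) : a;
}
```

With TSize = 3, level 0, and split = 1, key 9 lands in bucket 3 while key 6 stays in bucket 0, matching the slide's description of keys from bucket zero being distributed between buckets zero and three.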


Hash Functions for Extendible Files
(continued)
• Linear hashing (continued)
– It may not be clear when a bucket should be split; as the algorithm
assumes, a threshold value of the loading factor is used to decide
– This threshold has to be determined in advance, usually by the
designer of the program
– To illustrate, assume keys can be hashed to buckets in a file, and if a
bucket is full, overflowing keys are put on a linked list in an overflow
area
– This situation is shown in Figure 10.16
– ...the loading factor of Figure 10.16a is 75%, but when key 10 arrives it is
hashed to position 1 and loading increases to 83%
– The first bucket splits and keys are distributed using function h1,
shown in Figure 10.16c
– The loading factor rises to 87%, so another split, this time of the second
bucket, gives rise to the configuration in Figure 10.16e

Hash Functions for Extendible Files
(continued)
• Linear hashing (continued)
– In this case, split reached the last allowed value on this level, so it is
assigned the value zero, and h1 is retained for use in further hashing
– In addition, a new function, h2, is defined as h2(K) = K mod (4 ∙ TSize)
– These steps are presented in a table on the next slide
– Note that since the order of splitting is predetermined, some overflow
area is needed with linear hashing
– If files are used, this may lead to more than one file access
– The area can be distinct from buckets, but can also work in the fashion
of coalesced hashing by using empty space in buckets, as suggested by
James Mullin in 1981
– Overflow areas are not necessary in a directory scheme, but can be
used

Hash Functions for Extendible Files
(continued)

Table 10-1 Steps in carrying out the hashing and splitting of Figure 10.16