Saturday, February 19, 2011

C# :- GetHashCode() : HashTable and Dictionary

In C# when coming to data structures  HashTable and Dictionary provide us with the ability to search for the element in them based on the data rather that the index .  Arrays are stored contiguous in memory and to access the item in array there is constant asymptotic runtime  O(1)  , because we only have to know the index of the value to be retrieved. Searching through an unsorted  array has linear runtime of   O(n) and through sorted array of O(log n).  Array are useful when we have to directly access the value based on ordinal value.  But when dealing with complex data we do rarely use the index based  approach for accessing and searching items.

We are more concerned on accessing and searching items based upon real information in a data set ,  that can be anything from enrolment no’s  , credit card  no’s or complex alphanumeric sequences.  Here comes the HashTable and Dictionary , these data structure allows us to use our complex data information for indexing . To do so these structure uses hashing function to compress our complex  indexer to a reasonable space limit  and  optimal  asymptotic runtime.

Hashing function simply compresses the indexer we are using , so that a optimal set of indexer can be created  from original indexer.  If we would have to use our original complex indexer then we would have to accommodate all the values of indexer , which would be very large set of indexers to maintain relatively small set of values.  So, hashing function is the mapping from the original set of indexer to the compressed and optimal set of the indexer.

But hash values generated by the hashing function is not unique.  So we have multiple original indexers values having same compressed or hashed indexer value. This is known as hashing collisions. Collision avoidance primarily depends upon the implementation of hash function  algorithm . There are different collision resolution strategy , among them are Linear Probing and Quadratic probing

In linear probing  , we create the hash and place the value at the hashed indexer, but if we find that there is already a value allocated to the indexer (as same hashed indexer could have been generated previously by hash function)  , we simply move on to next indexer (i + 1 ….   i +n)  until we find an empty indexer slot.  In Quadratic probing we use same method as linear probing but when a indexer is encountered having the some value we move on to (i + 1^2), if it is also occupied we look at  (i-1^2)  indexer and keep on looking till we find a empty indexer slot in pattern  (i + 2^2), (i -2^2) …. (i + n^2), (i-n^2) .

HashTable and Dictionary uses different collision resolution techniques called rehashing and chaining . In .Net the HashTable contains the method GetHashCode()  that is inherited from Object class , which is responsible for creating a unique integer value .  GetHashCode() method generate unique integer for any value whether its string or any other type ,  as all types are derived from Object class. This hashcode is used as indexer to allocate a slot for coming values.

HashTable also contains a property LoadFactor that is ratio of total number of items in hashtable to  total number of available slots. Optimal loadfactor as stated by .Net is  0.72 .  So whenever we push a new item into hashtable,  .Net checks to see that loadfactor is not exceeded . In hashtable hash values are calculated based upon the loadfactor , so for a insertion of a value , hash table is rehashed to maintain the loadfactor.

Dictionary is strongly type as it uses generics in comparison to HashTable and it also uses different collision resolution technique known  as chaining . In dictionary a separate data structure is used to hold  the conflicting mapping.

As we have seen in Linear and Quadratic probing that we use a pattern to probe for next available slot for value , but in dictionary a data structure is maintained at the conflicting hashed indexer to hold all the values at that indexer. So we have same hashed indexer for different values in a data strcuture / linked list .  As the accessing and searching in HashTable and Dictionary does not depend upon the number of items due to loadfactor, the  asymptotic runtime  is         O(1) as compared to O(n) in arrays, which is fast.  

http://msdn.microsoft.com/en-in/vcsharp/default.aspx?pull=/library/en-us/dv_vstechart/html/datastructures_guide.asp

http://msdn.microsoft.com/en-us/library/ms379571(VS.80).aspx