Valid
        XHTML 1.1! Valid CSS!
Created 2005-09-16   Modified 2009-04-11
Chelton Evans

proj bucket home

Intro
Source
Experimental Results for 1D Sorting on double Type
Theory
Worst Case Complexity and the Hybrid Sorting Algorithm
Hash Functions
TODO

Source

Files

Makefile
bucket.h
buckethybrid.h
bucketlink.h
buckettest.cpp
buckettest.h
hashfunction01.cpp
hashfunction01.h
hashtable.h
hashtable2.h
hashtabletest.cpp
hashtabletest.h
main.cpp
stringhash.cpp
stringhash.h
stringpairhash.h
unitdouble.h

projcompile.txt
testscript01.txt
unittestsreport.txt

Doxygen

main.cpp
Makefile
bucket
buckethybrid
bucketlink
buckettest
hashfunction01
hashtable
hashtable2
hashtableiterator
hashtabletest
stringhash
stringpair
stringpairhash
unitdouble

Intro

This is a generalized bucket sorting implementation. I truly mean this in the literal sense, the client defines both the index function and the comparison or less than operator.

So it is a case of what I haven't said thats interesting. This class was designed like this because I want to sort on the same data structure in different ways. And I want to do it in O(n) time.

I have developed a hybrid which forwards the sorting to a default sorter in the worst complexity case.

Experimental Results for 1D Sorting on double Type

40000 0.01
80000 0.03
160000 0.08
320000 0.14
640000 0.29
1280000 0.61
2560000 1.36
5120000 3.01
10240000 7.42
Killed

When plotted this gives a complexity of O(n^1.1). The target was O(n). There seems to be a growth after a 1,280,000 point were used. If we ignore the first measurement and all after 1,280,000 then we get the following dataset which gives a textbook linear relationship.

80000 0.03
160000 0.08
320000 0.14
640000 0.29
1280000 0.61
img01.png img02.png

Worst Case Complexity and the Hybrid Sorting Algorithm

Consider when bucket has all the same data. The sorting then becomes O(n^2).

Imagine that the index function inserted all the elements into the same location, and that all the data was already ordered from smallest to greatest. Then each insertion would go down the list and insert at the end - O(n) operation and there are n such elements so n by n is n squared.

This was changed to O( n logn ) by detecting such an event and using standard O( n log n ) sorter. By having a hybrid sorter we guarantee the complexity in the worst cases and get linear complexity in the best.

Detection consisted of each link having a maximum number of links counter. If this counter exceeds a thresh-hold value dependent on the data set size, then have all values that are inserted into this index or list be moved and inserted into a O( n log n ) sorter. See buckethybrid.

Essentially after all the elements are inserted, the two containers are merged in O(n) time.

TODO

Hash Functions

The bucket sort uses a special hash function which creates a hash on the data value that is ordered.

From an implementation point of view hashing and bucket sorting are similar and their theory is intertwined.

Further both structures can be used for iterating over the data in O(n) time. This property does not get mentioned much but the hash tables are also useful for iterating over the data set once its inserted. Just test if the link is not empty and iterate over the chain, do this for all chains.

I implemented a hash table and a known hash function. The basic idea of a hash function is to uniquely randomize a distribution. i.e. From a non random distribution produce a distribution that is random.

As mentioned elsewhere hash functions use shifting, and other logical operations on data to produce one-way functions on the data.

I designed the hash table to return a link to the data so that the client has the opportunity to modify the data. A classic example is data with an embedded reference counter. Just access the data, increment the counter and leave to increase the data's reference.