Mark Harwood [markharw00d@yahoo.co.uk]:
> Given a large range of IDs (eg your 300 million) you could constrain
> the number of unique terms using a double-hashing technique e.g.
> Pick a number "n" for the max number of unique terms you'll tolerate
> e.g. 1 million and store 2 terms for every primary key using a
> different hashing function e.g.
> int hashedKey1=hashFunction1(myKey)%maxNumUniqueTerms.
> int hashedKey2=hashFunction2(myKey)%maxNumUniqueTerms.
> Then queries to retrieve/delete a record use a search for hashedKey1
> AND hashedKey2. The probability of having the same collision on two
> different hashing functions is minimal and should return the original record only.
I am sorry, but this won't work. It is a variation of the birthday paradox:
http://en.wikipedia.org/wiki/Birthday_problem
Assuming the two hash functions are ideal, each will yield 1M distinct values after the modulo,
and the probability that any given pair of different UIDs shares both hashes
is 1/(1M * 1M). That is very low. Another way to look at it is that the
aggregated hash function has 1M * 1M possible values.
Using the recipe from
http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
we have
n = 300M
d = 1M * 1M
and the formula 1-((d-1)/d)^(n*(n-1)/2), which gets expanded to
http://www.google.com/search?q=1-(((1000000000000-1)/1000000000000)^(300000000*(300000000-1)/2)
We see that the probability of a collision is ... 1. Or rather, so close to 1 that Google's
calculator will not show any decimals. Turning the number of UIDs down to just 3M, we still
get a collision probability of 0.988889881. It does not really help to increase the number
of unique hashes, as there are simply too many possible pairs of UIDs.
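The arithmetic can be checked with a short sketch; the method name is mine, and the probability is computed in log space (via Math.log1p) so it does not underflow for n = 300M:

```java
public class BirthdayBound {
    /**
     * Probability of at least one collision among n values drawn
     * uniformly from d possibilities: 1 - ((d-1)/d)^(n*(n-1)/2).
     * Computed via log1p/exp to stay accurate for huge exponents.
     */
    static double collisionProbability(double n, double d) {
        double pairs = n * (n - 1) / 2.0;
        // log((d-1)/d) = log1p(-1/d), precise even for d = 10^12
        return 1.0 - Math.exp(pairs * Math.log1p(-1.0 / d));
    }

    public static void main(String[] args) {
        double d = 1e6 * 1e6; // 1M * 1M aggregated hash values
        System.out.println(collisionProbability(300e6, d)); // ~1.0
        System.out.println(collisionProbability(3e6, d));   // ~0.9889
    }
}
```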

To unsubscribe, email: dev-unsubscribe@lucene.apache.org
For additional commands, email: dev-help@lucene.apache.org
