Good point, Toke. Forgot about that. Of course doubling the number of hash algos used to 4
increases the space massively.
On 21 Oct 2010, at 22:51, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> Mark Harwood [markharw00d@yahoo.co.uk]:
>> Given a large range of IDs (eg your 300 million) you could constrain
>> the number of unique terms using a doublehashing technique e.g.
>> Pick a number "n" for the max number of unique terms you'll tolerate
>> e.g. 1 million and store 2 terms for every primary key using a
>> different hashing function e.g.
>
>> int hashedKey1=hashFunction1(myKey)%maxNumUniqueTerms.
>> int hashedKey2=hashFunction2(myKey)%maxNumUniqueTerms.
>
>> Then queries to retrieve/delete a record use a search for hashedKey1
>> AND hashedKey2. The probability of having the same collision on two
>> different hashing functions is minimal and should return the original record only.
>
> I am sorry, but this won't work. It is a variation of the birthday paradox:
> http://en.wikipedia.org/wiki/Birthday_problem
>
> Assuming the two hashfunctions are ideal so that there will be 1M different values from
each after the modulo, the probability for any given pair of different UIDs having the same
hashes is 1/(1M * 1M). That's very low. Another way to look at it would be to say that there
are 1M * 1M possible values for the aggregated hash function.
>
> Using the recipe from
> http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
> we have
> n = 300M
> d = 1M * 1M
> and the formula 1((d1)/d)^(n*(n1)/2) which gets expanded to
> http://www.google.com/search?q=1(((10000000000001)/1000000000000)^(300000000*(3000000001)/2)
>
> We see that the probability of a collision is ... 1. Or rather, so close to 1 that Google's
calculator will not show any decimals. Turning the number of UIDs down to just 3M, we still
get the probability 0.988889881 for a collision. It does not really help to increase the number
unique hashes as there are simply too many possible pairs of UIDs.
> 
> To unsubscribe, email: devunsubscribe@lucene.apache.org
> For additional commands, email: devhelp@lucene.apache.org
>

To unsubscribe, email: devunsubscribe@lucene.apache.org
For additional commands, email: devhelp@lucene.apache.org
