# lucene-dev mailing list archives

From: Toke Eskildsen <...@statsbiblioteket.dk>
Subject: RE: Polymorphic Index
Date: Thu, 21 Oct 2010 21:51:52 GMT
```
Mark Harwood [markharw00d@yahoo.co.uk]:
> Given a large range of IDs (e.g. your 300 million) you could constrain
> the number of unique terms using a double-hashing technique: pick a
> number "n" for the max number of unique terms you'll tolerate
> (e.g. 1 million) and store 2 terms for every primary key, each using a
> different hashing function, e.g.

> int hashedKey1 = hashFunction1(myKey) % maxNumUniqueTerms;
> int hashedKey2 = hashFunction2(myKey) % maxNumUniqueTerms;

> Then queries to retrieve/delete a record use a search for hashedKey1
> AND hashedKey2. The probability of two different keys colliding under
> both hashing functions is minimal, so the query should return the
> original record only.
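
For concreteness, the quoted scheme might be sketched like this (the two hash functions below are illustrative stand-ins, not anything from the original mail; Math.floorMod keeps the bucket non-negative, which a plain % would not for negative hash values):

public class DoubleHashedKeys {
    static final int MAX_UNIQUE_TERMS = 1_000_000;

    // Stand-in hash #1: Java's own String hash.
    static int hashFunction1(String key) {
        return key.hashCode();
    }

    // Stand-in hash #2: FNV-1a, so the two functions collide independently.
    static int hashFunction2(String key) {
        int h = 0x811c9dc5;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 0x01000193;
        }
        return h;
    }

    public static void main(String[] args) {
        String myKey = "some-primary-key";
        int hashedKey1 = Math.floorMod(hashFunction1(myKey), MAX_UNIQUE_TERMS);
        int hashedKey2 = Math.floorMod(hashFunction2(myKey), MAX_UNIQUE_TERMS);
        // Index both terms; retrieve/delete would then search for
        // hash1:<hashedKey1> AND hash2:<hashedKey2>
        System.out.println(hashedKey1 + " " + hashedKey2);
    }
}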

I am sorry, but this won't work. It is a variation of the birthday paradox:
http://en.wikipedia.org/wiki/Birthday_problem

Assuming the two hash functions are ideal, so that each yields 1M different values
after the modulo, the probability of any given pair of different UIDs sharing both hashes
is 1/(1M * 1M). That's very low. Another way to look at it is that there are
1M * 1M possible values for the aggregated hash function.
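
But the tiny per-pair probability is multiplied by an enormous number of pairs. A back-of-the-envelope check (plain arithmetic, nothing Lucene-specific):

public class BirthdayCheck {
    public static void main(String[] args) {
        double d = 1e6 * 1e6;            // 1e12 possible (hash1, hash2) combinations
        double n = 300e6;                // 300M unique IDs
        double pairs = n * (n - 1) / 2;  // ~4.5e16 distinct pairs of IDs
        double expected = pairs / d;     // expected number of colliding pairs
        System.out.println(expected);    // ~45,000 expected collisions
    }
}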

Using the recipe from
http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
we have
n = 300M
d = 1M * 1M
and the formula 1-((d-1)/d)^(n*(n-1)/2) which gets expanded to