lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Harwood <markharw...@yahoo.co.uk>
Subject Re: Polymorphic Index
Date Thu, 21 Oct 2010 22:12:49 GMT
Good point, Toke. Forgot about that. Of course doubling the number of hash algos used to 4
increases the space massively. 

On 21 Oct 2010, at 22:51, Toke Eskildsen <te@statsbiblioteket.dk> wrote:

> Mark Harwood [markharw00d@yahoo.co.uk]:
>> Given a large range of IDs (eg your 300 million) you could constrain
>> the number of unique terms using a double-hashing technique e.g.
>> Pick a number "n" for the max number of unique terms you'll tolerate
>> e.g. 1 million and store 2 terms for every primary key using a 
>> different hashing function e.g.
> 
>> int hashedKey1=hashFunction1(myKey)%maxNumUniqueTerms.
>> int hashedKey2=hashFunction2(myKey)%maxNumUniqueTerms.
> 
>> Then queries to retrieve/delete a record use a search for hashedKey1
>> AND hashedKey2. The probability of having the same collision on two
>> different hashing functions is minimal and should return the original record only.
> 
> I am sorry, but this won't work. It is a variation of the birthday paradox:
> http://en.wikipedia.org/wiki/Birthday_problem
> 
> Assuming the two hash-functions are ideal so that there will be 1M different values from
each after the modulo, the probability for any given pair of different UIDs having the same
hashes is 1/(1M * 1M). That's very low. Another way to look at it would be to say that there
are 1M * 1M possible values for the aggregated hash function.
> 
> Using the recipe from
> http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
> we have
> n = 300M
> d = 1M * 1M
> and the formula 1-((d-1)/d)^(n*(n-1)/2) which gets expanded to
> http://www.google.com/search?q=1-(((1000000000000-1)/1000000000000)^(300000000*(300000000-1)/2)
> 
> We see that the probability of a collision is ... 1. Or rather, so close to 1 that Google's
calculator will not show any decimals. Turning the number of UIDs down to just 3M, we still
get the probability 0.988889881 for a collision. It does not really help to increase the number
unique hashes as there are simply too many possible pairs of UIDs.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message