lucene-dev mailing list archives

From Mark Harwood
Subject Re: Polymorphic Index
Date Thu, 21 Oct 2010 20:56:20 GMT
Perhaps another way of thinking about the problem:

Given a large range of IDs (e.g. your 300 million), you could constrain the number of unique
terms using a double-hashing technique:
Pick a number "n" for the maximum number of unique terms you'll tolerate (e.g. 1 million) and store
2 terms for every primary key, each produced by a different hashing function, e.g.

int hashedKey1 = Math.floorMod(hashFunction1(myKey), maxNumUniqueTerms);
int hashedKey2 = Math.floorMod(hashFunction2(myKey), maxNumUniqueTerms);
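
To make the two lines above concrete, here is a minimal sketch. The choice of CRC32 and FNV-1a as the two hash functions, the class name and the field names uidHash1/uidHash2 are all assumptions for illustration, not anything Lucene prescribes; any two reasonably independent hash functions would do.

```java
import java.util.zip.CRC32;

/** Sketch of the double-hashing idea: map each primary key to two bounded terms. */
final class DoubleHashTerms {
    // Upper bound on unique terms per hash function (the "n" from the message).
    static final int MAX_NUM_UNIQUE_TERMS = 1_000_000;

    // First hash: CRC32 from the JDK (an arbitrary choice for this sketch).
    static int hashedKey1(byte[] key) {
        CRC32 crc = new CRC32();
        crc.update(key, 0, key.length);
        // getValue() is a non-negative long, so a plain % is safe here.
        return (int) (crc.getValue() % MAX_NUM_UNIQUE_TERMS);
    }

    // Second hash: 32-bit FNV-1a (again an arbitrary choice for this sketch).
    static int hashedKey2(byte[] key) {
        int h = 0x811C9DC5;          // FNV offset basis
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;         // FNV prime
        }
        // h may be negative; floorMod keeps the result in [0, MAX_NUM_UNIQUE_TERMS).
        return Math.floorMod(h, MAX_NUM_UNIQUE_TERMS);
    }

    // Classic query-parser style string requiring both terms.
    // The field names are hypothetical.
    static String lookupQuery(byte[] key) {
        return "+uidHash1:" + hashedKey1(key) + " +uidHash2:" + hashedKey2(key);
    }
}
```

At index time both values would be added to the document as two indexed terms; at update or delete time lookupQuery(key) gives the conjunction to search for.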

Queries to retrieve/delete a record then search for hashedKey1 AND hashedKey2. The probability
of two different keys colliding under both hashing functions is minimal, so the search should return
the original record only.
Obviously you would still have the postings recorded, but these would be slightly more compact,
e.g. each of your 1 million unique terms would have ~300 gap-encoded vint entries per hashing function, as opposed
to 300m terms each with a single posting stored as one full int.
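
As a rough check on those numbers, the sketch below works out the postings per term and the expected number of other keys that would match the same pair of hashed terms. It assumes both hash functions are uniform and independent of each other, which real hash functions only approximate; the class and method names are made up for this example.

```java
/** Back-of-envelope estimates for the double-hashing scheme (idealized hashes assumed). */
final class DoubleHashEstimates {
    // Average postings per term for one hash function: keys spread evenly over n terms.
    static long postingsPerTerm(long numKeys, long maxNumUniqueTerms) {
        return numKeys / maxNumUniqueTerms;
    }

    // Expected number of OTHER keys sharing both hashed terms with a given key.
    // Each other key matches with probability 1/n for each hash, so 1/n^2 for both.
    static double expectedDoubleCollisions(long numKeys, long maxNumUniqueTerms) {
        double pairs = (double) maxNumUniqueTerms * (double) maxNumUniqueTerms;
        return (numKeys - 1) / pairs;
    }
}
```

With 300 million keys and n = 1 million this gives 300 postings per term and roughly 3e-4 expected extra matches per lookup. That is small for a single lookup but not zero across the whole collection, so comparing the stored UID on the hits before deleting would be a cheap safeguard.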


On 21 Oct 2010, at 20:44, eks dev wrote:

> Hi All, 
> I am trying to figure out a way to implement the following use case with
> lucene/solr.
> In order to support simple incremental updates (master) I need to index and
> store a UID field on a 300 million document collection. (My UID is a 32-byte sequence.) But I do
> not need it indexed (only stored) during normal searching (slaves).
> The problem is that my term dictionary gets blown up by the sheer number of
> unique IDs. The number of unique terms on this collection, excluding the UID, is less
> than 7 million.
> I can tolerate the resource hit on the Updater (big hardware, on-disk index...).
> This is a master-slave setup where searchers run from RAMDisk, and having
> 300 million * 32 bytes (give or take prefix compression) plus pointers to postings and
> postings is something I would really love to avoid, as this is significant
> compared to the really small documents I have.
> Cutting to the chase:
> How can I have an indexed UID field and, when done with indexing:
> 1) Load a "searchable" index into RAM from such an on-disk index without that one
> field?
> 2) Create 2 indices in sync on docIDs, one containing only the indexed UID?
> 3) Somehow transform the index with the indexed UID by dropping the UID field, preserving
> docIDs? A kind of smart index-editing tool.
> Something else already there that I do not know of?
> Preserving docIDs is crucial, as I need support for lovely incremental updates
> (like in solr master-slave update). Also, the stored field should remain!
> I am not looking for "use an MMAPed index and let the OS deal with it" advice...
> I do not mind doing it with the flex branch 4.0, not being in a hurry.
> Thanks in advance, 
> Eks 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
