lucene-dev mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Polymorphic Index
Date Fri, 22 Oct 2010 09:23:35 GMT
Sure, it would all work, and it would be better than the "naive" indexed UID.
Mapping several UIDs to one term permits this compromise: "number of unique
terms in the term dict against CPU spent during update to resolve collisions".
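
To make that trade-off concrete, here is a minimal sketch of the
collision-tolerant update Toke describes in the quoted message below. It is a
toy in-memory model, not Lucene code: the class HashedUidIndex, its methods,
the single hash function and the `source` map (standing in for data fetched
from outside the index by UID) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy in-memory model of a hashed UID field. Not Lucene code; all names are hypothetical.
class HashedUidIndex {
    private final int buckets;  // number of distinct hash terms allowed (the tuning knob)
    private final Map<Integer, Set<String>> postings = new HashMap<>(); // hash term -> stored UIDs
    private final Map<String, String> source = new HashMap<>();        // stand-in for data outside the index
    private int reindexed = 0;  // documents re-added only because of a false collision

    HashedUidIndex(int buckets) {
        if (buckets < 1) throw new IllegalArgumentException("buckets must be >= 1");
        this.buckets = buckets;
    }

    private int term(String uid) {
        return Math.floorMod(uid.hashCode(), buckets);
    }

    void update(String uid, String content) {
        source.put(uid, content);
        // Delete by hash term: this drops the target document and every false collision.
        Set<String> deleted = postings.remove(term(uid));
        Set<String> rebuilt = new HashSet<>();
        if (deleted != null) {
            for (String other : deleted) {
                if (!other.equals(uid)) {   // the stored UID tells a real hit from a false collision
                    rebuilt.add(other);     // re-index it (its content would come from `source`)
                    reindexed++;
                }
            }
        }
        rebuilt.add(uid);
        postings.put(term(uid), rebuilt);
    }

    String contentOf(String uid) { return source.get(uid); }
    int uniqueTerms() { return postings.size(); }
    int reindexedCollisions() { return reindexed; }

    public static void main(String[] args) {
        // Fewer buckets: smaller term dictionary, more re-index work per update.
        for (int buckets : new int[] {1, 16, 1 << 20}) {
            HashedUidIndex idx = new HashedUidIndex(buckets);
            for (int i = 0; i < 1000; i++) idx.update("doc-" + i, "v1");
            System.out.println(buckets + " buckets: " + idx.uniqueTerms()
                + " terms, " + idx.reindexedCollisions() + " collision re-indexes");
        }
    }
}
```

With one bucket the term dictionary holds a single term but every update
re-indexes all other documents; with a very large bucket count the model
degenerates to the naive one-term-per-UID case.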


I like Paul's idea with more fields: it reduces the number of UID terms in the
term dictionary, but increases the density of the postings lists for these
terms. It simplifies the update, as no collisions are possible; it just makes
it slower.
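
A minimal sketch of how the multi-field variant might look. Paul's message is
not quoted here, so this assumes the idea is to split each fixed-width UID
into chunks and index each chunk in its own field; the class SplitUidIndex,
the field names uid0, uid1, ... and the chunking scheme are hypothetical, and
the index is a toy in-memory model rather than Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model: each UID chunk goes into its own field. Assumes fixed-width UIDs.
class SplitUidIndex {
    private final int width;     // required UID length
    private final int chunkLen;  // characters per field
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>(); // "field:chunk" -> doc ids
    private int nextDoc = 0;

    SplitUidIndex(int width, int chunkLen) {
        if (width < 1 || chunkLen < 1) throw new IllegalArgumentException("sizes must be >= 1");
        this.width = width;
        this.chunkLen = chunkLen;
    }

    private List<String> terms(String uid) {
        if (uid.length() != width) throw new IllegalArgumentException("UID must be " + width + " chars");
        List<String> out = new ArrayList<>();
        for (int i = 0, field = 0; i < width; i += chunkLen, field++) {
            out.add("uid" + field + ":" + uid.substring(i, Math.min(width, i + chunkLen)));
        }
        return out;
    }

    int add(String uid) {
        int doc = nextDoc++;
        for (String t : terms(uid)) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(doc);
        }
        return doc;
    }

    // Intersect the postings of every chunk term: exact match, so no false collisions.
    Set<Integer> find(String uid) {
        Set<Integer> result = null;
        for (String t : terms(uid)) {
            TreeSet<Integer> p = postings.get(t);
            if (p == null) return Collections.emptySet();
            if (result == null) result = new TreeSet<>(p);
            else result.retainAll(p);
        }
        return result;
    }

    int uniqueTerms() { return postings.size(); }

    int longestPostingsList() {
        int max = 0;
        for (TreeSet<Integer> p : postings.values()) max = Math.max(max, p.size());
        return max;
    }

    public static void main(String[] args) {
        SplitUidIndex idx = new SplitUidIndex(4, 2);
        for (int i = 0; i < 10000; i++) idx.add(String.format("%04d", i));
        System.out.println("terms=" + idx.uniqueTerms()
            + " longest postings=" + idx.longestPostingsList()
            + " hit=" + idx.find("1234"));
    }
}
```

The term dictionary shrinks because chunks repeat across UIDs, each chunk
term's postings list grows accordingly, and a lookup has to intersect one
postings list per field, which is where the extra update cost comes from.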


It is all too fiddly and suboptimal, and one needs to tune to find an optimum
here, but hey, it is better than the naive approach.


Both of these solutions are just a better way to do it wrong :) The real
solution is definitely somewhere around ParallelReader usage.

Ideally, one should be able to say, when opening an index, which parts of the
index one is going to use. One way to do it is to create parallel indexes; the
searching part is fully functional and already there.


Anyone using ParallelReader, any tips on creating parallel indexes?

In my particular case, ParallelReader is not strictly necessary, because I
"only" need to filter out one field from the term dictionary, together with
its postings, during RAMDisk loading. One has some flexibility to do a lot
with FileSwitchDirectory, but postings for one field are not in separate
files...


Thanks for the good tips; we found two better solutions for our "UID use cases
toolbox".

Cheers, eks

----- Original Message ----
> From: Toke Eskildsen <te@statsbiblioteket.dk>
> To: "dev@lucene.apache.org" <dev@lucene.apache.org>
> Sent: Fri, 22 October, 2010 0:32:04
> Subject: RE: Polymorphic Index
> 
> From: Mark Harwood [markharw00d@yahoo.co.uk]
> > Good point, Toke. Forgot about that. Of course doubling the number
> > of hash algos used to 4 increases the space massively.
> 
> Maybe your hashing idea could work even with collisions?
> 
> Using your original two-hash suggestion, we're just about sure to get
> collisions. However, we are still able to uniquely identify the right
> document, as the UID is also stored (search for the hashes, iterate over
> the results and get the UID for each). When an update is requested for an
> existing document, the indexer extracts the UIDs from all the documents
> that match the hash. Then it performs a delete of the hash-terms and
> re-indexes all the documents that had "false" collisions. As the number of
> unique hash-values as well as the hash-function can be adjusted, this could
> be a nicely tweakable performance-vs-space trade-off.
> 
> This will only work if it is possible to re-create the documents from
> stored terms or by requesting the data from outside of Lucene by UID. Is
> this possible with your setup, eks dev?
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For  additional commands, e-mail: dev-help@lucene.apache.org
> 
> 


      


