lucene-java-user mailing list archives

From Olivier Binda <olivier.bi...@wanadoo.fr>
Subject Re: remapping docIds in a read only offline built index
Date Mon, 02 Jun 2014 07:36:48 GMT
Very nice! That is exactly what I needed. Thank you very much!


On 06/02/2014 09:26 AM, Michael McCandless wrote:
> The index sorting APIs (in lucene/misc) can do this.  E.g. you could
> make a SortingAtomicReader, with your sort criteria, then use
> addIndexes(IR[]) to add it to a new index.  That resulting index would
> have 1 segment and the docIDs would be in your order.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, May 12, 2014 at 12:01 PM, Olivier Binda
> <olivier.binda@wanadoo.fr> wrote:
>> In a 1-segment (parallel) read-only index, that is built offline once (and
>> then frozen),
>> is it possible to remap the docIds as the last step (i.e., to have the
>> exact same index, except that the docIds are all equal to the order in
>> which the docs were added to the index)?
>>
>> Say I have the read only index
>>
>> docId   : document
>> 1 : bookB
>> 2 : sentenceB
>> 3 : linkA
>> 4 : linkC
>> 5 : sentenceC
>> 6 : sentenceA
>> 7 : bookA
>> ...
>> 300000 : linkD
>>
>> I would like to have instead the read-only index
>>
>> docId   : document
>> 1 : bookA
>> 2 : bookB
>> ....
>>
>> M : linkA
>> M+1: linkB
>> ...
>> N+1 : sentenceA
>> N+2 : sentenceB
>> ...
>> 300000:sentenceZZZ
>>
>> This would allow me to reduce the amount of ram to cache the type of each
>> document
>>
>> -> without remapping, I need at least ceil(log2(types)) * documents bits
>> here 2 * 300000 bits
>>
>> -> with remapping, I need only to remember ints M and N
>>
>> Also, if I need to cache 1 byte of metadata for each book
>>
>> -> without remapping, I would need 1 byte * documents
>> here 300000 bytes
>>
>> -> with remapping, I would only need 1 byte * books
>> here M - 1 bytes
>>
>>
>> I tried building such an index with LogMergePolicy/NoMergePolicy/extending
>> the RAM buffer but (maybe I did something wrong)
>> the docIds were always reshuffled (maybe because my index was big and I was
>> over a threshold).
>>
>>
>>
>> Best regards,
>> Olivier
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
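
For readers of the archive, here is a minimal plain-Java sketch of the idea discussed above. It does not call Lucene at all: the SortingAtomicReader and addIndexes route Mike describes does the actual index rewrite, while this only illustrates the docId permutation that a sort by document type produces, and how the type of a document can then be recovered from a few boundary ints (the M and N of the original question) instead of per-document bits. The class name, the Type enum and all method names are hypothetical. DocIds are 0-based here, and the sort is by type only (not by name within a type, as in the question's example).

```java
import java.util.Arrays;
import java.util.Comparator;

public class DocIdRemapSketch {
    // Hypothetical document types, declared in the desired docId order.
    enum Type { BOOK, LINK, SENTENCE }

    /** Returns newToOld, where newToOld[newId] is the old docId stored at newId. */
    static int[] sortedOrder(Type[] typeOfOldDoc) {
        Integer[] ids = new Integer[typeOfOldDoc.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on objects is stable, so documents of the same type
        // keep their original relative order.
        Arrays.sort(ids, Comparator.comparing((Integer id) -> typeOfOldDoc[id]));
        int[] newToOld = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            newToOld[i] = ids[i];
        }
        return newToOld;
    }

    /**
     * Returns starts, where starts[t] is the first new docId of the type with
     * ordinal t and starts[number of types] is the total document count.
     */
    static int[] boundaries(Type[] typeOfOldDoc) {
        int[] starts = new int[Type.values().length + 1];
        for (Type t : typeOfOldDoc) {
            starts[t.ordinal() + 1]++;          // count documents per type
        }
        for (int i = 1; i < starts.length; i++) {
            starts[i] += starts[i - 1];         // prefix sums give range starts
        }
        return starts;
    }

    /** Looks up the type of a remapped docId using only the boundary ints. */
    static Type typeOf(int newDocId, int[] starts) {
        if (newDocId >= 0) {
            for (Type t : Type.values()) {
                if (newDocId < starts[t.ordinal() + 1]) {
                    return t;
                }
            }
        }
        throw new IndexOutOfBoundsException("docId " + newDocId);
    }

    public static void main(String[] args) {
        // The first seven documents of the question, in their original order:
        // bookB, sentenceB, linkA, linkC, sentenceC, sentenceA, bookA
        Type[] types = {
            Type.BOOK, Type.SENTENCE, Type.LINK, Type.LINK,
            Type.SENTENCE, Type.SENTENCE, Type.BOOK
        };
        int[] newToOld = sortedOrder(types);
        int[] starts = boundaries(types);
        System.out.println("newToOld = " + Arrays.toString(newToOld));
        System.out.println("starts   = " + Arrays.toString(starts));
        System.out.println("type of new docId 3 = " + typeOf(3, starts));
    }
}
```

With the documents grouped this way, any per-book metadata can live in an array sized to the number of books and indexed directly by the new docId, which is the saving the question asks about.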



