lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: lucene indexing and merge process
Date Sat, 20 Oct 2007 11:57:56 GMT

For case 1, can you describe your document structure?  Do you have a  
lot of other fields besides the UID field?  Most importantly, do you  
have some large fields?

Did you give the FieldSelector mechanism a try?

In fact, I think you may even be able to create a caching  
FieldSelector implementation.  We could add a FieldSelectorResult,  
something like LOAD_AND_CACHE that then caches the info for that Doc,  
Field combination.  Would have to investigate further, but it seems  
like it might work.
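
To make the caching idea concrete, here is a minimal, self-contained sketch of remembering a value per (doc, field) pair so the slow stored-field read happens only once. It does not use Lucene: CachingFieldLoader, slowLoader and loadCount are hypothetical names, and the lambda in main merely stands in for an expensive call such as reader.document. A real LOAD_AND_CACHE result would need changes inside Lucene itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch, not Lucene API: caches field values per (docId, field).
class CachingFieldLoader {
    // Stand-in for the expensive stored-field read.
    private final BiFunction<Integer, String, String> slowLoader;
    // docId -> (field name -> cached value)
    private final Map<Integer, Map<String, String>> cache = new HashMap<>();
    private int loads = 0;

    CachingFieldLoader(BiFunction<Integer, String, String> slowLoader) {
        this.slowLoader = slowLoader;
    }

    // Returns the cached value, calling the slow loader only on a cache miss.
    String get(int docId, String field) {
        Map<String, String> fields = cache.computeIfAbsent(docId, k -> new HashMap<>());
        if (!fields.containsKey(field)) {
            loads++;
            fields.put(field, slowLoader.apply(docId, field));
        }
        return fields.get(field);
    }

    // Number of times the slow loader was actually invoked.
    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        // The lambda fakes a stored-field lookup for illustration only.
        CachingFieldLoader loader = new CachingFieldLoader((doc, field) -> field + "-" + doc);
        System.out.println(loader.get(42, "uid"));
        System.out.println(loader.get(42, "uid")); // served from the cache
        System.out.println("slow loads: " + loader.loadCount());
    }
}
```

The cache in this sketch is unbounded and is never invalidated, so anything real would need an eviction policy and would have to be dropped or rebuilt when the reader is reopened.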

Just thinking out loud...


On Oct 18, 2007, at 10:38 AM, Erik Hatcher wrote:

> Forwarding this to java-dev per request.  Seems like the best place  
> to discuss this topic.
> 	Erik
> Begin forwarded message:
>> From: "John Wang" <>
>> Date: October 17, 2007 5:43:29 PM EDT
>> To:
>> Subject: lucene indexing and merge process
>> Hi Erik:
>>     We are revamping our search system here at LinkedIn, and we  
>> are using Lucene.
>>     One issue we ran across is that we store a UID in Lucene,  
>> which we map to the DB storage. So given a docid, to look up its  
>> UID, we have the following solutions:
>> 1) Index it as a Stored field and get it from reader.document  
>> (very slow if recall is large)
>> 2) Load/Warmup the FieldCache (for large corpus, loading up the  
>> indexreader can be slow)
>> 3) Construct it using the FieldCache and persist it on disk  
>> every time the index changes. (not suitable for real-time indexing,  
>> e.g. this process will degrade as the number of documents gets large)
>>     None of the above solutions turn out to be adequate for our  
>> requirements.
>>      What we ended up doing is modifying the Lucene code, changing  
>> the SegmentReader, DocumentWriter, and FieldWriter classes to take  
>> advantage of the Lucene segment/merge process. E.g.:
>>      For each segment, we store a .udt file, which is an int[]  
>> array, (by changing the FieldWriter class)
>>      And SegmentReader will load the .udt file into an array.
>>      And merge happens seamlessly.
>>      Because of the tight encapsulation around these classes, e.g.  
>> private and final methods, it is very difficult to extend Lucene  
>> without branching into our own version. Is there a way we can  
>> open up these classes and make them extensible? We'd be happy to  
>> contribute what we have done.
>>      I guess to tackle the problem from a different angle: is  
>> there a way to incorporate FieldCache into the segments (it is  
>> strictly in memory now), and build disk versions while indexing?
>>      Hope I am making sense.
>>     I did not send this out to the mailing list because I wasn't  
>> sure if this is a dev question or a user question. Feel free to  
>> either forward it to the right mailing list or let me know and I  
>> can forward it.
>> Thanks
>> -John
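
The per-segment UID array described in the forwarded message can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the patch John describes and not Lucene code: SegmentUidTable and all of its methods are hypothetical names, each segment is just an in-memory int[] (no .udt file I/O), and the merge is assumed to concatenate segments in order while dropping deleted documents, which is how document numbers get reassigned when segments are merged.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical model of per-segment UID arrays; not Lucene code.
class SegmentUidTable {
    // One segment: uids[i] is the UID of segment-local document i.
    static final class Segment {
        final int[] uids;
        final BitSet deleted; // segment-local docs marked as deleted

        Segment(int[] uids) {
            this.uids = uids;
            this.deleted = new BitSet(uids.length);
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    void addSegment(int[] uids) {
        segments.add(new Segment(uids.clone()));
    }

    void delete(int segmentIndex, int localDoc) {
        segments.get(segmentIndex).deleted.set(localDoc);
    }

    // Global docid -> UID: walk the segments, subtracting each segment's size.
    int uid(int docId) {
        if (docId < 0) throw new IndexOutOfBoundsException("docId " + docId);
        int remaining = docId;
        for (Segment s : segments) {
            if (remaining < s.uids.length) return s.uids[remaining];
            remaining -= s.uids.length;
        }
        throw new IndexOutOfBoundsException("docId " + docId);
    }

    // Merge all segments into one, skipping deleted docs so docids are renumbered.
    void mergeAll() {
        int live = 0;
        for (Segment s : segments) live += s.uids.length - s.deleted.cardinality();
        int[] merged = new int[live];
        int next = 0;
        for (Segment s : segments) {
            for (int i = 0; i < s.uids.length; i++) {
                if (!s.deleted.get(i)) merged[next++] = s.uids[i];
            }
        }
        segments.clear();
        segments.add(new Segment(merged));
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SegmentUidTable table = new SegmentUidTable();
        table.addSegment(new int[] {100, 101, 102});
        table.addSegment(new int[] {200, 201});
        System.out.println(table.uid(3)); // first doc of the second segment
        table.delete(0, 1);
        table.mergeAll();
        System.out.println(table.uid(1)); // docids shift down after the merge
    }
}
```

The lookup is a plain array read once the segment is located, which is the appeal over reading a stored field per hit. The linear walk over segments is only for brevity; with many segments a precomputed array of segment start offsets and a binary search would be the more usual choice.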

Grant Ingersoll
