lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: lucene indexing and merge process
Date Sat, 20 Oct 2007 11:57:56 GMT
John,

For case 1, can you describe your document structure?  Do you have a  
lot of other fields besides the UID field?  Most importantly, do you  
have some large fields?

Did you give the FieldSelector mechanism a try?

In fact, I think you may even be able to create a caching  
FieldSelector implementation.  We could add a FieldSelectorResult,  
something like LOAD_AND_CACHE that then caches the info for that Doc,  
Field combination.  Would have to investigate further, but it seems  
like it might work.
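
Very roughly, and leaving Lucene's own classes out of it, the caching
part might look like the sketch below.  CachingFieldLoader and the
slowLoader callback are made-up names standing in for a stored-field
read; LOAD_AND_CACHE does not exist today, and a real version would
need eviction and thread safety.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical sketch: remembers stored-field values per (doc, field) pair. */
final class CachingFieldLoader {
    // Stand-in for the expensive stored-field read; not a Lucene API.
    private final BiFunction<Integer, String, String> slowLoader;
    private final Map<String, String> cache = new HashMap<>();
    private int loads = 0;

    CachingFieldLoader(BiFunction<Integer, String, String> slowLoader) {
        this.slowLoader = slowLoader;
    }

    /** Returns the field value, calling the slow loader only on a cache miss. */
    String get(int docId, String field) {
        // docId is numeric, so "docId:field" is an unambiguous key.
        String key = docId + ":" + field;
        String value = cache.get(key);
        if (value == null) {
            value = slowLoader.apply(docId, field);
            loads++;
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Number of times the slow loader was actually invoked. */
    int loadCount() {
        return loads;
    }
}
```

A second get(7, "uid") on the same loader returns the cached value
without touching the slow path again.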

Just thinking out loud...

-Grant

On Oct 18, 2007, at 10:38 AM, Erik Hatcher wrote:

> Forwarding this to java-dev per request.  Seems like the best place  
> to discuss this topic.
>
> 	Erik
>
>
> Begin forwarded message:
>
>> From: "John Wang" <john.wang@gmail.com>
>> Date: October 17, 2007 5:43:29 PM EDT
>> To: erik@ehatchersolutions.com
>> Subject: lucene indexing and merge process
>>
>> Hi Erik:
>>
>>     We are revamping our search system here at LinkedIn. And we  
>> are using Lucene.
>>
>>     One issue we ran across is that we store a UID in Lucene  
>> which we map to the DB storage. So given a docid, to lookup its  
>> UID, we have the following solutions:
>>
>> 1) Index it as a Stored field and get it from reader.document  
>> (very slow if recall is large)
>> 2) Load/Warmup the FieldCache (for large corpus, loading up the  
>> indexreader can be slow)
>> 3) construct it using the FieldCache and persist it on disk  
>> every time the index changes. (not suitable for real time indexing,  
>> e.g. this process will degrade as the # of documents gets large)
>>
>>     None of the above solutions turn out to be adequate for our  
>> requirements.
>>
>>      What we end up doing is to modify Lucene code by changing  
>> SegmentReader, DocumentWriter, and FieldWriter classes by taking  
>> advantage of the Lucene Segment/merge process. E.g:
>>
>>      For each segment, we store a .udt file, which is an int[]  
>> array, (by changing the FieldWriter class)
>>
>>      And SegmentReader will load the .udt file into an array.
>>
>>      And merge happens seamlessly.
>>
>>      Because of the tight encapsulation around these classes, e.g.  
>> private and final methods, it is very difficult to extend Lucene  
>> while avoiding branching into our own version. Is there a way we can  
>> open up and make these classes extensible? We'd be happy to  
>> contribute what we have done.
>>
>>      I guess to tackle the problem from a different angle: is  
>> there a way to incorporate FieldCache into the segments (it is  
>> strictly in memory now), and build disk versions while indexing.
>>
>>
>>      Hope I am making sense.
>>
>>     I did not send this out to the mailing list because I wasn't  
>> sure if this is a dev question or a user question, feel free to  
>> either forward it to the right mailing list or let me know and I  
>> can forward it.
>>
>>
>> Thanks
>>
>> -John
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
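
For reference, the per-segment .udt idea in John's message could be
sketched like this.  It is only an illustration, not his patch: the
UidSegmentFile class, the length-prefixed file layout, and the merge
helper are assumptions, and the merge assumes segments are concatenated
in order with deleted documents dropped.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.BitSet;

/** Hypothetical sketch of a per-segment docid -> UID array stored on disk. */
final class UidSegmentFile {
    /** Writes the array as a length prefix followed by one int per document. */
    static void write(File file, int[] uids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(uids.length);
            for (int uid : uids) {
                out.writeInt(uid);
            }
        }
    }

    /** Loads the whole array back into memory, indexed by segment-local docid. */
    static int[] read(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int[] uids = new int[in.readInt()];
            for (int i = 0; i < uids.length; i++) {
                uids[i] = in.readInt();
            }
            return uids;
        }
    }

    /** Concatenates segments in order, skipping docs flagged as deleted. */
    static int[] merge(int[][] segments, BitSet[] deleted) {
        int live = 0;
        for (int s = 0; s < segments.length; s++) {
            live += segments[s].length - deleted[s].cardinality();
        }
        int[] merged = new int[live];
        int next = 0;
        for (int s = 0; s < segments.length; s++) {
            for (int doc = 0; doc < segments[s].length; doc++) {
                if (!deleted[s].get(doc)) {
                    merged[next++] = segments[s][doc];
                }
            }
        }
        return merged;
    }
}
```

The lookup itself is then just uids[docid] on the loaded array, which
is the property that makes this cheaper than reader.document.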

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




