lucene-dev mailing list archives

From Grant Ingersoll <gsing...@syr.edu>
Subject Re: Lazy Field Loading
Date Tue, 04 Apr 2006 21:48:35 GMT


Yonik Seeley wrote:
> On 4/4/06, Grant Ingersoll <gsingers@syr.edu> wrote:
>   
>> I am not sure you need 509 when you have Lazy loading.
>>     
>
> It would be nice to avoid creating a Field object at all... we have
> some crazy documents with more than 1000 fields :-)  I think the Field
> object itself takes up more room than the data.
>
> For my usecases, specifying which fields should be lazily loaded
> doesn't work well...  I know which fields I want, not which ones I
> don't.
>
>   
True, true.  Looking at the code, I don't think it is that hard to 
do.  As 509 states, the main issue is that you still need to read in all 
the Fields in a document.

Mark Harwood had an interesting post earlier on this same thread about 
some other possibilities for interfaces.


>> My use case is below (my guess is this is quite common).
>>
>> Run a search, get back your hits and display summary information on the
>> hits (i.e. the "small" fields).  User picks the Hit they want to see
>> more info on, go display the full document
>>     
>
> It seems like the only way this can work is if you keep the index
> searcher open and cache the Hits object that the user used.  How long
> do you keep that searcher open waiting for the user to do something? 
> I guess it could work as long as you have logic to re-execute the
> query if the searcher changes...
>   

Yeah, we aren't updating a lot, so we cache the searchers.  If you 
followed the other thread I have going on the "Semantics of 
IndexInput...", Doug and I discussed that accessing the stream becomes 
undefined after the stream is closed.  So, while a lazy load does still 
work in some cases after that, it isn't guaranteed, and any application 
would need to be able to handle this.
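
To make the "application would need to handle this" point concrete, here 
is a rough sketch in plain Java.  It is not Lucene code: FakeStream, 
LazyValue and getOrReload are names made up for illustration, and the 
choice to throw IllegalStateException on a closed stream is an assumption 
of the sketch (the real behavior is undefined, as noted above).

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an open index stream (not a Lucene class). */
final class FakeStream implements AutoCloseable {
    private final String data;
    private boolean closed;

    FakeStream(String data) { this.data = data; }

    String read() {
        // Assumption of this sketch: a closed stream fails loudly.
        if (closed) throw new IllegalStateException("stream closed");
        return data;
    }

    @Override public void close() { closed = true; }
}

/** Reads from the stream on first access, then serves the cached value. */
final class LazyValue {
    private final FakeStream source;
    private String cached;

    LazyValue(FakeStream source) { this.source = source; }

    String get() {
        if (cached == null) cached = source.read();
        return cached;
    }

    /** Application-side guard: fall back to a reload if the stream is gone. */
    String getOrReload(Supplier<String> reload) {
        try {
            return get();
        } catch (IllegalStateException e) {
            return reload.get(); // e.g. re-run the query on a fresh searcher
        }
    }
}
```

A value that was read before the close keeps working from the cache; one 
that was never read has to go through the reload path.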
>   
>> , including, most likely, the
>> info in the really large stored fields (i.e the original document).  To
>> date, I have been storing this info elsewhere b/c of the loading
>> penalty.  With lazy loading, I don't need to do this.  I can just defer
>> loading until the second level access is needed and I never load it if
>> the user doesn't ask for it.
>>     
>
> Actually, for really large text fields, I can see that you wouldn't
> want lucene to re-parse the fields anyway, so I agree that lazy
> loading helps there.
>
>   
>> In the case where you only get a few smaller fields, you have to go back
>> and get the document again when you want to display the contents of the
>> large field.
>>
>> Of course, there are several other use cases where you may only want
>> certain fields, but I don't think there is much cost associated with
>> loading small fields, just the large ones, so you can just make them lazy.
>>     
>
> Part of the cost is iterating through all the fields of the Document
> looking for the one or two you want.
>
>   

Yeah, not sure if there is a good solution to this.  Maybe the file 
formats could be altered so that all the meta info about the fields is 
stored up front, followed by the field data.  This would at least speed 
it up.  One of the things I think both SOLR and what we call IR Tools at 
CNLP (see my ApacheCon talk) do is provide better access to the 
metadata about fields/indexes, etc.  It is hard, in Lucene, to know what 
fields belong to what documents and how they are indexed.  You must save 
this information in your application, even though most, if not all, of 
it is already in Lucene in some form.
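
As a rough illustration of the "meta info up front" idea, the sketch 
below lays out one document as a header of (field id, byte length) pairs 
followed by the values back to back, so a reader can scan the small 
header and jump straight to the one value it wants.  The layout and the 
HeaderFirstDoc class are invented for this example; this is not the 
Lucene stored-fields format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class HeaderFirstDoc {
    /** Layout: count, then (fieldId, byteLength) per field, then all values. */
    static ByteBuffer encode(int[] fieldIds, String[] values) {
        byte[][] raw = new byte[values.length][];
        int dataBytes = 0;
        for (int i = 0; i < values.length; i++) {
            raw[i] = values[i].getBytes(StandardCharsets.UTF_8);
            dataBytes += raw[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * values.length + dataBytes);
        buf.putInt(values.length);
        for (int i = 0; i < values.length; i++) {
            buf.putInt(fieldIds[i]).putInt(raw[i].length);
        }
        for (byte[] r : raw) buf.put(r);
        buf.flip();
        return buf;
    }

    /** Scans only the header, then reads the wanted value; null if absent. */
    static String readField(ByteBuffer doc, int wantedId) {
        ByteBuffer buf = doc.duplicate(); // leave the caller's position alone
        int count = buf.getInt(0);
        int offset = 4 + 8 * count;       // index of the first value byte
        for (int i = 0; i < count; i++) {
            int id = buf.getInt(4 + 8 * i);
            int len = buf.getInt(8 + 8 * i);
            if (id == wantedId) {
                byte[] out = new byte[len];
                buf.position(offset);
                buf.get(out);
                return new String(out, StandardCharsets.UTF_8);
            }
            offset += len;                // skip this value without reading it
        }
        return null;
    }
}
```

The reader still walks the header entries, but it never touches the bytes 
of the values it skips, which is where the cost is for large fields.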

I will take a crack at this sometime later and see if I can implement 
some of the ideas we have discussed.

As I see it, we have a few goals:
1. Retrieve only the fields someone wants
2. Retrieve all fields, but leave some to be lazily loaded
3. Provide SQL-like functionality (as Mark suggested) [a bit harder and 
more involved????]
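
Goals 1 and 2 could share one mechanism: a callback that is asked, per 
field name, whether to load the value now, defer it, or skip it.  The 
sketch below is only one possible shape for that, in plain Java with 
made-up names (FieldDecision, FieldChooser, StoredDoc); it is not an 
existing Lucene API, and StoredDoc merely simulates reading stored 
values so that the reads can be observed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical: what to do with each stored field while reading a document. */
enum FieldDecision { LOAD, LAZY, SKIP }

/** Hypothetical callback; the reader asks it once per stored field name. */
interface FieldChooser {
    FieldDecision decide(String fieldName);
}

/** Stand-in for a stored document: field name -> simulated read of its value. */
final class StoredDoc {
    private final Map<String, Supplier<String>> stored = new LinkedHashMap<>();
    final List<String> reads = new ArrayList<>(); // names whose values were read

    void put(String name, String value) {
        stored.put(name, () -> { reads.add(name); return value; });
    }

    /** Visits every field name, but only reads values as the chooser directs. */
    Map<String, Supplier<String>> load(FieldChooser chooser) {
        Map<String, Supplier<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<String>> e : stored.entrySet()) {
            switch (chooser.decide(e.getKey())) {
                case LOAD: {
                    String v = e.getValue().get();     // read now
                    out.put(e.getKey(), () -> v);
                    break;
                }
                case LAZY:
                    out.put(e.getKey(), e.getValue()); // read deferred until get()
                    break;
                case SKIP:
                    break;                             // value never read
            }
        }
        return out;
    }
}
```

Goal 1 is then a chooser that returns LOAD for the wanted names and SKIP 
for everything else; goal 2 returns LOAD for the small fields and LAZY 
for the large ones.  Note that the loop still iterates over every field 
name, which is the cost Yonik points out above.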

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 



