lucene-java-user mailing list archives

From Bob Arens <>
Subject Re: Adding Fields to Documents with UnStored Fields - crazy scheme?
Date Fri, 09 Jun 2006 07:47:16 GMT

On Jun 9, 2006, at 2:10 AM, Chris Hostetter wrote:

> : 2. Recreating the index from scratch will require the moving of the
> : heavens and the earth.
> :
> : My crazy idea - can we add new Documents to the index with the Fields
> : we wish to add, and duplicate file IDs? i.e. an entry for file ID Foo
> : would consist of two Documents,
> : Document X: fileID:<Foo>, contents:<unknown>
> : Document Y: fileID:<Foo>, title:<Bar>, url:<>, etc.
> :
> : It would be no problem to implement different Searcher objects to
> : look at specific Fields, we were already leaning in that direction
> : anyhow.
> you certainly could do that .. but what exactly would the point be? ..
> presumably you currently query for "contents:germany" and get back the
> fileIDs of files that contain the word germany in their contents -- if
> you add another document with the same fileID and a title field and a
> url field, and you search for "contents:germany" you're still going to
> get back the same document -- it's not going to magically have the
> other fields in it just because they have the same fileID.
That kinda would be the point - "contents:germany" would get the same  
fileIDs, but "contents:germany title:medicine" would (hopefully) give  
us a more specific query.

> I suppose you could do the search on contents, get back the fileIDs and
> *then* do another search for those fileIDs to get back the titles and
> urls ... but i can't imagine earth and the heavens are that hard to
> move that you'd want to jump through that hoop on every search.
Good point. Perhaps the better idea would be to build a separate  
index with the fields to be added, and create a MultiSearcher to  
operate over both indices.
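
For what it's worth, here is a rough sketch of what a query like "contents:germany title:medicine" has to do when the two fields live in different Documents (or different indices) that share only a fileID. As far as I understand it, Lucene matches each Document on its own, so no single Document can satisfy both clauses and the intersection on fileID has to happen in our own code. The sketch uses plain Java collections as stand-ins for the two indices. It makes no Lucene calls, and the class name, method names, and sample data are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in for two indices keyed by fileID; no Lucene classes are used.
class FileIdJoinSketch {

    // Hypothetical "contents" index: fileID -> (field name -> text).
    static final Map<String, Map<String, String>> CONTENTS_INDEX = Map.of(
            "Foo", Map.of("contents", "health policy in Germany"),
            "Baz", Map.of("contents", "tourism in Germany"));

    // Hypothetical metadata index built separately (e.g. from the BerkeleyDB).
    static final Map<String, Map<String, String>> METADATA_INDEX = Map.of(
            "Foo", Map.of("title", "Medicine today"),
            "Baz", Map.of("title", "Travel guide"));

    // Return the fileIDs whose given field contains the (lowercase) term as a whole word.
    static Set<String> search(Map<String, Map<String, String>> index,
                              String field, String term) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> doc : index.entrySet()) {
            String text = doc.getValue().get(field);
            if (text != null
                    && Arrays.asList(text.toLowerCase().split("\\s+")).contains(term)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    // Search each index separately, then keep only the fileIDs found in both.
    static Set<String> searchBoth(Map<String, Map<String, String>> contentsIndex,
                                  String contentsTerm,
                                  Map<String, Map<String, String>> metadataIndex,
                                  String titleTerm) {
        Set<String> result = search(contentsIndex, "contents", contentsTerm);
        result.retainAll(search(metadataIndex, "title", titleTerm));
        return result;
    }

    public static void main(String[] args) {
        // "contents:germany" alone matches Foo and Baz; adding "title:medicine" narrows it.
        System.out.println(
                searchBoth(CONTENTS_INDEX, "germany", METADATA_INDEX, "medicine")); // prints [Foo]
    }
}
```

In the real thing, each `search` call would be a query against one index returning fileIDs, so this is two searches plus a set intersection per request, which is the hoop Chris describes.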

> (if you're going to add these new documents with the title and url and
> all that -- why can't you add the contents at the same time ... are the
> contents stored someplace else that you no longer have access to - but
> you do have access to all the other fields???)
That is indeed the case. We have a BerkeleyDB with titles, URLs, that  
sort of thing for an on-disk precache, but a) the code written to  
actually generate the Lucene index is terrible, b) the resources used  
to generate the index are scattered at best, missing at worst, and c)  
the person who wrote the code isn't available any more. I was hoping  
to find some Brilliant Plan to get this done quickly (we're demoing  
for the World Health Organization sometime this week).
