lucene-java-user mailing list archives

From Peter Bloem
Subject Re: One (large) field shared by many documents
Date Sat, 19 May 2007 22:17:00 GMT
I'm sorry, I should have explained the intended behavior more clearly.

The basic idea (without the collection fields) is that the index holds very 
simple documents with one content field each. All I do with this index is 
run a standard search on that text field. To improve the search results, I 
also want to add the concatenation of all documents in a collection as a 
field to every single document. I then search the index using both fields, 
giving the collection field a lower weight than the document field.

As an example, say I have the documents a:"look a cat" b:"my chimpanzee 
is hairy" c:"dogs are playful" and many others. These three documents 
are grouped into one collection (of many). The term vectors for the 
documents would then be
a: {look, a, cat}
b: {my, chimpanzee, is, hairy}
c: {dogs, are, playful}
If I create a term vector for the whole collection: {look, a, cat, my, 
chimpanzee, is, hairy, dogs, are, playful} and add it to each of the 
documents as a separate field, the query "my hairy cat" scores well 
against document a, both because of the match on cat in the document 
field and because of the matches on cat and hairy in the collection 
field. Documents about the Linux command 'cat' do not have the word 
"hairy" in their collection field (because they're part of a different 
collection), and so would not get this benefit. It's essentially a 
smoothing technique, since it allows query words that aren't in the 
document to still have some effect.
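
To make the smoothing effect concrete, here is a minimal sketch in plain Java. It is not Lucene's scoring: it only counts matching query terms per field and scales the collection-field count by a weight. The class and method names, the weight of 0.5, and the Unix example document and collection are all made up for illustration. Note that this sketch does no stop-word removal, so "my" also counts as a match in the collection field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SmoothedOverlap {
    // Split whitespace-separated text into a set of terms (no stemming, no stop words).
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Count how many query terms occur in the given field.
    static int matches(List<String> query, Set<String> field) {
        int count = 0;
        for (String term : query) {
            if (field.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Document-field matches count fully; collection-field matches are
    // scaled down by collectionWeight (a hypothetical tuning parameter).
    static double score(List<String> query, Set<String> doc,
                        Set<String> collection, double collectionWeight) {
        return matches(query, doc) + collectionWeight * matches(query, collection);
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("my", "hairy", "cat");

        // Document a and its collection, taken from the example above.
        Set<String> docA = terms("look a cat");
        Set<String> animals = terms("look a cat my chimpanzee is hairy dogs are playful");

        // Invented counterexample: a document about the Unix 'cat' command.
        Set<String> unixDoc = terms("cat prints a file");
        Set<String> unixCollection = terms("cat prints a file grep searches text");

        System.out.println(score(query, docA, animals, 0.5));           // prints 2.5
        System.out.println(score(query, unixDoc, unixCollection, 0.5)); // prints 1.5
    }
}
```

Both documents match "cat" in their own field, but only document a gains from "my" and "hairy" in its collection field, so it ranks higher. In Lucene itself the down-weighting would presumably be expressed as a lower boost on the collection field rather than a hand-written sum like this.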

The problem, of course, is that storing these collection term vectors for 
each document greatly increases the size of the index and the indexing 
time. It would be a lot faster if I could somehow use a second index to 
store the collections as documents, so that I would only have to store 
one term vector per collection. (This isn't my own idea, by the way; I'm 
trying to replicate the results of other research that used this method.)
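
The data layout I have in mind can be sketched without Lucene as two in-memory maps: one from each document to its collection, and one holding a single term set per collection. This only illustrates the many-to-one relation, not how to wire it into an IndexReader; every name in it is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SharedCollectionField {
    // "Main index": one small term set per document.
    private final Map<String, Set<String>> docTerms = new HashMap<>();
    // Link from each document to the collection it belongs to.
    private final Map<String, String> docToCollection = new HashMap<>();
    // "Second index": one term set per collection, stored exactly once.
    private final Map<String, Set<String>> collectionTerms = new HashMap<>();

    void add(String docId, String collectionId, String text) {
        Set<String> terms = new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
        docTerms.put(docId, terms);
        docToCollection.put(docId, collectionId);
        // Merge the document's terms into the shared collection entry.
        collectionTerms.computeIfAbsent(collectionId, k -> new HashSet<>()).addAll(terms);
    }

    // Look up the shared collection field for a document at search time.
    Set<String> collectionFieldOf(String docId) {
        return collectionTerms.get(docToCollection.get(docId));
    }

    int storedCollectionVectors() {
        return collectionTerms.size();
    }

    public static void main(String[] args) {
        SharedCollectionField index = new SharedCollectionField();
        index.add("a", "animals", "look a cat");
        index.add("b", "animals", "my chimpanzee is hairy");
        index.add("c", "animals", "dogs are playful");

        // Three documents, but only one collection term set is stored.
        System.out.println(index.storedCollectionVectors());                // prints 1
        System.out.println(index.collectionFieldOf("a").contains("hairy")); // prints true
    }
}
```

All three documents resolve to the same stored set, so the collection text is kept once per collection instead of once per document.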

I hope this is clearer.

Erick Erickson wrote:
> This seems kind of kludgy, but that may just mean I don't understand
> your problem very well.
> What is it that you're trying to accomplish? Searching constrained
> by topic or groups?
> If you're trying to search by groups, search the archive for the
> word "facet" or "faceted search".
> Otherwise, could you describe what behavior you're after and maybe
> there'd be more ideas....
> Best
> Erick
> On 5/19/07, Peter Bloem wrote:
>> Hi,
>> I have the following problem. I'm indexing documents that belong to some
>> collection (i.e. the dataset is divided into collections, which are
>> divided into documents). These documents become my Lucene documents,
>> with some relatively small string that becomes the field I want to
>> search. However, I would also like to add to document d the
>> concatenation of all documents in d's collection as a field (mainly as a
>> smoothing technique, because documents correspond roughly to topics).
>> I'm currently doing just that, adding an extra field for the entire
>> concatenated collection to each document in that collection. Of course
>> this increases the index size and indexing time greatly (about 
>> five-fold).
>> There must be a better way to do this. My idea was to create a second
>> index where the collections are indexed as (Lucene) documents. This
>> index would have the text as a field, and a list of document IDs
>> referring back to the main index. I could then retrieve the term vector
>> for each collection from this second index for each search result from
>> the original index.
>> My question is whether this is a smart approach and, if it is, which of
>> Lucene's classes I should use for this. The best I could find was the
>> FilterIndexReader. If extending the FilterIndexReader is really the best
>> way to go, could I simply override the document(int, FieldSelector)
>> method, or is there more to it? I doubt I'm the first person who has ever
>> wanted a many-to-one relation between fields and documents, so I hope
>> there's a simpler way to go about this.
>> Thank you,
>> Peter