lucene-java-user mailing list archives

From: Peter
Subject: Re: One (large) field shared by many documents
Date: Sun, 20 May 2007 17:52:59 GMT
Thanks for your reply. This is getting me much deeper into the uncharted 
territories of Lucene, especially the area of FieldCaches, but it's also 
piqued my curiosity. Most of what I've been able to find are discussions 
by people who are already using FieldCache, rather than explanations of 
what it actually is. From what I understand, a FieldCache caches 
certain values and has methods that retrieve the information from the 
cache, or from a provided IndexReader if the cache doesn't have the 
requested value. My main question is where to get a FieldCache, and how 
to add things to it. The only publicly available one in the API seems 
to be FieldCache.DEFAULT, but you speak of multiple FieldCaches. I could 
of course borrow FieldCacheImpl by copying the file to my own package, 
but that would probably affect the license of my code (not that I care 
much about that, but it does feel like another can of worms).

Do the scores for the collection IDs get stored in FieldCache.DEFAULT 
by the searcher, or should I see to that myself? And what exactly does 
the String field parameter in the getters do? Is this a Lucene field, or 
simply a key with which to retrieve the cached values?
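For concreteness, here is how I currently picture the lookup (just a 
sketch of my assumptions; the path and the "collectionId" field name are 
placeholders, and I'm assuming the field is indexed as a single 
untokenized token per document):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    // getStrings() loads (and caches) one String per document for the
    // given field, reading from the IndexReader on a cache miss.
    IndexReader reader = IndexReader.open("/path/to/docindex");
    String[] collectionIds = FieldCache.DEFAULT.getStrings(reader, "collectionId");
    String idOfDoc5 = collectionIds[5]; // collection id of lucene doc number 5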

I'm sorry to be asking this many questions. Normally I would dig into 
the source code and try to figure this out myself, but I have a deadline 
that is approaching at a rather frightening speed. And in any case, it 
doesn't hurt to have these issues explained somewhere on the internet, 
in case somebody finds himself in the same situation.

All this business about FieldCaches has led me to think that I might be 
better off caching the collection scores for each query myself. The 
process would then look like this:
* query the collection index with the user query, and calculate the 
scores per collection (possibly using only the top n collections, if I 
get too many)
* store the collection scores in a (weak) HashMap<String, Float> (or 
maybe a TreeMap) mapping collection IDs (which are Strings) to the 
collection scores (which are Floats)
* retrieve all documents in all collections (and perhaps any documents 
that fit the query by themselves, if I ignored any collections)
* during the scoring process, score the document normally, retrieve the 
collection ID from the document, retrieve the collection score from the 
hash map and add it to the original score (possibly multiplying it by a 
scalar 0 < s < 1, to diminish the effect of the collection). As far as I 
can tell, this yields the same score as when the collection is just 
another field in the document (boosted by s). A rough sketch of the 
first two steps follows below.
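In code, I picture the first two steps roughly like this (a sketch only; 
the index path, the "collectionId" field name and the query variable are 
placeholders):

    final Map<String, Float> collectionScores = new HashMap<String, Float>();
    IndexSearcher collSearcher = new IndexSearcher("/path/to/collectionindex");
    // one collection id per lucene doc in the collection index
    final String[] collIds =
        FieldCache.DEFAULT.getStrings(collSearcher.getIndexReader(), "collectionId");
    collSearcher.search(userQuery, new HitCollector() {
        public void collect(int doc, float score) {
            collectionScores.put(collIds[doc], new Float(score));
        }
    });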

If I understand you correctly, the FieldCache would take the place of 
the HashMap. Does this approach have any significant problems compared 
to using a FieldCache?

Another point I'm unclear about is where exactly to implement the last 
step. IndexSearcher seems to call Scorer.score(HitCollector) for the 
whole set of documents, and Scorer also has a score() method for a 
single document. I guess I could extend Scorer to wrap the regular 
scorer, but that would also require me to extend Weight. I'm hoping 
there's an easier way to accomplish all this.
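If a plain HitCollector is enough, the last step might look like this 
(again only a sketch, reusing the map and field name from above; 
building a proper TopDocs out of the smoothed scores is left out):

    // collection id per document, this time in the document index
    final String[] docCollIds =
        FieldCache.DEFAULT.getStrings(docSearcher.getIndexReader(), "collectionId");
    final float s = 0.3f; // damping scalar, 0 < s < 1; the value is arbitrary
    docSearcher.search(userQuery, new HitCollector() {
        public void collect(int doc, float score) {
            Float coll = collectionScores.get(docCollIds[doc]);
            float smoothed = score + (coll == null ? 0f : s * coll.floatValue());
            // feed (doc, smoothed) into e.g. a PriorityQueue to keep the top n
        }
    });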

The two performance penalties I can see here are retrieving all 
documents from all returned collections (as pointed out by Erick), since 
it requires a whole bunch of OR clauses (one per collection ID), and 
populating the hash map with the collection IDs. The effect of both 
depends on the number of collections. Unfortunately, a closer look at 
the data tells me that the number of collections is around a hundred 
thousand. On the other hand, any reasonable query should return only 
about as many collections as it would documents from a set of 
medium-sized documents. I guess the only way to find out how bad the 
performance will be is to implement it.


Paul Elschot wrote:
> On Sunday 20 May 2007 02:49, Peter Bloem wrote:
>> Ah, now we're getting somewhere. So I run the first query on the 
>> collection index, get a set of collection id's from that. But how do I 
>> use them in the second query on the document index? It should be easy 
>> enough to retrieve all documents in the returned collections (which is 
>> what I'm after), but then I want to rank them as if they had the 
>> collection's term vector as a field. Is there some way to modify a 
>> document just prior to processing?
> One way is indeed to index the docs and collections separately.
> The trick is to use FieldCaches for your collection id's.
> The price is that these FieldCaches must be loaded initially.
> First query the collections, using a HitCollector
> to keep all their scores, keyed by your collection id, using a FieldCache.
> By the time your indexes get really large, you may
> want to keep only a maximum number of the best scoring
> collections here.
> Then query the docs, and smooth their scores during the search
> using the collection scores, using a FieldCache for the collection
> id per doc. For this, have a look at the IndexSearcher code on how
> to hook in your own smoothing HitCollector to return for example
> a TopDocs.
>> I have several thousand collections, but the number of collections 
>> matching a query should remain quite small. The collections contain 
>> about as much text as a small webpage, so the chance that one query 
>> matches huge amounts of collections is small. If this does become a 
>> problem, I could still store the document id's. My data won't change, so 
>> there's no danger of the document id's changing. The end result of the 
>> project has to look like a production system, but it doesn't have to be 
>> one. :)
> This does not sound like something that will run out of RAM for 
> a few FieldCaches.
>> I can see why using Lucene like a database is worrying. There's already 
>> the problem of referential integrity (what if you update 
>> document/collection id's), which databases do well, and Lucene doesn't 
>> do at all (as there doesn't seem to be a standard mechanism for this 
>> sort of thing).
> Relational databases also use caches for various relational keys.
> With Lucene you just have to explicitly choose them and program their use.
>> On the other hand, I don't think this technique is very
>> new. I think it's a common smoothing method in XML element retrieval
>> (smoothing an element with the contents of its ancestor elements). So
>> surely this sort of thing gets done a lot. I guess there are bound to be
>> some limits to the inverted index that require less pretty tricks like these.
> For small texts like link anchors it is easier to add them to each page as a 
> "relational attribute".
> A collection text looks more like a sum text than like a relational attribute, 
> so treating it as a separate lucene doc (lucene "entity") feels just about 
> right.
> Regards,
> Paul Elschot
>> regards,
>>  Peter
>> Erick Erickson wrote:
>>> You're right, your index will bloat considerably. In fact, I'm surprised
>>> it's only a factor of 5....
>>> The only thing that comes to mind is really a variant on your approach
>>> from your first e-mail. But I wouldn't use document IDs, because document
>>> IDs can change. So using doc IDs is fraught.
>>> So here's the variant. Go ahead and index your "collection vector",
>>> but index it with a second field that is your "collection ID". Then, add
>>> that collection ID to each document in your original index. So, you have
>>> something like
>>> a: text:{look, a, cat}  collectionID:32
>>> b: text:{my, chimpanzee, is, hairy} collectionID:32
>>> c: text:{dogs, are, playful} collectionID:32
>>> Your other index has
>>> collectionID:32 collectionVector:{look, a, cat, my, chimpanzee, is,
>>> hairy, dogs, are, playful}
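(For my own reference, indexing that layout would look roughly like the 
following; the path, analyzer and store/index options are guesses on my 
part:)

    IndexWriter writer = new IndexWriter("/path/to/docindex",
        new StandardAnalyzer(), true);
    Document a = new Document();
    a.add(new Field("text", "look a cat", Field.Store.NO, Field.Index.TOKENIZED));
    a.add(new Field("collectionID", "32", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(a);
    // ... likewise for b and c, plus one doc with a "collectionVector"
    // field in the second index
    writer.close();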
>>> Now, you essentially make two queries, one to get a set of
>>> collection IDs from your second index (that is, querying your terms
>>> against collectionVector) and using that set of collectionIDs in a
>>> query against your first index.
>>> You might be able to do some interesting things with boosts
>>> to score either query more to your liking.
>>> This will come close to doubling the size of your index, but your
>>> first approach could bloat it by an arbitrary factor depending upon
>>> how many documents were in your largest collection.....
>>> One thing to note, however, is that there is no need to have
>>> two separate physical indexes. Lucene does not require that
>>> all  documents have the same fields. So this could all be in one
>>> big happy index. As long as the fields are different in the two
>>> sets of documents, the queries won't interfere with each other. In
>>> that case, you'd have to name the "foreign key" field differently for
>>> the sets of documents, say collectionID1 and collectionID2.
>>> All that said, this approach bothers me because it's mixing
>>> some database ideas with a Lucene index. I suppose in a controlled
>>> situation where you won't be trying to do arbitrary joins it's probably
>>> a misplaced unease. But I'm leery of trying to make Lucene act
>>> like a database. But that may just be a personal problem <G>
>>> The only other consideration is "how many collections do you have?"
>>> The reason I ask is that in the worst case scenario, you'll have an
>>> OR clause for every collection ID you have. Lucene can easily handle
>>> many thousands of terms in an OR, but your search time will suffer.
>>> And you'll have to take special action (really, just call
>>> BooleanQuery.setMaxClauseCount()) if this is over 1024, or you'll get
>>> a TooManyClauses exception.
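(Sketching that for myself: the collection IDs returned by the first 
query become one big OR query against the document index, with the 
clause limit raised first. The field name and the limit are placeholders:)

    BooleanQuery.setMaxClauseCount(200000); // the default is 1024
    BooleanQuery byCollection = new BooleanQuery();
    for (String id : matchedCollectionIds) // IDs collected from the first query
        byCollection.add(new TermQuery(new Term("collectionID", id)),
                         BooleanClause.Occur.SHOULD);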
>>> Best
>>> Erick
>>> On 5/19/07, Peter Bloem <> wrote:
>>>> I'm sorry, I should have explained the intended behavior more clearly.
>>>> The basic idea (without the collection fields) is that there are very
>>>> simple documents in the index with one content field each. All I do with
>>>> this index is a standard search in this text field. To improve the
>>>> search results, I want to also add the concatenation of all documents in
>>>> a collection as a field to every single document. I then search the
>>>> index using both fields, diminishing the effect of the collection
>>>> field. This should improve the search results.
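(In query form, what I had in mind was roughly the following sketch, 
with "text" and "collection" as the two field names and an arbitrary 
damping boost:)

    Query textQ = new QueryParser("text", analyzer).parse(userInput);
    Query collQ = new QueryParser("collection", analyzer).parse(userInput);
    collQ.setBoost(0.3f); // the scalar s that diminishes the collection field
    BooleanQuery both = new BooleanQuery();
    both.add(textQ, BooleanClause.Occur.SHOULD);
    both.add(collQ, BooleanClause.Occur.SHOULD);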
>>>> As an example, say I have the documents a:"look a cat", b:"my chimpanzee
>>>> is hairy", c:"dogs are playful", and many others. These three documents
>>>> are grouped into one collection (of many). The term vectors for the
>>>> documents would then be
>>>> a: {look, a, cat}
>>>> b: {my, chimpanzee, is, hairy}
>>>> c: {dogs, are, playful}
>>>> If I create a term vector for the whole collection: {look, a, cat, my,
>>>> chimpanzee, is, hairy, dogs, are, playful} and add it to each of the
>>>> documents as a separate field, the query "my hairy cat" scores well
>>>> against document a because of the match on cat, but also because of the
>>>> match on both cat and hairy on the collection field. Documents about the
>>>> linux command 'cat' do not have the word "hairy" in their collection
>>>> field (because they're part of a different collection), and so would not
>>>> get this benefit. It's essentially a smoothing technique, since it
>>>> allows query words that aren't in the document to still have some 
>>>> effect.
>>>> The problem of course is that storing these collection term vectors for
>>>> each document greatly increases the size of the index and the indexing
>>>> time. It would be a lot faster if I could somehow use a second index to
>>>> store the collections as documents, so I would only have to store one
>>>> term vector per collection. (This isn't my own idea btw, I'm trying to
>>>> replicate the results from some other research that used this method).
>>>> I hope this is more clear,
>>>> Peter
>>>> Erick Erickson wrote:
>>>>> This seems kind of kludgy, but that may just mean I don't understand
>>>>> your problem very well.
>>>>> What is it that you're trying to accomplish? Searching constrained
>>>>> by topic or groups?
>>>>> If you're trying to search by groups, search the archive for the
>>>>> word "facet" or "faceted search".
>>>>> Otherwise, could you describe what behavior you're after and maybe
>>>>> there'd be more ideas....
>>>>> Best
>>>>> Erick
>>>>> On 5/19/07, Peter Bloem <> wrote:
>>>>>> Hi,
>>>>>> I have the following problem. I'm indexing documents that belong to
>>>>>> some collection (i.e. the dataset is divided into collections, which are
>>>>>> divided into documents). These documents become my lucene documents,
>>>>>> with some relatively small string that becomes the field I want to
>>>>>> search. However, I would also like to add to document d the
>>>>>> concatenation of all documents in d's collection as a field (mainly as
>>>>>> a smoothing technique, because documents correspond roughly to topics).
>>>>>> I'm currently doing just that, adding an extra field for the entire
>>>>>> concatenated collection to each document in that collection. Of course
>>>>>> this increases the index size and indexing time greatly (about
>>>>>> five-fold).
>>>>>> There must be a better way to do this. My idea was to create a second
>>>>>> index where the collections are indexed as (lucene) documents. This
>>>>>> index would have the text as a field, and a list of document IDs
>>>>>> referring back to the main index. I could then retrieve the term vector
>>>>>> for each collection from this second index for each search result from
>>>>>> the original index.
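(A hedged sketch of that retrieval step as I imagined it, assuming the 
collection index was built with term vectors enabled, i.e. 
Field.TermVector.YES, for a field named "text"; the path and doc number 
are placeholders:)

    IndexReader collReader = IndexReader.open("/path/to/collectionindex");
    TermFreqVector tv = collReader.getTermFreqVector(collectionDocNum, "text");
    String[] terms = tv.getTerms();        // the collection's terms
    int[] freqs = tv.getTermFrequencies(); // parallel term frequencies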
>>>>>> My question is if this is a smart approach. And if it is, which of
>>>>>> Lucene's classes should I use for this. The best I could find was
>>>>>> FilterIndexReader. If extending the FilterIndexReader is really the
>>>>>> best way to go, could I simply override the document(int, FieldSelector)
>>>>>> method, or is there more to it? I doubt I'm the first person that's
>>>>>> ever wanted a many-to-one relation between fields and documents, so I
>>>>>> assume there's a simpler way to go about this.
>>>>>> Thank you,
>>>>>> Peter
