lucene-solr-user mailing list archives

From Michael Imbeault <michael.imbea...@sympatico.ca>
Subject Re: Facet performance with heterogeneous 'facets'?
Date Tue, 19 Sep 2006 03:31:32 GMT
Yonik Seeley wrote:
> I noticed this too, and have been thinking about ways to fix it.
> The root of the problem is that lucene, like all full-text search
> engines, uses inverted indices.  It's fast and easy to get all
> documents for a particular term, but getting all terms for a set of
> documents is either not possible, or not fast (assuming many documents
> match a query).
Yeah that's what I've been thinking; the index isn't built to handle 
such searches, sadly :( It would be very nice to be able to rapidly 
search by most frequent author, journal, etc.
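
To make the asymmetry described above concrete, here is a minimal sketch of an inverted index in Python. It is not Lucene's data structure, only a toy model; the corpus and the function names docs_for_term and terms_for_doc are made up for illustration.

```python
from collections import defaultdict

# Toy corpus (hypothetical): doc id -> author term.
docs = {0: "smith", 1: "jones", 2: "smith", 3: "lee"}

# Inverted index: term -> list of doc ids containing it (the postings).
postings = defaultdict(list)
for doc_id, author in sorted(docs.items()):
    postings[author].append(doc_id)


def docs_for_term(term):
    """Fast direction: a single lookup returns the postings list."""
    return postings.get(term, [])


def terms_for_doc(doc_id):
    """Slow direction: every postings list has to be scanned."""
    return [term for term, ids in postings.items() if doc_id in ids]
```

In this model docs_for_term("smith") touches one entry, while terms_for_doc(3) walks every term in the index, which is the direction faceting needs.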
> For cases like "author", if there is only one value per document, then
> a possible fix is to use the field cache.  If there can be multiple
> occurrences, there doesn't seem to be a good way that preserves exact
> counts, except maybe if the number of documents matching a query is
> low.
>
I have one value per document (I have fields for authors, last_author 
and first_author, and I'm doing faceted search on the first and last 
author fields). How would I use the field cache to fix my problem? Also, 
would it be better to store a unique number (for each possible author) 
in an int field along with the string, and do the faceted searching on 
the int field? Would this be faster / require less memory? My guess is 
yes, and I'll test that when I have the time.
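
As a rough illustration of the field cache idea for a single-valued field, the sketch below keeps one array indexed by document id and counts facet values by walking only the matching documents. This is an assumption about the general approach, not Lucene's or Solr's actual code; first_author, doc_to_ord and facet_counts are hypothetical names and the data is invented.

```python
# Hypothetical single-valued field: list position = doc id, value = first author.
first_author = ["smith", "jones", "smith", "lee", "jones", "smith"]

# Built once and kept in memory: each unique value gets an integer ordinal,
# and each document stores only the ordinal of its value.
values = sorted(set(first_author))
ordinal_of = {value: i for i, value in enumerate(values)}
doc_to_ord = [ordinal_of[value] for value in first_author]


def facet_counts(matching_docs):
    """Count authors by visiting only the documents that matched the query."""
    counts = [0] * len(values)
    for doc_id in matching_docs:
        counts[doc_to_ord[doc_id]] += 1
    return {values[i]: c for i, c in enumerate(counts) if c}
```

The work here grows with the number of matching documents rather than the number of distinct authors. The ordinals also play the role of the per-author integer asked about above, although this sketch says nothing about whether a separate int field in the index would be faster or smaller.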

>> Just a little follow-up - I did a little more testing, and the query
>> takes 20 seconds no matter what - If there's one document in the results
>> set, or if I do a query that returns all 130000 documents.
>
> Yes, currently the same strategy is always used.
>   intersection_count(docs_matching_query, docs_matching_author1)
>   intersection_count(docs_matching_query, docs_matching_author2)
>   intersection_count(docs_matching_query, docs_matching_author3)
>   etc...
>
> Normally, the docsets will be cached, but since the number of authors
> is greater than the size of the filtercache, the effective cache hit
> rate will be 0%.
>
> -Yonik
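
The strategy quoted above can be sketched as follows. This is a toy model using Python sets, not Solr's implementation; docs_by_author and facet_by_intersection are hypothetical names, and the intersections counter is there only to show how much work is done.

```python
# Hypothetical postings: author -> set of doc ids for that author.
docs_by_author = {
    "smith": {0, 2, 5},
    "jones": {1, 4},
    "lee": {3},
}


def facet_by_intersection(docs_matching_query):
    """One intersection per author in the index, whatever the query matched."""
    counts = {}
    intersections = 0
    for author, author_docs in docs_by_author.items():
        intersections += 1
        n = len(docs_matching_query & author_docs)
        if n:
            counts[author] = n
    return counts, intersections
```

A query matching one document and a query matching all six both perform three intersections in this model, one per author. The loop is driven by the list of authors, not by the result set, which matches the observation that the query time does not depend on the number of results.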
So more memory would fix the problem? Also, I was under the impression 
that it was only searching / sorting for authors that it knows are in 
the result set... in the case of only one document (1 result), it seems 
strange that it takes the same time as for 130000 results. Shouldn't it 
just check the results, see that there's only one author, and return 
that? And in the case of 2 documents, just sort 2 authors (or 1 if 
they're the same)? I understand your answer (it does intersections), but 
I wonder why it's intersecting from the whole document set at first, and 
not docs_matching_query like you said.
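
On the memory question, the 0% hit rate mentioned in the quoted reply can be reproduced with a small simulation. It assumes the cache evicts the least recently used entry and that every facet request visits the authors in the same order; LRUCache and run_facet_requests are illustrative names, not Solr classes.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache that records hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value


def run_facet_requests(num_authors, cache_size, requests=3):
    """Each request asks for every author's doc set, always in the same order."""
    cache = LRUCache(cache_size)
    for _ in range(requests):
        for author_id in range(num_authors):
            cache.get(author_id, load=lambda a: {a})  # stand-in for a real doc set
    return cache.hits, cache.misses
```

With 5 authors and room for 4 entries, run_facet_requests(5, 4) returns (0, 15): each entry is evicted just before it is needed again. With room for all 5, run_facet_requests(5, 5) returns (10, 5). Under these assumptions a larger cache only helps once it can hold a doc set for every author, and even then each request still performs one intersection per author.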

Thanks for the support,

Michael
