lucene-solr-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Multi-Valued Faceting
Date Wed, 06 Dec 2006 20:06:37 GMT
On 12/6/06, J.J. Larrea <jjl@panix.com> wrote:
> My thought was that the simplest approach would be to subclass
> FieldCacheImpl to introduce a getMultiStringIndex method derived from
> getStringIndex, defining and returning a MultiStringIndex class
> which stores order as int[][] rather than int[]; a variant of
> SimpleFacets.getFieldCacheCounts would simply need an inner loop to
> tally each of the Document's Term indexes for that field.

I think something like that is the right approach; the only problem
is how much memory it would take up.  It may need some clever
encoding to keep it reasonable.
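
To make the idea concrete, here is a rough, self-contained sketch of the
counting loop described above.  It does not use Lucene at all: the
MultiStringIndex class here is a hypothetical stand-in for the proposed one,
and countFacets is an assumed name for the variant of getFieldCacheCounts.

```java
// Hypothetical stand-in for the proposed MultiStringIndex; not a Lucene or Solr class.
final class MultiStringIndex {
    final String[] lookup; // sorted distinct terms: ordinal -> term text
    final int[][] order;   // order[doc] holds the ordinals of every term in that doc

    MultiStringIndex(String[] lookup, int[][] order) {
        this.lookup = lookup;
        this.order = order;
    }

    // Same shape as a single-valued counting loop, plus an inner loop
    // that tallies each term ordinal of the document.
    int[] countFacets(int[] matchingDocs) {
        int[] counts = new int[lookup.length];
        for (int doc : matchingDocs) {
            for (int ord : order[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }
}
```

Each inner int[] is a separate object, which is where the memory concern
comes from.  Packing all ordinals into one int[] with a per-document offset
array is one possible encoding, mentioned here only as an assumption.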

> With multi-valuedness no longer being a useful criterion for
> automatically choosing between the filter-based and modified
> FieldCache-based mechanisms, there then would need to be an alternate
> criterion, either implicit or explicit. Does anyone have any ideas
> how best to do that?  For example, is there a way to quickly
> determine the number of distinct Term values for a field without
> enumerating to the end, so the ratio of Terms to Documents can be
> used?

I'd suggest a Solr fieldInfo cache that stored info about a field:
a) the number of unique terms in the field
b) (optionally) a sorted list by docfreq of the top terms in the field
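
A minimal sketch of what one entry in such a cache might hold, assuming the
per-term docfreqs have already been collected into a map.  FieldFacetInfo,
build and termsPerDoc are hypothetical names, not existing Solr classes or
methods, and no particular threshold for the ratio is implied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical per-field summary; the name and shape are assumptions.
final class FieldFacetInfo {
    final int uniqueTerms;       // (a) number of unique terms in the field
    final List<String> topTerms; // (b) top terms, sorted by descending docfreq

    FieldFacetInfo(int uniqueTerms, List<String> topTerms) {
        this.uniqueTerms = uniqueTerms;
        this.topTerms = topTerms;
    }

    // Builds the summary from a term -> docfreq map, keeping at most topN terms.
    static FieldFacetInfo build(Map<String, Integer> docFreqByTerm, int topN) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(docFreqByTerm.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return new FieldFacetInfo(docFreqByTerm.size(), top);
    }

    // Ratio of terms to documents, usable as an input when choosing a faceting method.
    double termsPerDoc(int maxDoc) {
        return (double) uniqueTerms / maxDoc;
    }
}
```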

> An entirely alternate approach (briefly suggested in a comment in
> SimpleFacets) for fields indexed with term vectors would be to simply
> call getTermFreqVector for each hit and store (term text, tally) in
> a HashTable, or (term text, index) in a HT which could be cached with
> tallies generated per-query in an array as they are now, in the
> latter case building a field-cache dynamically based on actual query
> results.  Does anyone have any insight on how efficient that may or
> may not be?

For queries that don't have many hits, term vectors would be fine.
I don't think they would perform well with a lot of hits, though.
There could be a different type of faceting that uses only the top "n"
results.
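
As a rough illustration of that top-"n" variant, the sketch below tallies
terms for only the first n ranked hits.  It is plain Java with no Lucene
calls: termsByDoc is an assumed stand-in for whatever a per-document term
vector lookup would return, and TermVectorFacetSketch and tally are
hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermVectorFacetSketch {
    // rankedHits: doc ids in rank order.
    // termsByDoc.get(doc): the field's terms for that doc (stand-in for a term vector).
    // Only the first n hits are visited, so the cost is bounded by n, not by total hits.
    static Map<String, Integer> tally(int[] rankedHits, List<String[]> termsByDoc, int n) {
        Map<String, Integer> counts = new HashMap<>();
        int limit = Math.min(n, rankedHits.length);
        for (int i = 0; i < limit; i++) {
            for (String term : termsByDoc.get(rankedHits[i])) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The counts then describe the top n results rather than the full result set.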

> And if I have gotten something dreadfully wrong in my understanding
> of current implementation or proposed enhancement, I would appreciate
> getting straightened out.

Sounds like you have a pretty good handle on it!

-Yonik
