lucene-java-user mailing list archives

From Plat <>
Subject Re: How to retrieve distinct field matches?
Date Fri, 16 Dec 2005 04:53:06 GMT
Ahh, interesting point, though I'm afraid it solves a different
problem than the one I had in mind. Re-reading my question, I think I
described the problem in a very obscure way. Sorry :-/.

Basically, pretend I do a regular search for "category:fiction". After
stemming/etc, this would match any Document with a category of
"fiction", "non-fiction", "fictitious", etc. All 900+ of them.

BUT as far as the results are concerned, I'm not actually interested
in each Document that was hit, nor in any field besides the
"category" field. I just want a list of the unique categories that
matched the search string of "fiction".

In this example, my ultimate goal would be a String[] of:

     { "fiction", "fictitious", "non-fiction" }

... without any costly iteration over all 900+ hit Documents' category values:

     { "fiction", "non-fiction", "fiction", "fiction", "fiction",
       "fictitious", "non-fiction", ... }

Again, I want to find a *unique* list of "category" field values that
match certain query text.
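
To make that goal concrete, here is a minimal sketch in plain Java. It makes no Lucene calls. It only models the idea that an inverted index keeps each distinct "category" term once, so matching can run over the unique terms rather than over every hit Document. The class and method names (UniqueCategorySketch, uniqueTerms, matching) are made up for illustration, and the prefix test on hyphen-separated parts is a crude stand-in for real stemming, not what any analyzer actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: a TreeSet stands in for a field's unique-term dictionary.
class UniqueCategorySketch {

    // Keep each distinct category value once, however many Documents carry it.
    static SortedSet<String> uniqueTerms(List<String> categoryValues) {
        SortedSet<String> terms = new TreeSet<>();
        for (String value : categoryValues) {
            terms.add(value.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // Assumption: a term matches when any hyphen-separated part starts with
    // the given stem. This only approximates stemming for the example.
    static List<String> matching(SortedSet<String> terms, String stem) {
        String wanted = stem.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            for (String part : term.split("-")) {
                if (part.startsWith(wanted)) {
                    matches.add(term);
                    break; // record each term at most once
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Category values as they might appear across many Documents.
        List<String> values = Arrays.asList(
                "fiction", "non-fiction", "fiction", "fiction",
                "fictitious", "non-fiction", "poetry");
        SortedSet<String> terms = uniqueTerms(values);
        System.out.println(matching(terms, "ficti"));
    }
}
```

The work here scales with the number of distinct categories (about 5,500 in my index), not with the number of hits.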

I know this can be done using a second index, but wanted to be sure
there isn't an obvious, less-hacky way first. I'm used to Lucene
surprising me with sneaky efficiencies.

Thanks for the valiant effort to make sense of me! :)


On 12/15/05, Michael D. Curtin <> wrote:
> Mr Plate wrote:
> > This puzzle has been bugging me for a while; I'm hoping there's an
> > elegant way to handle it in Lucene.
> >
> >
> > I've got an index of over 100,000 Documents. In addition to other
> > fields, each of these Documents has 0 or more "category" field values.
> > There are over 5,500 such categories (it's not a small set). Anywhere
> > from 1 to 500+ Documents could belong to a single "category". This
> > index does not get updated very often; anywhere from once a day to once
> > a month. Indexing time is currently 15-30 minutes from start to
> > finish/optimization.
> >
> >
> >
> > I'd like to provide users a way to search these "category" values. For
> > example, suppose the user searches for "fiction". They might see
> > results of: { "fiction", "non-fiction" }. However, I'd like to do this
> > search as quickly and efficiently as reasonable. For example, if there
> > are 500 Documents of category "fiction", and 400 of "non-fiction", I
> > don't want to Sort and iterate through each Hit to weed out the
> > duplicate values from my query.
> >
> > For what it's worth, I imagine only 0-20 categories would match a given
> > query.
> >
> >
> >
> > The best I can imagine is to maintain a separate Lucene index for each
> > of these category types. Each Document in this separate index would
> > probably have fields of "field_name", and "field_value", and would not
> > contain any duplicates. For example, you might see a Document of
> > field_name "category" and field_value "non-fiction". My query would hit
> > this second index instead, to perform these metadata searches.
> >
> >
> > I hope that makes sense; do you know of a more elegant way to handle
> > this type of problem?
>
> I'm guessing that each Document doesn't have a "category" field with
> multiple values in it but, instead, has a uniquely-named field for each
> category.  Would it work to change your data model to the former?  That
> is, have a Text field named "category" in each document, so that it gets
> tokenized and indexed.  Then you could do a search of the 5K category
> names (outside of Lucene, perhaps by getting the list of Terms from the
> "category" field) for the query term of interest, "fiction" in your
> example, then compose a Lucene query with the results.  Your example
> would produce a query equivalent to 'category:fiction
> category:non-fiction'.  For only 100K documents, this should be pretty fast.
>
> Good luck!
> --MDC
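
The last step of that suggestion, composing a query from the matched category names, could look roughly like the sketch below. It is plain Java string handling only. The class and method names (CategoryQuerySketch, toQuery) are made up for illustration, and it assumes the terms need no query-syntax escaping, which may not hold for every category name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: turn matched category names into query text.
class CategoryQuerySketch {

    // Join the terms into space-separated field:term clauses, e.g.
    // 'category:fiction category:non-fiction'.
    // Assumption: terms contain no characters that need escaping.
    static String toQuery(String field, List<String> terms) {
        StringBuilder query = new StringBuilder();
        for (String term : terms) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append(field).append(':').append(term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        List<String> matched = Arrays.asList("fiction", "non-fiction");
        System.out.println(toQuery("category", matched));
    }
}
```

The resulting string would then be handed to whatever query parsing the application already uses.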