lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Advice on Custom Sorting
Date Mon, 25 Sep 2006 21:01:30 GMT
You were probably right. See below....

On 9/25/06, Paul Lynch <pablolynch@yahoo.com> wrote:
>
> Thanks for the quick response Erick.
>
> "index the documents in your preferred list with a
> field and index your non-preferred docs with a field
> subid?"
>
> I considered this approach and dismissed it due to the
> actual list of preferred ids changing so frequently
> (every 10 mins...ish) but maybe I was a little hasty
> in doing so. I will investigate the overhead in
> updating all docs in the index each time my list
> refreshes. I had assumed it was too prohibitive but I
> know what they say about assumptions :)


Lots of overhead. There's really no capability of updating a doc in place.
This has been on several people's wish-list. You'd have to delete every doc
that you wanted to change and re-add it. I don't know how many documents
this would be, if just a few it'd be OK, but if many.... I was assuming (and
I *do* know what they say about assumptions <G>) that you were just adding
to your preferred doc list every few minutes, not changing existing
documents....

It really does sound like you want a filter. I was pleasantly surprised by
how quickly filters are built. You could use a CachingWrapperFilter
to have the filter kept around automatically (I guess you'd only have one
per index update) to minimize your overhead for building filters, and
perhaps warm up your cache by firing a canned query at your searcher when
you re-open your IndexReader after index update. I think you'd have to do
the two-query thing in this case. If you wanted to really get exotic, you
could build your filter when you created your index and store it in a *very
special document* and just read it in the first time you needed it. Although
I've never used it, I guess you can store binary data. From the Javadoc

Field(String name, byte[] value, Field.Store store)
          Create a stored field with binary value.

The only thing here is that the filters (probably wrapped in a
ConstantScoreQuery) lose relevance, but since you're sorting "one of several
ways", that probably doesn't matter.
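A minimal sketch of the filter idea, assuming (as in Lucene 2.0) that a filter boils down to a java.util.BitSet with one bit per document number. It deliberately makes no Lucene calls: the subIdByDoc array is a hypothetical stand-in for reading the SubId field of each document, and the class and method names are made up for illustration. It shows the preferred bit set, its complement for the second query, and a byte[] round trip of the kind a stored binary field could hold.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class PreferredFilterSketch {

    // One bit per document number: set when that document's SubId is preferred.
    static BitSet preferredBits(String[] subIdByDoc, Set<String> preferredSubIds) {
        BitSet bits = new BitSet(subIdByDoc.length);
        for (int doc = 0; doc < subIdByDoc.length; doc++) {
            if (preferredSubIds.contains(subIdByDoc[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Complement over [0, maxDoc): the documents the second query should see.
    static BitSet nonPreferredBits(BitSet preferred, int maxDoc) {
        BitSet bits = (BitSet) preferred.clone();
        bits.flip(0, maxDoc);
        return bits;
    }

    // Serialized form that could be kept in a binary stored field.
    static byte[] toBytes(BitSet bits) {
        return bits.toByteArray();
    }

    static BitSet fromBytes(byte[] bytes) {
        return BitSet.valueOf(bytes);
    }

    public static void main(String[] args) {
        String[] subIdByDoc = {"s1", "s2", "s1", "s3", "s2"};
        Set<String> preferred = new HashSet<>(Arrays.asList("s2"));

        BitSet pref = preferredBits(subIdByDoc, preferred);
        BitSet rest = nonPreferredBits(pref, subIdByDoc.length);

        System.out.println("preferred docs:     " + pref);
        System.out.println("non-preferred docs: " + rest);
        System.out.println("round trip equal:   " + fromBytes(toBytes(pref)).equals(pref));
    }
}
```

Because the bits are tied to document numbers, a set built this way is only valid for the index state it was built against, which is why it would need rebuilding after each index update or preferred-list refresh.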

Best
Erick



> Should I be able to make this workable, the beauty of
> this solution would be that I would actually only need
> to query once. If I had a field which indicates
> whether it is a preferred doc or not, "all" I will
> have to do is sort across the two fields.
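The ordering Paul describes (preferred flag first, then the user's chosen sort key) can be illustrated with a plain Java comparator. This is only a sketch of the ordering, not Lucene sorting code: the Doc class, its field names, and the use of ISO date strings for DateAdded are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "sort across the two fields".
final class TwoKeySortSketch {

    static final class Doc {
        final String subId;
        final boolean preferred;
        final String dateAdded; // ISO yyyy-MM-dd, so string order equals date order

        Doc(String subId, boolean preferred, String dateAdded) {
            this.subId = subId;
            this.preferred = preferred;
            this.dateAdded = dateAdded;
        }
    }

    // Primary key: preferred docs first (Boolean false sorts before true, hence the negation).
    // Secondary key: DateAdded ascending.
    static final Comparator<Doc> PREFERRED_THEN_DATE =
            Comparator.comparing((Doc d) -> !d.preferred)
                      .thenComparing((Doc d) -> d.dateAdded);

    // Returns "subId@dateAdded" labels in sorted order; the input list is not modified.
    static List<String> sortedLabels(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(PREFERRED_THEN_DATE);
        List<String> labels = new ArrayList<>();
        for (Doc d : copy) {
            labels.add(d.subId + "@" + d.dateAdded);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("s1", false, "2006-09-01"),
                new Doc("s2", true, "2006-09-20"),
                new Doc("s3", false, "2006-08-15"),
                new Doc("s2", true, "2006-09-05"));
        System.out.println(sortedLabels(docs));
    }
}
```

Swapping the secondary key changes the user-visible sort while the preferred docs stay on top.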
>
> Thanks again Erick. Any other suggestions are most
> welcome.
>
> Regards,
> Paul
>
> --- Erick Erickson <erickerickson@gmail.com> wrote:
>
> > OK, a really "off the top of my head" response, but
> > what the heck....
> >
> > I'm not sure you need to worry about filters. Would
> > it work for you to index
> > the documents in your preferred list with a  field
> > (called, at the limit of
> > my creativity, preferredsubid <G>) and index your
> > non-preferred docs with a
> > field subid? You'd still have to fire two queries,
> > one on subid (to pick up
> > the ones in your non-preferred list) and one on
> > preferredsubid.
> >
> > Since there's no requirement that all docs have the
> > same fields, your
> > preferred docs could have ONLY the preferredsubid
> > field and your
> > non-preferred docs ONLY the subid field. That way
> > you wouldn't have to worry
> > about picking the docs up twice.
> >
> > Merging should be simple then, just iterate over
> > however many hits you want
> > in your preferredHits object, then tack on however
> > many you want from your
> > nonPreferredHits object. All the code for the two
> > queries would be
> > identical, the only difference being whether you
> > specify "subid" or
> > "preferredsubid"......
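The merge step described above (take hits from the preferred results first, then top up from the non-preferred results) might look like the following sketch. It works on plain lists rather than Lucene Hits objects, and the class name, method name, and limit parameter are assumptions for illustration; each input list is assumed to be already sorted by the user's chosen key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical merge helper, not part of Lucene.
final class MergeHitsSketch {

    // Returns at most `limit` items: preferred hits in order, then non-preferred hits in order.
    static <T> List<T> merge(List<T> preferredHits, List<T> nonPreferredHits, int limit) {
        List<T> page = new ArrayList<>();
        for (T hit : preferredHits) {
            if (page.size() >= limit) {
                return page;
            }
            page.add(hit);
        }
        for (T hit : nonPreferredHits) {
            if (page.size() >= limit) {
                return page;
            }
            page.add(hit);
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> preferred = Arrays.asList("p1", "p2");
        List<String> others = Arrays.asList("n1", "n2", "n3");
        System.out.println(merge(preferred, others, 4));
    }
}
```

No de-duplication is needed as long as the two result sets are disjoint, which is what indexing each doc with only one of the two fields guarantees.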
> >
> > I can imagine several variations on this scenario,
> > but they depend on your
> > problem space.
> >
> > Whether this is the "best" or not, I leave as an
> > exercise for the reader.
> >
> > Best
> > Erick
> >
> > On 9/25/06, Paul Lynch <pablolynch@yahoo.com> wrote:
> > >
> > > Hi All,
> > >
> > > I have an index containing documents which all
> > have a
> > > field called SubId which holds the ID of the
> > > Subscriber that submitted the data. This field is
> > > STORED and UN_TOKENIZED
> > >
> > > When I am querying the index, the user can choose
> > a
> > > number of different ways to sort the Hits. The
> > problem
> > > is that I have a list of SubIds that should appear
> > at
> > > the top of the results list regardless of how the
> > > index is sorted. In other words, let's suppose the
> > Hits
> > > should be sorted by DateAdded, I require the Hits
> > to
> > > be sorted by DateAdded for the SubIds in my list
> > and
> > > then by DateAdded for the SubIds not in my list.
> > >
> > > From reading previous discussions on the mailing
> > list,
> > > I believe I could achieve what I need by writing
> > > custom filters i.e. Run the query first with a
> > custom
> > > filter for the SubIds in my list and then a second
> > > time with a custom filter for the SubIds not in my
> > > list and then "merge" the results.
> > >
> > > I suppose my question is simple: Is there a better
> > way
> > > to achieve this?
> > >
> > > A couple of bits of info which would influence
> > best
> > > design:
> > >
> > > - Index contains roughly 5M documents
> > > - There can be up to 10K different unique SubIds
> > > - My "Preferred SubId List" could contain any
> > > combination of the 10K SubIds including all or
> > none of
> > > them
> > > - My "Preferred SubId List" gets updated about 10
> > > times an hour so I could cache the custom filters
> > >
> > > Thanks in advance,
> > > Paul
> > >
> > >
> >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail:
> > java-user-help@lucene.apache.org
> > >
> > >
> >
>
