lucene-java-user mailing list archives

From "Daniel Einspanjer" <deinspan...@gmail.com>
Subject Re: Questions regarding Lucene query syntax
Date Sun, 06 May 2007 23:41:52 GMT
On 5/6/07, Erick Erickson <erickerickson@gmail.com> wrote:
>
> On 5/5/07, Daniel Einspanjer <deinspanjer@gmail.com> wrote:
> >
> > The query syntax reference page talks about the NOT and the - operators,
> > but it wasn't clear to me what exactly the difference is between them.
> > Could someone tell me briefly what that difference might be or point me
> > at some further docs that describe it?
>
> See the thread "Standard Parser Behavior". It has several explications
> of what the Lucene query syntax is all about. This confuses everybody,
> so I think that thread will help you a lot.
>
> Also, see http://wiki.apache.org/lucene-java/BooleanQuerySyntax



I'll look for that thread right now, and make sure I've read that wiki
page.

> > Is there a way to require a portion of a query only if there are values
> > for that field in the document?
> > e.g. If I know that I only want to match movies made between 1973 and
> > 1975, I would like to be able to say in my query that if the document
> > has a year, it must be in that range, but if the document has no year at
> > all, don't fail the document for that reason alone.
> > This is also important in the director name part.  If a document has a
> > director given, and it doesn't match what I'm searching for, that should
> > be a fail, but if the document has no director field, I don't want to
> > fail the document for that reason alone.
>
>
> You'll have to include a dummy value I think. Remember that you're
> searching for stuff with Lucene, so saying "match even if there's
> nothing there" is, er, ABnormal.
>
> I'd think about putting a dummy value in those fields you want to handle
> this way. For instance, add "matchall" to documents with no date. Then
> you'd need to add an 'OR date:matchall' clause to all the dates you query
> on. Make sure it's a value that behaves reasonably when you want to
> include all dates, or all dates before ####, or all dates after ####.
>
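
For reference, here is a rough sketch of what the dummy-value queries could look like as query parser strings. The class and method names are made up for illustration and are not part of Lucene; the code only builds strings. It assumes documents with no value were indexed with the single term "matchall" in that field, and that the field's analyzer leaves that term intact.

```java
// Hypothetical helper, not a Lucene class: it only assembles query strings
// in the standard query parser syntax.
final class OptionalFieldQuery {
    // Sentinel assumed to be indexed into documents that lack the field.
    static final String SENTINEL = "matchall";

    // Required clause: the field is in [low TO high], or the document
    // carries the sentinel because it had no value for the field.
    static String optionalRange(String field, String low, String high) {
        return "+(" + field + ":[" + low + " TO " + high + "] OR "
                + field + ":" + SENTINEL + ")";
    }

    // Same idea for a phrase match, e.g. a director name.
    // Assumes the phrase contains no double quotes that need escaping.
    static String optionalPhrase(String field, String phrase) {
        return "+(" + field + ":\"" + phrase + "\" OR "
                + field + ":" + SENTINEL + ")";
    }
}
```

With these, optionalRange("year", "1973", "1975") gives +(year:[1973 TO 1975] OR year:matchall), so a document with a year outside the range fails while a document carrying the sentinel still passes.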

Hrm.  I'll keep this idea on the cheat sheet for now. It turns out that
having a required date was causing too many mismatches for me.  Some of the
source feeds I'm matching have wildly inaccurate year fields, and when I
required that field, it would pull out some other poorly related item based
on the year and director, ignoring the right one because the year was bad.


By far the thing that is killing me the most is my trouble with trying to
provide users with scores that make sense from one item to the other.  I
tried out the SweetSpotSimilarity contrib, and I *think* it might have
helped the matching in general some, but it doesn't really give me a linear
range of scores that can be used for comparisons.  I keep scouring the web
looking for something that might explain tf, idf, and norms in terms that
I could understand, but sadly, it all just seems to be a bit over my
head right now. :/ Maybe I've just been fighting with this project for so
long my brain has turned to mush.

If I could find a way to make the scores for the queries I've mentioned in
this thread and others follow a simple linear scale based on the number of
terms matched (still affected by ^boosts, ideally), I think I'd be all set.
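
To make that concrete, this is roughly the arithmetic I have in mind, written as plain Java outside of Lucene. It is not how Lucene scores anything and the names are made up; it just shows a score equal to the boost weight of the matched terms divided by the total boost weight of the query, so it always lands between 0 and 1.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, independent of Lucene's Similarity classes.
final class LinearMatchScore {
    // boostedTerms maps each query term to its boost (1.0 if unboosted).
    // docTerms is the set of terms present in the candidate document.
    // Returns matched boost weight / total boost weight, in [0, 1].
    static double score(Map<String, Double> boostedTerms, Set<String> docTerms) {
        double total = 0.0;
        double matched = 0.0;
        for (Map.Entry<String, Double> e : boostedTerms.entrySet()) {
            total += e.getValue();
            if (docTerms.contains(e.getKey())) {
                matched += e.getValue();
            }
        }
        // An empty query matches nothing rather than dividing by zero.
        return total == 0.0 ? 0.0 : matched / total;
    }
}
```

For a query with title:jaws^2, director:spielberg and year:1975, a document matching the title and director but not the year would get (2 + 1) / (2 + 1 + 1) = 0.75, and that number means the same thing from one item to the next.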
