lucene-dev mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Local Params syntax not protecting Shingles in DisMax from Lucene query parser
Date Wed, 22 Sep 2010 18:37:36 GMT
Background:
I've been interested in some specific 3-word shingles.  The idea is that,
although we throw out stop words like "the", "how", "it", etc., some 3-word
runs that contain those words are actually potentially useful.  This is
related to my "Power Law" email a few days back.  BTW, there's a paper that
talks about this, and how phrases can act somewhat like unusual words from an
IDF perspective:
http://ciir-publications.cs.umass.edu/getpdf.php?id=184

The issue:
I really like the DisMax query parser, but of course its main design is a
bit at odds with shingles and phrases.  But I'd seen folks talk about using
the local parameters syntax.  For example, Chris had chimed in a while back
suggesting this approach:
http://www.lucidimagination.com/search/document/ea7b0b27b1b17b1c/re_replacing_fast_functionality_atsesam_no_shinglefilter_exactmatching
I've also done some other reading on the web, at Lucid, and elsewhere about
the curly-brace syntax.

But this doesn't seem to be working the way I thought, with respect to
protecting text from the first-pass Lucene parser.

I have a custom field defined for shingle_type / shingle_text, along with a
few classes.
If I run this through the analyzer:
    How does this work?
I get:
    1: how_does_this
    2: does_this_work
With the numbers being the offsets.
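
For anyone who wants to reproduce that shingling outside Solr, here is a rough
Python sketch.  It is NOT my actual analyzer chain: the function name, the
regex tokenizer, and the "_" separator handling are assumptions made only to
mimic the output shown above.

```python
import re


def shingles(text, size=3, sep="_"):
    """Return lowercased word shingles of `size` tokens joined by `sep`.

    Hypothetical helper for illustration; it only imitates the output of
    the shingle analyzer described above and is not the Solr filter itself.
    """
    # Assumed tokenization: lowercase, keep runs of word characters only.
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of `size` tokens across the token list.
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Number the shingles from 1, as in the analyzer output above.
for position, shingle in enumerate(shingles("How does this work?"), start=1):
    print(f"{position}: {shingle}")
```

Input shorter than three words yields no shingles at all in this sketch.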

Now I combine that field in a dismax query with my regular fields, which have
aggressive stop words:
Input:
    {!dismax qf="title^1.2 summary shingle_text^3.0" v="How does this work?"}
Output:
    +((DisjunctionMaxQuery((title:work^1.2 | summary:work)))~1) ()

It SHOULD also have shingle_text:how_does_this and
shingle_text:does_this_work
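
To reproduce the parsed query shown in the Output above, the request can be
built along these lines.  This is only a sketch: the host, port, and handler
path are assumptions (a default local install), and the helper name is made
up.  The debugQuery parameter asks Solr to include the parsed query in the
response, and the urlencode step keeps the braces and quotes intact on the
way in.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint for a default local Solr install; adjust for your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"

QUERY = '{!dismax qf="title^1.2 summary shingle_text^3.0" v="How does this work?"}'


def build_debug_url(query, base=SOLR_SELECT):
    """Return a select URL that asks Solr to echo the parsed query.

    Hypothetical helper for illustration only.
    """
    # rows=0 because only the debug section matters here, not the hits.
    params = {"q": query, "debugQuery": "true", "rows": "0"}
    return base + "?" + urlencode(params)


url = build_debug_url(QUERY)
print(url)

# Decode the URL again to confirm the local-params string is unchanged.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded == QUERY)
```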

From the various threads about shingles, phrases and local parameters, I
thought having the v="stuff" would bypass the Lucene parser?

Thanks for any ideas y'all might have,
Mark

PS: I realize that adding "pf" would be similar to what I'm doing, but I
don't have as much control of the run of the phrases, and I've got some
pretty specific stats in my index on the shingles.  And also, I really want
to understand the parsing process.

PPS: I also looked at the XML query parser stuff, but it's not clear (to me)
when that will be in a mainline release (vs a patch), and for various
reasons a patch is not desirable on this project.

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
