lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Issues with whitespace tokenization in QueryParser
Date Mon, 11 Jun 2012 12:42:17 GMT
Welcome John!

Basically the tricky part about this issue is how Analyzer integrates
into the parsing workflow: It is as hossman says on the issue.

You can edit the .jflex file so that _TERM_CHAR is defined differently
and regenerate, and you will see what I mean from the tests that fail.

The crux of the problem is that currently, if you have +foo bar -baz,
we split on whitespace, apply the operators, and then run the analyzer
on each portion. So you get +foo, bar, -baz, and then we analyze foo,
bar, and baz separately.

But if you just remove the whitespace tokenization, you will get +foo
bar, -baz, which is different (the + now covers "foo bar" as a whole).
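
A minimal sketch of the split-then-analyze order described above, in plain
Java with no Lucene dependency. Everything in it is a hypothetical stand-in:
analyze() only lowercases and applies one made-up synonym rule, and
parseSplitFirst() is not the real QueryParser code. It is only meant to show
why a multi-word synonym rule never matches when each whitespace-separated
chunk is analyzed on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WhitespaceSplitSketch {
    // Hypothetical multi-word synonym rule, standing in for a SynonymFilter.
    static final Map<String, String> SYNONYMS = Map.of("dress shoes", "dress_shoes");

    // Toy "analyzer": lowercase, then map the whole input through the rule.
    static String analyze(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return SYNONYMS.getOrDefault(lower, lower);
    }

    // Split on whitespace, peel off a leading + or -, analyze each chunk alone.
    static List<String> parseSplitFirst(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty query
            }
            String op = "";
            if (chunk.startsWith("+") || chunk.startsWith("-")) {
                op = chunk.substring(0, 1);
                chunk = chunk.substring(1);
            }
            clauses.add(op + analyze(chunk));
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Operators end up attached to single words.
        System.out.println(parseSplitFirst("+foo bar -baz"));
        // The analyzer only ever sees "dress" and "shoes", never "dress shoes".
        System.out.println(parseSplitFirst("dress shoes"));
        // Given the whole string, the same toy analyzer does apply the rule.
        System.out.println(analyze("dress shoes"));
    }
}
```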

So to make this kind of thing work as expected, I think the analyzer
would need to be integrated at an earlier stage, before the operators
are applied, e.g. as part of the lexing process.

NOTE: I definitely don't want to discourage you from tackling this
issue, but I think it's fair to mention that there is a workaround: if
you can preprocess your queries yourself (maybe you don't expose all of
the Lucene syntax to your users, or something like that), you can escape
the whitespace yourself, as in rain\ coat, and I think your synonyms
will work as expected.
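
One way the preprocessing workaround might look, as a rough sketch in plain
Java. The class name, the escapeKnownPhrases() helper, and the phrase list
are all hypothetical; the only thing taken from the message above is the
idea of turning "rain coat" into "rain\ coat" before the query string reaches
the parser. The matching is deliberately naive (case-sensitive, plain
substring replacement), so it assumes the application controls its query
input and does not pass arbitrary Lucene syntax through.

```java
import java.util.List;

class EscapePhrases {
    // Backslash-escape the spaces inside each known multi-word phrase so the
    // parser would no longer split that phrase on whitespace.
    static String escapeKnownPhrases(String query, List<String> phrases) {
        String out = query;
        for (String phrase : phrases) {
            String escaped = phrase.replace(" ", "\\ ");
            // String.replace is a literal replacement, not a regex.
            out = out.replace(phrase, escaped);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical phrase list, e.g. the left-hand sides of synonym rules.
        List<String> phrases = List.of("rain coat", "dress shoes");
        System.out.println(escapeKnownPhrases("red rain coat", phrases));
    }
}
```

A real version would likely need word-boundary checks and case handling so
that, for example, "train coats" is not rewritten by accident.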

On Sun, Jun 10, 2012 at 11:03 PM, John Berryman <jfberryman@gmail.com> wrote:
> According to https://issues.apache.org/jira/browse/LUCENE-2605, the Lucene
> QueryParser tokenizes on white space before giving any text to the Analyzer.
> This makes it impossible to use multi-term synonyms because the
> SynonymFilter only receives one word at a time.
>
> Resolution to this would really help with my current project. My project
> client sells clothing and accessories online. They have plenty of examples
> of compound words, e.g. "rain coat". But some of these compound words are
> really tripping them up. A prime example is that a search for "dress shoes"
> returns a list of dresses and random shoes (not necessarily dress shoes). I
> wish that I was able to synonym compound words to single tokens (e.g. "dress
> shoes => dress_shoes"), but with this whitespace tokenization issue, it's
> impossible.
>
> Has anything happened with this bug recently? For a short time I've got a
> client that would be willing to pay for this issue to be fixed if it's not
> too much of a rabbit hole. Anyone care to catch me up with what this might
> entail?



-- 
lucidimagination.com
