lucene-dev mailing list archives

From John Berryman <jfberry...@gmail.com>
Subject Issues with whitespace tokenization in QueryParser
Date Mon, 11 Jun 2012 03:03:40 GMT
According to https://issues.apache.org/jira/browse/LUCENE-2605, the Lucene
QueryParser splits the query text on whitespace before handing any of it to
the Analyzer. This makes it impossible to use multi-term synonyms, because
the SynonymFilter only ever receives one word at a time.

A resolution to this would really help with my current project. My client
sells clothing and accessories online, and their catalog has plenty of
compound terms, e.g. "rain coat". Some of these compound terms are really
tripping them up. A prime example: a search for "dress shoes" returns a
list of dresses and random shoes (not necessarily dress shoes). I would
like to use synonyms to map compound terms to single tokens (e.g.
"dress shoes => dress_shoes"), but with this whitespace tokenization issue,
that's impossible.
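
To make the problem concrete, here is a toy Python sketch of what I think is
going on. It is not Lucene code: synonym_filter, analyze, and the two parse_*
functions are hypothetical stand-ins I made up, and the only thing taken from
the issue is the idea that the query is split on whitespace before analysis.

```python
# Toy model of the problem. This is NOT Lucene code; every name is hypothetical.

# The mapping I would like to apply: "dress shoes => dress_shoes".
SYNONYMS = {("dress", "shoes"): "dress_shoes"}


def synonym_filter(tokens):
    """Replace any known multi-token phrase with its single-token synonym."""
    out, i = [], 0
    while i < len(tokens):
        for phrase, replacement in SYNONYMS.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(replacement)
                i += len(phrase)
                break
        else:
            # No phrase starts at position i, so pass the token through.
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    """Stand-in for an Analyzer: lowercase, split on whitespace, apply synonyms."""
    return synonym_filter(text.lower().split())


def parse_with_whitespace_presplit(query):
    """Assumed current behavior: each whitespace-separated chunk is analyzed alone."""
    tokens = []
    for chunk in query.split():
        tokens.extend(analyze(chunk))
    return tokens


def parse_whole_query(query):
    """Desired behavior: the analyzer sees the whole query text at once."""
    return analyze(query)


print(parse_with_whitespace_presplit("dress shoes"))  # ['dress', 'shoes']
print(parse_whole_query("dress shoes"))               # ['dress_shoes']
```

In the first case the synonym filter never sees "dress" and "shoes" together,
so the multi-term rule can never fire, which matches what I'm seeing.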

Has anything happened with this bug recently? For a short time I've got a
client who would be willing to pay for this issue to be fixed, if it's not
too much of a rabbit hole. Can anyone catch me up on what this might
entail?

-- 
LinkedIn <http://www.linkedin.com/pub/john-berryman/13/b17/864>
Twitter <http://twitter.com/#!/jnbrymn>
