lucene-solr-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1799) enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
Date Sun, 28 Feb 2010 13:42:05 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839419#action_12839419
] 

Yonik Seeley commented on SOLR-1799:
------------------------------------

Thanks Chris - this is actually similar to an approach I was thinking about recently (use
a new analysis attribute to somehow represent equivalent runs of tokens that can't
currently be represented by the linear tokenstream, and then modify getFieldQuery to "do the
right thing").

The same sort of thing is needed for synonyms - of course, that doesn't solve the full problem,
since the query parser (QP) feeds the analyzer one word at a time unless it's a quoted phrase.

> enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
> ----------------------------------------------------------------------
>
>                 Key: SOLR-1799
>                 URL: https://issues.apache.org/jira/browse/SOLR-1799
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 1.3, 1.4
>            Reporter: Chris Darroch
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: SOLR-1799.patch
>
>
> At the bottom of the WordDelimiterFilter.java code there's the following comment:
> // downsides:  if source text is "powershot" then a query of "PowerShot" won't match!
> Another serious example for us might be something like an indexed document containing
> the word "Tribeca" or "Soho", and then a user trying to search for "TriBeCa" or "SoHo".
> This issue has turned up in a couple of recent mailing list threads:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31e48@mail.gmail.com%3e
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e48e@mail.gmail.com%3e
> In the first thread I found the best explication of what my own misunderstanding was,
> and it's something I'm sure must trip up other people as well:
> {quote}
> I've misunderstood WordDelimiterFilter.  You might think that catenateAll="1" would append
> the full phrase (sans delimiters) as an OR against the query.  So "jOkersWild" would produce:
> "j (okers wild)" OR "jokerswild"
> But you thought wrong.  It's actually:
> "j (okers wild jokerswild)"
> Which is confusing and won't match...
> {quote}
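
To illustrate the behaviour described in the quote above, here is a minimal sketch in plain Java. It uses no Lucene or Solr classes: `Tok`, `byPosition` and `phraseMatches` are hypothetical names invented for this example. It assumes, as the issue description says of catenateWords="1", that the catenated token shares a position with the last generated part, and it ignores position gaps. The point is only that the catenated word ends up as an alternative inside a multi-position phrase, so a document containing just "powershot" cannot satisfy it.

```java
import java.util.ArrayList;
import java.util.List;

public class StackedTokenPhrase {
    /** Hypothetical token model: a term plus its position increment (0 = same position as previous). */
    public static final class Tok {
        public final String term;
        public final int posInc;
        public Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Groups tokens by position; each inner list holds the alternatives at one position. */
    public static List<List<String>> byPosition(List<Tok> tokens) {
        List<List<String>> positions = new ArrayList<>();
        for (Tok t : tokens) {
            // Any positive increment starts a new position (gaps are not modelled).
            if (t.posInc > 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>());
            }
            positions.get(positions.size() - 1).add(t.term);
        }
        return positions;
    }

    /** True if the document contains the phrase, allowing any alternative at each position. */
    public static boolean phraseMatches(List<List<String>> phrase, List<String> doc) {
        for (int start = 0; start + phrase.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                ok = phrase.get(i).contains(doc.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed analysis of the query "PowerShot": two parts, catenated word stacked on the last.
        List<Tok> query = List.of(
                new Tok("power", 1), new Tok("shot", 1), new Tok("powershot", 0));
        List<List<String>> phrase = byPosition(query);
        System.out.println(phrase);                                          // [[power], [shot, powershot]]
        System.out.println(phraseMatches(phrase, List.of("powershot")));     // false
        System.out.println(phraseMatches(phrase, List.of("power", "shot"))); // true
    }
}
```

The phrase spans two positions, so a single indexed token "powershot" never matches, which is the downside noted in the WordDelimiterFilter.java comment.
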
> In the second thread, Yonik Seeley gives a good explanation of why this occurs, and provides
> a suggested workaround where you duplicate your data fields and then query on one using
> generateWordParts="1" and on the other using catenateWords="1".  That works, but obviously
> requires data duplication.  In our case, we are also following what I believe is recommended
> practice and duplicating our data already into stemmed and unstemmed indexes.  To my mind,
> to further duplicate both of these fields a second time, with no difference in the indexed
> data of the additional copy, seems needlessly wasteful when the problem lies entirely in the
> query side of things.
> At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, but seems
> to work for us.  In WordDelimiterFilter, if generateWordParts="1" and catenateWords="2", then
> we move the concatenated word to overlap its position with the first generated token instead
> of the last (which is the behaviour with catenateWords="1").  We further insert a preceding
> dummy flag token with the special type "CATENATE_FIRST".
> In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the entirety
> of the getFieldQuery() code from Lucene's QueryParser.  This is ugly, I know.  This code is
> then tweaked so that in the case where the dummy flag token is seen, it creates a BooleanQuery
> with the following token (the concatenated word) as a conditional TermQuery clause, and then
> adds the generated terms in their usual MultiPhraseQuery as a second conditional clause.
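
The two steps described above (the filter emitting a flag token, and the query builder reacting to it) might look roughly like the following plain-Java sketch. This is not the code from SOLR-1799.patch and it uses no Lucene classes. Only the type name "CATENATE_FIRST" comes from the description; `Tok`, `emitCatenateFirst`, `buildQuery`, the "word" type, the empty flag term and the `OR` string notation are all assumptions made for illustration. Position overlap is not modelled, only token order.

```java
import java.util.ArrayList;
import java.util.List;

public class CatenateFirstQuerySketch {
    /** Token type named in the issue description. */
    public static final String CATENATE_FIRST = "CATENATE_FIRST";
    /** Assumed type for ordinary tokens (hypothetical). */
    public static final String WORD = "word";

    /** Hypothetical token model: a term and a type. */
    public static final class Tok {
        public final String term;
        public final String type;
        public Tok(String term, String type) { this.term = term; this.type = type; }
    }

    /** Filter side: dummy flag token first, then the catenated word, then the generated parts. */
    public static List<Tok> emitCatenateFirst(List<String> parts) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok("", CATENATE_FIRST));            // flag token; the empty term is an assumption
        out.add(new Tok(String.join("", parts), WORD));  // catenated word
        for (String p : parts) {
            out.add(new Tok(p, WORD));                   // generated word parts
        }
        return out;
    }

    /** Query side: on seeing the flag, build "catenated OR phrase-of-parts"; otherwise a plain phrase. */
    public static String buildQuery(List<Tok> tokens) {
        boolean flagged = tokens.size() >= 2 && CATENATE_FIRST.equals(tokens.get(0).type);
        if (!flagged) {
            return phrase(tokens);
        }
        String catenated = tokens.get(1).term;           // stands in for the optional TermQuery clause
        List<Tok> parts = tokens.subList(2, tokens.size());
        return parts.isEmpty() ? catenated : catenated + " OR " + phrase(parts);
    }

    /** Renders tokens as a quoted phrase, or as a bare term when there is only one. */
    private static String phrase(List<Tok> tokens) {
        List<String> terms = new ArrayList<>();
        for (Tok t : tokens) {
            terms.add(t.term);
        }
        return terms.size() == 1 ? terms.get(0) : "\"" + String.join(" ", terms) + "\"";
    }

    public static void main(String[] args) {
        List<Tok> tokens = emitCatenateFirst(List.of("power", "shot"));
        System.out.println(buildQuery(tokens)); // powershot OR "power shot"
    }
}
```

With the two clauses kept separate, a document indexed as the single token "powershot" can match the first clause, while one indexed as "power shot" can still match the phrase clause.
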
> Now I realize this patch is (a) not likely acceptable on style and elegance grounds,
> and (b) only against Solr 1.3, not trunk.  My apologies for both; after I'd spent most of
> what time I had available tracking down the source of the problem, I just needed to get
> something working quickly.  Perhaps this patch will inspire others to greatness, though, or
> at a minimum provide a starting point for those who stumble over this same issue.
> Thanks for a great application!  Cheers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

