lucene-dev mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
Date Thu, 09 Aug 2012 15:27:06 GMT
(Feel free to add these comments to the Jira I filed this morning:
https://issues.apache.org/jira/browse/SOLR-3723)

-- Jack Krupansky

-----Original Message----- 
From: Chris Hostetter
Sent: Thursday, August 09, 2012 11:22 AM
To: Lucene Dev
Subject: Re: Improve OOTB behavior: English word-splitting should default to 
autoGeneratePhraseQueries=true


: Can you honestly generalize this rule from "how to handle hyphen" to
: "if > 1 term comes out of a whitespace-separated term, it must be a
: phrase query?".

No, which is why i never said that.  what i said was "Hold on a minute and
think about what jack is pointing out here" -- instead of dismissing the
problem out of hand because you "could care less about english"

Just because you don't like Jack's suggested solution, doesn't make the
problem magically go away.  You may not care about english, but (surprise!)
lots of people do, and we should try to figure out some ways of mitigating
confusion like this for people indexing english.

Maybe this is just a matter of better documentation, but it's at least
worth *discussing* what the possible solutions are, instead of being rude
and dismissive about the fact that the OOTB behavior is currently very
unintuitive for the english language.

Off the top of my head, i can think of several ideas (some trivial some
hypothetical) that *might* improve the OOTB experience for new users, that
are at least worth *discussing* ...

1) better class level QueryParser javadocs and example schema.xml comments
about the significance of autoGeneratePhraseQueries and the tradeoffs of
changing it.

2) mention autoGeneratePhraseQueries and its trade-offs in the solr
tutorial

3) more configuration options in StandardTokenizer and
StandardTokenizerFactory about when/how tokens are split on things like
hyphen and comments about them in the example schema.xml

4) smarter logic/options in QueryParser for determining when to build a
phrase query automatically based on the character ranges
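
To make idea 4 concrete, here is a rough sketch of what a character-range check
might look like. It is plain Python, not Lucene or QueryParser code: the
function names and the list of ranges are hypothetical, and the rule itself
(never auto-generate a phrase when any sub-token contains a CJK character) is
only one possible heuristic.

```python
# Hypothetical, deliberately incomplete list of code point ranges for which
# auto-generated phrase queries are assumed to be a bad default.
CJK_RANGES = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def is_cjk(ch):
    """Return True if the character falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def should_auto_phrase(tokens):
    """Decide whether sub-tokens from one whitespace-separated chunk
    should be combined into a phrase query.

    tokens: the terms the analyzer produced for a single chunk.
    """
    if len(tokens) < 2:
        return False  # a single term never needs a phrase
    # Only build a phrase when no sub-token contains a CJK character.
    return not any(is_cjk(ch) for tok in tokens for ch in tok)
```

Under these assumptions a chunk like foo-bar, split into foo and bar, would
become a phrase, while a run of ideographs split into single characters would
keep the current behavior.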


: Even for english itself, its debatable:
:
: http://en.wikipedia.org/wiki/Hyphen#Varied_meanings

I'm not following your argument -- that URL demonstrates various
examples where {{ foo-bar }} has extremely different semantic meaning from
{{ foo bar }} ... which actually demonstrates the point I'm making:
it's highly unintuitive that a search for a hyphenated word like {{
foo-bar }} should be interpreted as "search for either of those words"
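
A toy model of the two interpretations may help. This is plain Python, not
Lucene code: the tokenizer (split on anything that is not a letter or digit)
and the search function are simplified assumptions, and the
auto_generate_phrase_queries flag only mimics the option under discussion.

```python
import re


def tokenize(text):
    """Toy analyzer: lowercase and split on any non-alphanumeric character."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def matches_any(doc, terms):
    """'Either word' semantics: at least one term occurs in the document."""
    doc_tokens = tokenize(doc)
    return any(t in doc_tokens for t in terms)


def matches_phrase(doc, terms):
    """Phrase semantics: the terms occur adjacent and in order."""
    doc_tokens = tokenize(doc)
    n = len(terms)
    return any(doc_tokens[i:i + n] == terms
               for i in range(len(doc_tokens) - n + 1))


def search(docs, query_chunk, auto_generate_phrase_queries):
    """Search with a single whitespace-separated query chunk."""
    terms = tokenize(query_chunk)
    use_phrase = auto_generate_phrase_queries and len(terms) > 1
    match = matches_phrase if use_phrase else matches_any
    return [d for d in docs if match(d, terms)]


DOCS = ["foo-bar baz", "foo only", "bar only", "foobar"]

either = search(DOCS, "foo-bar", auto_generate_phrase_queries=False)
phrase = search(DOCS, "foo-bar", auto_generate_phrase_queries=True)
```

In this model the "either word" reading returns the first three documents,
while the phrase reading returns only the one that actually contains foo-bar.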


-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org 

