lucene-solr-dev mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Different structure of standard generated query for CJK vs. Western query
Date Wed, 15 Jul 2009 01:59:36 GMT

http://people.apache.org/~hossman/#solr-dev
Please Use "solr-user@lucene" Not "solr-dev@lucene"

Your question is better suited for the solr-user@lucene mailing list ...
not the solr-dev@lucene list.  solr-dev is for discussing development of
the internals of the Solr application ... it is *not* the appropriate
place to ask questions about how to use Solr (or write Solr plugins) 
when developing your own applications.  Please resend your message to
the solr-user mailing list, where you are likely to get more/better
responses since that list also has a larger number of subscribers.


: Date: Sun, 5 Jul 2009 22:09:07 -0700
: From: Mark Bennett <mbennett@ideaeng.com>
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Different structure of standard generated query for CJK vs. Western 
:     query
: 
: (resending with ALL Asian characters removed from example, which apparently
: trips a filter)
: I'm getting phrase queries instead of implicit "OR" queries with Asian
: text.  I first noticed it with the Dismax query handler, but it also happens
: with the Standard query.
: 
: Of course Asian text is broken up into N-Gram pairs, I understand that.  But
: after analysis (via the Web UI) the 2-character "words" still have spaces in
: between them, so I'd expect similar results to an English sentence which
: also has spaces.
: 
: English: (default field title_en)
: User Query: I need help with my iPod
: Generates: title_en:i title_en:need title_en:help title_en:with title_en:my
: title_en:ipod
: 
: Japanese: (default field title_cjk)
: User Query: iPodC1C2C3C4C5C6C7...
: Generates: PhraseQuery(title_cjk:"ipod C1C2 C2C3 C3C4 C4C5 C5C6 C6C7")
: 
: The problem is the cjk phrase queries are too rigid, everything has to
: match.  Although setting phrase slop helps with proximity, I don't think you
: can tell it to not require 100% of the bigrams to be present.
: 
: What I'd like is just: title_cjk:ipod title_cjk:C1C2 title_cjk:C2C3
: title_cjk:C3C4 etc...
: 
: The only theory I have so far, from looking through the code and mailing list
: comments, is that this might have something to do with token offsets.  Though the
: start of each token is 1 past the previous one, they do overlap by 1 char
: each time.  I'm not sure that's it, nor what the logic would be.  Bumping
: the increments from 1 to 3 or 4 would make them no longer overlap, if that's
: all there is to it.
: 
: Ideally I'd like the cjk queries to be structured the same as the English
: ones.  Also it'd be better if this could be done with just schema or config
: changes, though I realize that's not as likely.
: 
: --
: Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
: Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
: 
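
To make the two query shapes in the quoted message concrete, here is a
minimal sketch in plain Python.  It is not Lucene or Solr code and does not
reproduce the CJK tokenizer; the function names (bigrams, phrase_query,
or_query) are hypothetical, and C1..C7 are the same placeholder characters
used in the message above.  It only shows how the same overlapping bigram
tokens can be arranged either as one phrase query or as independent
optional clauses.

```python
def bigrams(chars):
    """Pair each character with its successor: [C1, C2, C3] -> [C1C2, C2C3]."""
    return [a + b for a, b in zip(chars, chars[1:])]


def phrase_query(field, tokens):
    """One phrase clause: every token must match, in order."""
    return '{}:"{}"'.format(field, " ".join(tokens))


def or_query(field, tokens):
    """One clause per token: each token may match independently."""
    return " ".join("{}:{}".format(field, tok) for tok in tokens)


# Placeholder characters standing in for the Asian text removed from the message.
cjk_chars = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
tokens = ["ipod"] + bigrams(cjk_chars)

print(phrase_query("title_cjk", tokens))  # the shape currently generated
print(or_query("title_cjk", tokens))      # the shape being asked for
```

The first line printed has the form of the PhraseQuery quoted above; the
second has the form of the per-term query the message asks for.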



-Hoss

