lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3916) Consider different query and index segmentation for Japanese
Date Wed, 28 Mar 2012 18:33:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240610#comment-13240610 ]

Christian Moen commented on LUCENE-3916:
----------------------------------------

Thanks a lot, Robert.

I've added a comment about this in {{schema.xml}} as part of SOLR-3276.  I'm resolving
this issue.


                
> Consider different query and index segmentation for Japanese
> ------------------------------------------------------------
>
>                 Key: LUCENE-3916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3916
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>            Priority: Minor
>
> Kuromoji today uses search mode segmentation at both query and index time.
> The benefit of search mode segmentation is that it splits compounds such as
> 関西国際空港 (Kansai International Airport) into 関西 (Kansai), 国際 (international)
> and 空港 (airport), and keeps the compound 関西国際空港 as a synonym of 関西.
> This segmentation allows us to get a match for 空港 (airport), which is good for
> recall, and we'd still get good precision when searching for the compound
> 関西国際空港 because of IDF.
> However, if we search for the compound 関西国際空港 (Kansai International Airport),
> our query becomes (by default) an OR-query with the terms 関西 (Kansai),
> 関西国際空港 (Kansai International Airport), 国際 (international) and 空港 (airport).
> This behaviour is by design when using OR as the default operator, but it also has
> the effect of returning generic hits like 空港 (airport) when the user searches for
> something very specific like 関西国際空港 (Kansai International Airport) -- and these
> hits are also highlighted.
> This doesn't necessarily mean that ranking is flawed per se, but a user or
> application might prefer precision over recall.  In order to favour precision, we
> can consider using normal mode segmentation for queries, but retain search mode
> segmentation on the indexing side.
> Does anyone have any general opinion on this?  What would we do for other languages
> in the case of compound splitting?
> Perhaps this can be dealt with as a documentation issue with a comment in
> {{schema.xml}} while keeping the current behaviour?
> Many thanks for any input.
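
As a rough illustration of the trade-off described in the issue, the Java sketch below models OR-query matching over a two-document toy index. It does not call Lucene or Kuromoji: the token lists are hard-coded stand-ins for what the issue says search mode produces for 関西国際空港, normal mode is assumed to keep the compound as a single token, and every class, method and document name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hard-coded tokens stand in for real Kuromoji output. */
final class SegmentationModeSketch {

    static final String COMPOUND = "関西国際空港";

    // Search mode as described in the issue: the parts, plus the compound kept as a synonym.
    static List<String> searchMode(String text) {
        if (text.equals(COMPOUND)) {
            return List.of("関西", COMPOUND, "国際", "空港");
        }
        return List.of(text);
    }

    // Normal mode: assumed here to leave the text as one token.
    static List<String> normalMode(String text) {
        return List.of(text);
    }

    // OR semantics: a document is a hit if it shares at least one term with the query.
    static List<String> orQuery(Map<String, List<String>> index, List<String> queryTerms) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> doc : index.entrySet()) {
            for (String term : queryTerms) {
                if (doc.getValue().contains(term)) {
                    hits.add(doc.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // Both documents are indexed with search mode, as the issue proposes to retain.
    static Map<String, List<String>> buildIndex() {
        Map<String, List<String>> index = new LinkedHashMap<>();
        index.put("doc-compound", searchMode(COMPOUND));
        index.put("doc-airport", searchMode("空港"));
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = buildIndex();
        // Search mode at query time: the generic airport document is also a hit.
        System.out.println(orQuery(index, searchMode(COMPOUND)));
        // Normal mode at query time: only the document containing the compound is a hit.
        System.out.println(orQuery(index, normalMode(COMPOUND)));
        // A query for 空港 still finds both documents, because indexing used search mode.
        System.out.println(orQuery(index, searchMode("空港")));
    }
}
```

In this toy model, switching only the query side to normal mode removes the generic hit for the compound query, while a query for 空港 still reaches the compound document. Whether the same holds for a real index depends on the actual analyzer output and query parser settings.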

