lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3056) Introduce Japanese field type in schema.xml
Date Wed, 08 Feb 2012 15:48:03 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203678#comment-13203678 ]

Robert Muir commented on SOLR-3056:
-----------------------------------

{quote}
However, if we go down this path, we might also want to do width-normalization
for the Japanese stopset to make sure there's no confusion with that, either. I suggest that
we resolve that as a separate issue.
{quote}

Well, I think in general we could probably solve the width issue with documentation.
The reason is that supporting a lot of different 'casing' schemes (especially ones that aren't
1:1, like normalizing width of kana) in CharArrayMap/Set could become confusing and tricky.
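
To illustrate the "not 1:1" point, here is a rough sketch (not Lucene code). It assumes full NFKC normalization via java.text.Normalizer as a stand-in for width folding; a real width filter may fold only a subset of these characters. The class and method names are made up for the example. Half-width ｶﾞ is two chars (KA plus a separate voiced sound mark), but it folds to the single full-width char ガ, so a set that tried to compare "width-insensitively" could not do it char by char.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Lucene or Solr.
class KanaWidthDemo {
    // Assumption: NFKC stands in for width folding here.
    static String foldWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FF76 (half-width KA) followed by U+FF9E (half-width voiced sound mark)
        String halfWidth = "\uFF76\uFF9E";
        String folded = foldWidth(halfWidth);
        // Two input chars collapse into one output char (U+30AC, full-width GA).
        System.out.println(halfWidth.length() + " -> " + folded.length()); // prints "2 -> 1"
    }
}
```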

For example, because GreekAnalyzer's stopword list expects sigma to always be 'σ' and never
'ς' (even in final position), we document that the stopword list should also be configured
this way:
{noformat}
   * <b>NOTE:</b> The stopwords set should be pre-processed with the logic of
   * {@link GreekLowerCaseFilter} for best results.
{noformat}
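
As a minimal sketch of what such pre-processing could look like (plain Java, not the actual GreekLowerCaseFilter code): lower-case each entry and fold final sigma 'ς' to 'σ'. The class and helper names are hypothetical, and the sketch deliberately covers only lower-casing and sigma folding, so anything else the real filter does is out of scope here.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of Lucene or Solr.
class GreekStopwordPrep {
    // Lower-case the word, then fold final sigma (U+03C2) to sigma (U+03C3).
    // Folding after lower-casing also catches a capital sigma in final position.
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT).replace('\u03C2', '\u03C3');
    }

    // Normalize every entry, preserving order and dropping duplicates.
    static Set<String> normalizeAll(List<String> words) {
        Set<String> out = new LinkedHashSet<>();
        for (String w : words) {
            out.add(normalize(w));
        }
        return out;
    }

    public static void main(String[] args) {
        // Entries spelled with escapes: omega + final sigma, and TAU OMICRON UPSILON SIGMA.
        List<String> raw = List.of("\u03C9\u03C2", "\u03A4\u039F\u03A5\u03A3");
        System.out.println(normalizeAll(raw));
    }
}
```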

But I think we should also document any expectations in the example file itself, now that
we are also using them as example configurations for Solr users (who, we might expect,
would never read the javadocs of the corresponding Analyzer).

I'll redundantly add comments to the stoplists where appropriate for the other languages,
but I think it's a good way to solve the width issue too.
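
For anyone who wants to check a stoplist against such a documented expectation, something like the following could flag entries that are not width-normalized. This is only a sketch under assumptions: '#' marks a comment line, NFKC stands in for the width folding, and the folding runs before stopword removal in the analysis chain. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker, not part of Lucene or Solr.
class StoplistWidthCheck {
    // Returns the entries that would change under NFKC and so could never
    // match a token that has already been width-folded.
    static List<String> findUnnormalized(List<String> lines) {
        List<String> bad = new ArrayList<>();
        for (String line : lines) {
            String entry = line.trim();
            // Assumption: blank lines and lines starting with '#' are not entries.
            if (entry.isEmpty() || entry.startsWith("#")) {
                continue;
            }
            if (!Normalizer.isNormalized(entry, Normalizer.Form.NFKC)) {
                bad.add(entry);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "# entries must be width-normalized",
            "\u3053\u308C",   // hiragana KO RE, already normalized
            "\uFF7A\uFF9A");  // half-width katakana KO RE, not normalized
        System.out.println(findUnnormalized(lines).size()); // prints 1
    }
}
```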

                
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
>                 Key: SOLR-3056
>                 URL: https://issues.apache.org/jira/browse/SOLR-3056
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: SOLR-3056.patch, SOLR-3056_move.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch
>
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again Robert, Uwe
> and Simon). It would be very good to get a default field type defined for Japanese in {{schema.xml}}
> so we can get good Japanese out-of-the-box support in Solr.
> I've been playing with the below configuration today, which I think is a reasonable starting
> point for Japanese.  There's a lot to be said about various considerations necessary when searching
> Japanese, but perhaps a wiki page is more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and its analyzers
> need to be seen by the Solr classloader.  However, these are currently in contrib and I'm
> wondering if we should consider moving them to core to make them directly available.  If there
> are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources
> are loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type is suitable for Japanese text using morphological analysis
>      NOTE: Please copy files
>        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
>        dist/apache-solr-analysis-extras-x.y.z.jar
>      to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
>      (x.y.z refers to a version number)
>      If you would like to optimize for precision, set the default operator to AND with
>        <solrQueryParser defaultOperator="AND"/>
>      below (in this file).  Use "OR" if you would like to optimize for recall (default).
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <!-- Kuromoji Japanese morphological analyzer/tokenizer
>          Use search-mode to get a noun-decompounding effect useful for search.
>          Example:
>            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 (International) 空港 (airport)
>            so we get a match for 空港 (airport) as we would expect from a good search engine
>          Valid values for mode are:
>             normal: default segmentation
>             search: segmentation useful for search (extra compound splitting)
>           extended: search mode with unigramming of unknown words (experimental)
>          NOTE: Search mode improves segmentation for search at the expense of part-of-speech accuracy
>     -->
>     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
>     <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
>     <filter class="solr.KuromojiBaseFormFilterFactory"/>
>     <!-- Optionally remove tokens with certain parts of speech
>     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/> -->
>     <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Lower-case romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
