lucene-solr-user mailing list archives

From Micheal Cooper <cooper...@gmail.com>
Subject Re: looking for documentation on solr.JapaneseTokenizerFactory
Date Tue, 28 Jun 2016 08:03:55 GMT
Very nice. Thank you.

My non-Japanese devs had set Solr to use the CJK analyzer for indexing and the Whitespace Tokenizer for search,
which does not work at all because Japanese does not use whitespace. I was able to find settings
that seem to be working well.
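To make the mismatch concrete, a field type set up that way would look roughly like the sketch below (the class names are my reconstruction of what the vendors probably used, not copied from our schema). The index side chops Japanese into CJK bigrams, while the query side hands the whole space-free Japanese query to the whitespace tokenizer as one giant token, so nothing ever matches:

<fieldType name="text_cjk_broken" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Japanese has no spaces, so this emits the entire query as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>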

For reference for other knowledge-seekers:

I contacted the company that donated Kuromoji, the Lucene JapaneseTokenizer that Solr uses,
and they directed me to
https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Japanese
which has info for v6. The only problem I had was that JapaneseIterationMarkCharFilterFactory
does not seem to be available in v4.10, so I just removed it from the chain. Iteration marks are
an edge case, and I can look into that later.
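For anyone who wants the concrete config: what I ended up with is essentially the stock text_ja field type from that page, minus the iteration-mark char filter. A sketch (the file names like lang/stoptags_ja.txt are the standard example names, adjust to your schema):

<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Kuromoji morphological tokenizer; "search" mode also splits compound nouns -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- reduce inflected verbs/adjectives to their dictionary form -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- drop particles and other parts of speech listed in stoptags_ja.txt -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <!-- normalize full-width ASCII and half-width katakana -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <!-- normalize trailing long-vowel marks on katakana -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

A userDictionary attribute can be added to the tokenizer later, once the userdict file is sorted out.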

The other thing to be careful of is loading the library itself.
I could not reload the core because Solr could not load Kuromoji, and I found that its directory
was not being loaded in solrconfig.xml.
When I tried the usual relative-path form, it did not work; it seems to have something
to do with how the Lucene libraries are resolved. The Japanese blog I found recommended using an absolute path,
so I put that in the section of solrconfig.xml that loads library directories, and it worked.
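Concretely, that means pointing the <lib> directives in solrconfig.xml at the jar directory with an absolute path instead of the relative dir="../../.." style. The path below is only an example, adjust it to wherever the lucene-analyzers-kuromoji jar actually sits in your install:

<!-- absolute path to whichever directory holds the Kuromoji jar
     (lucene-analyzers-kuromoji-*.jar); example path only --
     the relative form did not resolve on this 4.10 setup -->
<lib dir="/opt/solr/custom-libs/japanese" regex=".*\.jar" />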

Here are some links that also helped:
https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Japanese
http://d.hatena.ne.jp/kahnn/20130828/1377645204
http://blog.flect.co.jp/labo/2012/10/solr40schemaxml-bf12.html

Micheal

On 2016/06/28, 16:10, "Alexandre Rafalovitch" <arafalov@gmail.com> wrote:

Have you seen http://discovery-grindstone.blogspot.com.au/ ? It is a
series of articles on setting up CJK search for library content.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 28 June 2016 at 10:59, Micheal Cooper <micheal.cooper@oist.jp> wrote:
> I have a vendor-supplied Solr 4.10 set up for multisite search which indexes two large
> Drupal 7 sites which have content in Japanese, English, and Undefined.
>
> The English searches are OK, but the Japanese does not work well at all. The vendors
> are in the US, so it is understandable that they cannot really test it for themselves.
>
> I am trying to fix this config before setting userdict, synonyms, stopwords, and the
> like. There is obviously a problem with the Tokenization.
>
> I have searched Google in English and Japanese and Safari Books in English, but I cannot
> find a definitive page or tutorial on setting up Solr with Kuromoji (JapaneseTokenizerFactory)
> correctly, and the official documentation is not helpful. The comments for text_ja in the
> config say "See http://wiki.apache.org/solr/JapaneseLanguageSupport for more on Japanese language
> support," but when you go there, it just says, "This page will contain various information
> on Japanese support in Lucene/Solr 3.6 & 4.0, but it currently just a filler...".
>
> Does anyone have a good source of info for setting up Solr for Japanese content?
>
> Micheal
>



