lucene-java-user mailing list archives

From Benson Margulies <>
Subject Re: Confuse with Kuromoji
Date Sun, 06 Apr 2014 14:37:27 GMT
On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <> wrote:

> Just curious, what are some of the things that people do to properly
> tokenize the queries with mixed language collections?  What do you do with
> mixed language queries?

You can either force the user to tell you the language, or ...

   you can run a language detector (these are less accurate for short
strings), or ...

     you can process the query in _all_ of the languages and OR up the
results.
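
The three options above can be sketched as a small routing step that runs before the search. This is only a sketch in plain Java, not Lucene code. The text_eng / text_spa / text_jpn field names follow the convention mentioned in the quoted message below; the class name QueryLanguageRouter, the method fieldsToSearch, the 20-character cutoff and the detector function are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: decides which per-language fields a query should hit.
final class QueryLanguageRouter {
    // Assumed field names, one per supported language (insertion order is kept).
    static final Map<String, String> FIELD_BY_LANGUAGE = new LinkedHashMap<>();
    static {
        FIELD_BY_LANGUAGE.put("eng", "text_eng");
        FIELD_BY_LANGUAGE.put("spa", "text_spa");
        FIELD_BY_LANGUAGE.put("jpn", "text_jpn");
    }

    // Arbitrary cutoff (an assumption): shorter queries are not sent to the detector.
    static final int MIN_DETECTABLE_LENGTH = 20;

    static List<String> fieldsToSearch(String queryText,
                                       Optional<String> userLanguage,
                                       Function<String, Optional<String>> detector) {
        // Option 1: the user told us the language.
        if (userLanguage.isPresent() && FIELD_BY_LANGUAGE.containsKey(userLanguage.get())) {
            return List.of(FIELD_BY_LANGUAGE.get(userLanguage.get()));
        }
        // Option 2: trust a language detector, but only on longer strings.
        if (queryText.length() >= MIN_DETECTABLE_LENGTH) {
            Optional<String> detected = detector.apply(queryText);
            if (detected.isPresent() && FIELD_BY_LANGUAGE.containsKey(detected.get())) {
                return List.of(FIELD_BY_LANGUAGE.get(detected.get()));
            }
        }
        // Option 3: search every language field; the caller ORs the clauses together.
        return new ArrayList<>(FIELD_BY_LANGUAGE.values());
    }
}
```

In Lucene itself, the returned field names could plausibly be handed to something like MultiFieldQueryParser, with a PerFieldAnalyzerWrapper supplying the analyzer for each field, though the exact wiring depends on the Lucene version in use.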

> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>> You must know what language each text is in, and use an appropriate
>> analyzer. Some people do this by using a separate field (text_eng,
>> text_spa, text_jpn). Other people put some extra information at the
>> beginning of the field, and then make an analyzer that peeks in order to
>> dispatch to the correct tokenizer.
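
Both indexing approaches described above come down to picking a language before choosing an analyzer. The sketch below, in plain Java rather than Lucene code, shows the "peek" idea: it reads an assumed marker at the start of the text and derives the per-language field name from it. The marker format (a three-letter code followed by `|`), the class name LanguageTaggedText and its methods are hypothetical; the thread does not specify any of them.

```java
import java.util.Set;

// Hypothetical value type: text plus the language peeked from its prefix.
final class LanguageTaggedText {
    // Assumed language codes, matching the text_eng / text_spa / text_jpn fields.
    static final Set<String> KNOWN_LANGUAGES = Set.of("eng", "spa", "jpn");

    final String language;
    final String text;

    private LanguageTaggedText(String language, String text) {
        this.language = language;
        this.text = text;
    }

    // Peeks at an assumed "xxx|" marker, e.g. "spa|hola mundo".
    // Falls back to defaultLanguage and leaves the text untouched if no known marker is found.
    static LanguageTaggedText peek(String raw, String defaultLanguage) {
        if (raw.length() > 3 && raw.charAt(3) == '|') {
            String code = raw.substring(0, 3);
            if (KNOWN_LANGUAGES.contains(code)) {
                return new LanguageTaggedText(code, raw.substring(4));
            }
        }
        return new LanguageTaggedText(defaultLanguage, raw);
    }

    // Field name under the separate-field-per-language convention.
    String fieldName() {
        return "text_" + language;
    }
}
```

For example, `LanguageTaggedText.peek("spa|hola mundo", "eng")` yields the language `spa`, the text `hola mundo` and the field name `text_spa`. With separate fields, Lucene's PerFieldAnalyzerWrapper is one existing way to attach a different analyzer to each field; a peeking analyzer would need custom code along these lines.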
>> On Sat, Apr 5, 2014 at 9:59 PM, <> wrote:
>>> I am pretty new to Lucene; however, I have no problem understanding
>>> what it is about.
>>> My big problem is trying to understand how Kuromoji works. I need to
>>> implement search functionality that initially supports English, Spanish
>>> and Japanese. The first two don't seem to be a big deal, as I can
>>> just use analyzers-common to index content in both languages, but
>>> Japanese has its own analyzer. I couldn't find any clues about
>>> combining analyzers, so I still don't know whether I can combine all
>>> languages under the same index (which would be ideal, as I expect mixed
>>> searches in the context of my project) or whether I have to detect the
>>> language first and then index Japanese texts separately (which would be
>>> a big disadvantage when it comes to mixed searches and future
>>> localization expansion).
>>> I found out about Lucene through Kuromoji; it would be great to find a
>>> solution that lets me use all the greatness that Lucene offers.
