lucene-java-user mailing list archives

From Herb Roitblat <>
Subject Re: Confuse with Kuromoji
Date Sun, 06 Apr 2014 15:05:10 GMT
These are familiar.  Any other approaches that people use?  I guess I'm 
hoping ...
On 4/6/2014 7:37 AM, Benson Margulies wrote:
> On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <> wrote:
>> Just curious, what are some of the things that people do to properly
>> tokenize the queries with mixed language collections?  What do you do with
>> mixed language queries?
> You can either force the user to tell you the language, or ...
> you can run a language detector (detectors are less accurate for short
> strings), or ...
> you can process it in _all_ of the languages and OR up the results.
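
The third option above (process the query in all of the languages and OR up the results) can be sketched as a query-string builder. This is a minimal sketch in plain Java, not Lucene API: the class and method names are hypothetical, the field names reuse the text_eng / text_spa / text_jpn convention mentioned elsewhere in this thread, and the user input is assumed to be already escaped for the query parser. The output uses Lucene's classic query syntax (field:(terms) clauses joined with OR), so each clause would be analyzed with whatever analyzer is configured for its field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one OR query across per-language fields.
class MultiLanguageQuery {
    // Assumed field names, one per supported language.
    static final String[] FIELDS = {"text_eng", "text_spa", "text_jpn"};

    // Assumes userQuery has already been escaped for the query parser.
    static String orAcrossFields(String userQuery) {
        List<String> clauses = new ArrayList<>();
        for (String field : FIELDS) {
            // One grouped clause per field, e.g. text_eng:(tokyo sushi)
            clauses.add(field + ":(" + userQuery + ")");
        }
        return String.join(" OR ", clauses);
    }
}
```

For example, orAcrossFields("tokyo sushi") yields
text_eng:(tokyo sushi) OR text_spa:(tokyo sushi) OR text_jpn:(tokyo sushi).
The trade-off is that every query is analyzed once per language, which may match noise from the wrong analyzer but needs no language detection on short strings.
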
>> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>>> You must know what language each text is in, and use an appropriate
>>> analyzer. Some people do this by using a separate field (text_eng,
>>> text_spa, text_jpn). Other people put some extra information at the
>>> beginning of the field, and then make an analyzer that peeks in order to
>>> dispatch to the correct tokenizer.
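
The "peek and dispatch" idea above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not Lucene API: the tag format (a three-letter code in square brackets at the start of the text, such as "[jpn]") and all class and method names are hypothetical. It only shows the routing step, that is, reading the tag and choosing the per-language field; a real setup would hand the remaining text to the analyzer for that language (Lucene's PerFieldAnalyzerWrapper is one way to bind a different analyzer to each field).

```java
// Hypothetical helper: reads a language tag and picks the matching field.
class LanguageDispatch {
    // Result of peeking at the tag: language code plus the untagged text.
    static final class Tagged {
        final String lang;
        final String body;
        Tagged(String lang, String body) {
            this.lang = lang;
            this.body = body;
        }
    }

    // Assumed tag format: "[xxx]" followed by the text, e.g. "[jpn]...".
    static Tagged peek(String raw) {
        if (raw.length() >= 5 && raw.charAt(0) == '[' && raw.charAt(4) == ']') {
            return new Tagged(raw.substring(1, 4), raw.substring(5));
        }
        throw new IllegalArgumentException("missing language tag");
    }

    // Maps a language code to the separate-field convention from the thread.
    static String fieldFor(String lang) {
        switch (lang) {
            case "eng": return "text_eng";
            case "spa": return "text_spa";
            case "jpn": return "text_jpn";
            default:
                throw new IllegalArgumentException("unsupported language: " + lang);
        }
    }
}
```

Failing loudly on a missing or unknown tag is a deliberate choice in this sketch: silently falling back to one analyzer would index text with the wrong tokenizer.
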
>>> On Sat, Apr 5, 2014 at 9:59 PM, <> wrote:
>>>> I am pretty new to Lucene, but I have no problem understanding what
>>>> it is about.
>>>> My big problem is trying to understand how Kuromoji works. I need to
>>>> implement search functionality that initially supports English, Spanish
>>>> and Japanese. The first two don't seem to be a big deal, as I can
>>>> just use analyzers-common to index content in both languages, but
>>>> Japanese has its own analyzer. I couldn't find any clues about
>>>> combining analyzers, so I still don't know whether I can combine all
>>>> languages under the same index (which would be ideal, as I expect mixed
>>>> searches in the context of my project) or whether I have to detect the
>>>> language first and then index Japanese texts separately (which would be
>>>> a big disadvantage when it comes to mixed searches and future
>>>> localization expansion).
>>>> I found out about Lucene through Kuromoji, and it would be great to find
>>>> a solution that lets me use all the greatness that Lucene offers.
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
