lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: Confuse with Kuromoji
Date Sun, 06 Apr 2014 14:37:27 GMT
On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <herb.roitblat@orcatec.com> wrote:

> Just curious, what are some of the things that people do to properly
> tokenize the queries with mixed language collections?  What do you do with
> mixed language queries?
>

You can either force the user to tell you the language, or ...

   you can run a language detector, though detectors are less accurate on
short strings such as queries, or ...

     you can analyze the query in _all_ of the languages and OR the results
together.
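
To make the third option concrete, here is a minimal sketch in plain Java, not Lucene code. It assumes one field per language (the text_eng / text_jpn naming from the quoted message below) and uses two hypothetical stand-in tokenizers in place of real analyzers. The bigram splitter in particular is only a crude placeholder; Kuromoji does real morphological segmentation. All class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Sketch: analyze one query with every language's tokenizer and OR the clauses. */
final class MultiLanguageQuery {

    /** Hypothetical stand-in for a Western-language analyzer: lower-case, split on whitespace. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Hypothetical stand-in for a Japanese analyzer: overlapping character bigrams. */
    static List<String> bigramTokens(String text) {
        String s = text.replaceAll("\\s+", "");
        List<String> tokens = new ArrayList<>();
        if (s.length() == 1) {
            tokens.add(s); // a single character cannot form a bigram
        }
        for (int i = 0; i + 1 < s.length(); i++) {
            tokens.add(s.substring(i, i + 2));
        }
        return tokens;
    }

    /**
     * Tokenizes the query once per field with that field's tokenizer and
     * joins every resulting field:token clause with OR. Pass a map with a
     * stable iteration order (for example LinkedHashMap) for stable output.
     */
    static String buildQuery(String userQuery,
                             Map<String, Function<String, List<String>>> tokenizersByField) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Function<String, List<String>>> e : tokenizersByField.entrySet()) {
            for (String token : e.getValue().apply(userQuery)) {
                clauses.add(e.getKey() + ":" + token);
            }
        }
        return String.join(" OR ", clauses);
    }
}
```

With text_eng mapped to whitespaceTokens and text_jpn mapped to bigramTokens, the query "ab cd" expands to clauses against both fields, so a document matches if it matches under either language's analysis. The cost is extra clauses per query and some noise from tokens produced by the "wrong" analyzer.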



>
> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>
>> You must know what language each text is in, and use an appropriate
>> analyzer. Some people do this by using a separate field (text_eng,
>> text_spa, text_jpn). Other people put some extra information at the
>> beginning of the field, and then make an analyzer that peeks in order to
>> dispatch to the correct tokenizer.
>>
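
The "peek and dispatch" idea above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not a Lucene Analyzer: the "lang|text" tag format, the class name, and the tokenizer functions are all hypothetical. A real implementation would have to do the same peek inside a custom Analyzer or Tokenizer.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch: read a language tag at the start of a field value and dispatch on it. */
final class PrefixDispatchingTokenizer {
    private final Map<String, Function<String, List<String>>> tokenizersByLanguage;
    private final Function<String, List<String>> fallback;

    PrefixDispatchingTokenizer(Map<String, Function<String, List<String>>> tokenizersByLanguage,
                               Function<String, List<String>> fallback) {
        this.tokenizersByLanguage = tokenizersByLanguage;
        this.fallback = fallback;
    }

    /**
     * Expects values shaped like "jpn|some text" (hypothetical format).
     * If the tag is missing or unknown, the whole value goes to the fallback.
     */
    List<String> tokenize(String fieldValue) {
        int sep = fieldValue.indexOf('|');
        if (sep > 0) {
            String language = fieldValue.substring(0, sep);
            Function<String, List<String>> tokenizer = tokenizersByLanguage.get(language);
            if (tokenizer != null) {
                // Strip the tag so it is never indexed as a token.
                return tokenizer.apply(fieldValue.substring(sep + 1));
            }
        }
        return fallback.apply(fieldValue);
    }
}
```

Compared with separate per-language fields, this keeps a single field in the index, but whoever writes the documents has to add the tag consistently, and queries still need their own language decision.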
>>
>> On Sat, Apr 5, 2014 at 9:59 PM, <j7a42e4fd7qux@softbank.ne.jp> wrote:
>>
>>> I am pretty new to Lucene, but I have no problem understanding what it
>>> is about.
>>> My big problem is understanding how Kuromoji works. I need to implement
>>> search functionality that initially supports English, Spanish, and
>>> Japanese. The first two don't seem to be a problem, as I can just use
>>> analyzers-common to index content in both languages, but Japanese has
>>> its own analyzer. I couldn't find any clues about combining analyzers,
>>> so I still don't know whether I can combine all languages under the
>>> same index (which would be ideal, as I expect mixed searches in the
>>> context of my project) or whether I have to detect the language first
>>> and then index Japanese texts separately (which would be a big
>>> disadvantage when it comes to mixed searches and future localization
>>> expansion).
>>> I found out about Lucene through Kuromoji; it would be great to find a
>>> solution that lets me use all the greatness that Lucene offers.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>
>
