lucene-java-user mailing list archives

From Olivier Binda <olivier.bi...@wanadoo.fr>
Subject Re: Confuse with Kuromoji
Date Sun, 06 Apr 2014 15:19:56 GMT
On 04/06/2014 04:37 PM, Benson Margulies wrote:
> On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <herb.roitblat@orcatec.com> wrote:
>
>> Just curious, what are some of the things that people do to properly
>> tokenize the queries with mixed language collections?  What do you do with
>> mixed language queries?
>>
> You can either force the user to tell you the language,

I let him do that

> or ...
>
>     you can run a language detector. They are less accurate for short
> strings, or ...

and when he doesn't, I use lookups and fall back (for wild queries) to a 
crude language detector to guess the field
-> see my other mail about suggesters
>       you can process it in _all_ of the languages and OR up the results.

as a last resort, I do that
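
The fallback chain described above (guess the field with a crude detector, and OR across all language fields when the guess is inconclusive) might look roughly like the sketch below. It is only an illustration under stated assumptions: the class FieldGuesser and its methods are hypothetical, the field names text_eng, text_spa and text_jpn are borrowed from the earlier message in this thread, and the "detector" only looks at Unicode scripts, so it cannot tell English from Spanish and simply searches both. The output is a plain string in the classic Lucene query syntax; no Lucene classes are used, and special characters in the user input are not escaped.

```java
import java.lang.Character.UnicodeScript;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: picks the per-language fields to search.
final class FieldGuesser {
    // Assumed field names, one per language.
    static final List<String> ALL_FIELDS = List.of("text_eng", "text_spa", "text_jpn");

    // Crude script-based guess: which fields should this query be run against?
    static List<String> guessFields(String query) {
        boolean latin = false;
        boolean japanese = false;
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HAN) {
                japanese = true;
            } else if (script == UnicodeScript.LATIN) {
                latin = true;
            }
        }
        if (japanese && !latin) {
            return List.of("text_jpn");
        }
        if (latin && !japanese) {
            // Script alone cannot separate English from Spanish: search both.
            return List.of("text_eng", "text_spa");
        }
        // Mixed scripts or nothing recognizable: last resort, search everything.
        return ALL_FIELDS;
    }

    // Builds "field1:(q) OR field2:(q) ..." for the guessed fields.
    // Note: the query text is not escaped here; real input would need escaping.
    static String orAcrossFields(String query) {
        return guessFields(query).stream()
                .map(field -> field + ":(" + query + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(orAcrossFields("search engine"));
        System.out.println(orAcrossFields("\u5F62\u614B\u7D20\u89E3\u6790"));
    }
}
```

A language chosen explicitly by the user would bypass guessFields entirely. For short queries a script check like this is about as much as can be trusted, which is why the mixed case falls through to all fields.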
>
>
>
>> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>>
>>> You must know what language each text is in, and use an appropriate
>>> analyzer. Some people do this by using a separate field (text_eng,
>>> text_spa, text_jpn). Other people put some extra information at the
>>> beginning of the field, and then make an analyzer that peeks in order to
>>> dispatch to the correct tokenizer.
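
A minimal sketch of the second approach (a marker at the beginning of the field value, and a dispatcher that peeks at it) could look like the following. Everything here is an assumption made for illustration: the "[eng]" / "[spa]" / "[jpn]" marker format, the class PrefixDispatch, and the two stand-in tokenizers, which are deliberately naive and are not Lucene analyzers. In a real index the per-language branch would hand off to the appropriate Lucene analyzer (for Japanese, Kuromoji). For the first approach, separate fields, Lucene ships PerFieldAnalyzerWrapper to assign one analyzer per field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher, not a Lucene class: picks a tokenizer from a language marker.
final class PrefixDispatch {

    // Stand-in for a Latin-script analyzer: lowercase and split on whitespace.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty()
                ? List.of()
                : Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Stand-in for a Japanese analyzer: one token per non-whitespace code point.
    // A real morphological analyzer such as Kuromoji does far more than this.
    static List<String> codePointTokens(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Assumed three-letter language codes mapped to their tokenizer.
    static final Map<String, Function<String, List<String>>> BY_LANG = Map.of(
            "eng", PrefixDispatch::whitespaceTokens,
            "spa", PrefixDispatch::whitespaceTokens,
            "jpn", PrefixDispatch::codePointTokens);

    // Peeks at a leading "[xxx]" marker, strips it, and dispatches on the code.
    static List<String> tokenize(String fieldValue) {
        if (fieldValue.length() >= 5
                && fieldValue.charAt(0) == '['
                && fieldValue.charAt(4) == ']') {
            Function<String, List<String>> tokenizer = BY_LANG.get(fieldValue.substring(1, 4));
            if (tokenizer != null) {
                return tokenizer.apply(fieldValue.substring(5));
            }
        }
        throw new IllegalArgumentException("missing or unknown language marker");
    }
}
```

The trade-off between the two approaches: separate fields keep the index self-describing and make it easy to OR across languages at query time, while the marker approach keeps a single field but requires every writer to add the marker consistently.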
>>>
>>>
>>> On Sat, Apr 5, 2014 at 9:59 PM, <j7a42e4fd7qux@softbank.ne.jp> wrote:
>>>
>>>> I am pretty new to Lucene, but I have no problem understanding what it
>>>> is about.
>>>> My big problem is understanding how Kuromoji works. I need to implement
>>>> search functionality that initially supports English, Spanish and
>>>> Japanese. The first two don't seem to be a big deal, as I can just use
>>>> analyzers-common to index content in both languages, but Japanese has
>>>> its own analyzer. I couldn't find any clues about combining analyzers,
>>>> so I still don't know whether I can combine all languages under the same
>>>> index (which would be ideal, as I expect mixed searches in the context
>>>> of my project) or whether I have to detect the language first and then
>>>> index Japanese texts separately (which would be a big disadvantage when
>>>> it comes to mixed searches and future localization expansion).
>>>> I found out about Lucene through Kuromoji; it would be great to find a
>>>> solution that lets me use all the greatness that Lucene offers.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>

