lucene-solr-user mailing list archives

From Rishi Easwaran <rishi.easwa...@aol.com>
Subject Re: Basic Multilingual search capability
Date Tue, 24 Feb 2015 04:38:04 GMT
Hi Wunder,

Yes, we do expect incoming documents to contain Chinese, Japanese, and Arabic text.

From what you have mentioned, it looks like we need to auto-detect the language of the incoming content
and tokenize/filter after that.
But I thought the ICU tokenizer already had the capability to do that (https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer):
"This tokenizer processes multilingual text and tokenizes it appropriately based on its script
attribute."
Or am I missing something?
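
To make the difference between "script" and "language" concrete, here is a minimal Python sketch. It is not how ICUTokenizer is implemented. It only splits text into runs of the same script, using the first word of each Unicode character name as a rough script label. The function names `rough_script` and `script_runs` are made up for this illustration.

```python
import unicodedata


def rough_script(ch):
    """Return a rough script label for a letter or digit, else None.

    Assumption: the first word of the Unicode character name is treated
    as the script. This is a heuristic, not the real Unicode Script property.
    """
    if not ch.isalnum():
        return None
    name = unicodedata.name(ch, "")
    first = name.split(" ", 1)[0] if name else "UNKNOWN"
    # Japanese mixes ideographs with kana, so group them under one label.
    if first in ("CJK", "HIRAGANA", "KATAKANA"):
        return "CJK"
    return first


def script_runs(text):
    """Split text into (script, run) pairs, breaking where the script changes."""
    runs = []
    current, current_script = "", None
    for ch in text:
        script = rough_script(ch)
        if script != current_script and current:
            runs.append((current_script, current))
            current = ""
        current_script = script
        if script is not None:
            current += ch
    if current:
        runs.append((current_script, current))
    return runs


if __name__ == "__main__":
    for script, run in script_runs("hello привет 検索エンジン"):
        print(script, run)
```

For the input above this yields one run each for LATIN, CYRILLIC, and CJK. The script tells you which segmentation rules to apply, but not the language: a LATIN run could be English, German, or Finnish, so the script alone cannot pick a stemmer or a stop-word list, and the Japanese run still has to be broken into words by some other means.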

Thanks,
Rishi.

 

 

-----Original Message-----
From: Walter Underwood <wunder@wunderwood.org>
To: solr-user <solr-user@lucene.apache.org>
Sent: Mon, Feb 23, 2015 11:17 pm
Subject: Re: Basic Multilingual search capability


It isn’t just complicated, it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others) do 
not separate words with spaces. You cannot even do word search without a 
language-specific, dictionary-based parser.
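
A small Python sketch of the problem, with hypothetical helper names. Whitespace splitting returns a whole Chinese sentence as a single token. Overlapping character bigrams are shown as a common dictionary-free fallback (Lucene ships a CJKBigramFilter along these lines); they are a different technique from the dictionary-based parsing described above, and they match character pairs rather than words.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a language-insensitive tokenizer might."""
    return text.split()


def char_bigrams(text):
    """Return overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Too short for a bigram: return the single character, if any.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    print(whitespace_tokens("I like search engines"))
    print(whitespace_tokens("我喜欢搜索引擎"))
    print(char_bigrams("搜索引擎"))
```

The bigrams for 搜索引擎 ("search engine") are 搜索, 索引, and 引擎. The middle one, 索引, happens to be the word for "index", so a query for "index" would match a document that only mentions search engines. That kind of false match is what a dictionary-based parser is meant to avoid.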

German is space-separated, except that many noun compounds are written as a single word.

Do you have Finnish content? Entire prepositional phrases turn into word 
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily 
inflected, you can kind of do OK with a language-insensitive approach. But it 
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, etc). 
Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran <rishi.easwaran@aol.com> wrote:

> Hi Alex,
> 
> There is no specific language list.  
> For example: the documents that need to be indexed are emails or any messages
> for a global customer base. The messages back and forth could be in any language
> or mix of languages.
> 
> I understand relevancy, stemming etc. become extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and provide
> basic search capability for any language. Ex: When the document contains hello
> or здравствуйте, the analyzer creates tokens and provides exact match search
> results.
> 
> Now it would be great if it had the capability to tokenize email addresses
> (ex: hello@aol.com - I think StandardTokenizer already does this) and filenames
> (здравствуйте.pdf), but maybe we can use filters to accomplish that.
> 
> Thanks,
> Rishi.
> 
> -----Original Message-----
> From: Alexandre Rafalovitch <arafalov@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Mon, Feb 23, 2015 5:49 pm
> Subject: Re: Basic Multilingual search capability
> 
> 
> Which languages are you expecting to deal with? Multilingual support
> is a complex issue. Even if you think you don't need much, it is
> usually a lot more complex than expected, especially around relevancy.
> 
> Regards,
>   Alex.
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
> 
> 
> On 23 February 2015 at 16:19, Rishi Easwaran <rishi.easwaran@aol.com> wrote:
>> Hi All,
>> 
>> For our use case we don't really need to do a lot of manipulation of incoming
>> text during index time. At most removal of common stop words, tokenize emails/
>> filenames etc. if possible. We get text documents from our end users, which can
>> be in any language (sometimes a combination), and we cannot determine the language
>> of the incoming text. Language detection at index time is not necessary.
>>
>> Which analyzer is recommended to achieve basic multilingual search capability
>> for a use case like this?
>> I have read a bunch of posts about using a combination of StandardTokenizer or
>> ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but I am looking for
>> ideas, suggestions, best practices.
>> 
>> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
>> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
>> https://issues.apache.org/jira/browse/SOLR-6492
>> 
>> 
>> Thanks,
>> Rishi.
>> 
> 
> 


 
