lucene-solr-user mailing list archives

From Rishi Easwaran <rishi.easwa...@aol.com>
Subject Re: Basic Multilingual search capability
Date Thu, 26 Feb 2015 20:58:53 GMT
Hi Tom,

Thanks for your inputs. 
I was planning to use a stopword filter, but will definitely make sure the stop words are
unique to each language and do not step on each other.  I think for our system even a length
limit of 50-75 should be fine; I will definitely raise that number after doing some analysis
on our input.
Just one clarification: when you say ICUFilterFactory, am I correct in thinking it's
ICUFoldingFilterFactory?
 
Thanks,
Rishi.

-----Original Message-----
From: Tom Burton-West <tburtonw@umich.edu>
To: solr-user <solr-user@lucene.apache.org>
Sent: Wed, Feb 25, 2015 4:33 pm
Subject: Re: Basic Multilingual search capability


Hi Rishi,

As others have indicated, multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another.  For
example "die" in German is a stopword but it is a content word in English.

Putting multiple languages in one index can affect word frequency
statistics, which makes relevance ranking less accurate.  So for example, for
the English query "Die Hard" the word "die" would get a low idf score
because it occurs so frequently in German.  We realize that our approach
does not produce the best results, but given the 400 languages and limited
resources, we do our best to make search "not suck" for non-English
languages.  When we have the resources, we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages that most need special processing and are relatively easy to
disambiguate from other languages.
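
The idf effect described above for "die" can be illustrated with a small
Python sketch. The toy documents and the plain log(N / df) formula are
assumptions made for illustration only; Lucene's actual similarity
implementations use their own variants of idf.

```python
import math

def idf(term, docs):
    """Plain idf = log(N / df); df is the number of docs containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else float("inf")

# Hypothetical toy documents, each represented as a set of lowercased tokens.
english_docs = [
    {"die", "hard", "is", "an", "action", "film"},
    {"the", "hero", "saves", "the", "hostages"},
    {"a", "film", "about", "a", "police", "officer"},
    {"the", "sequel", "was", "released", "later"},
]
german_docs = [
    {"die", "katze", "schläft"},
    {"die", "bibliothek", "ist", "groß"},
    {"wir", "lesen", "die", "zeitung"},
    {"die", "suche", "ist", "schwierig"},
]

# "die" is rare in the English-only index but common once German is mixed in.
idf_english_only = idf("die", english_docs)
idf_mixed = idf("die", english_docs + german_docs)

print(f"English only: {idf_english_only:.2f}, mixed index: {idf_mixed:.2f}")
```

In the mixed index "die" appears in most documents, so its idf drops and it
contributes less to the score of an English query such as "Die Hard".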


If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.
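
As a rough illustration of script detection, here is a minimal Python sketch.
It is a heuristic, not a full implementation of the Unicode Script property:
it assumes that the first word of each character's Unicode name (for example
LATIN or CYRILLIC) is a good enough stand-in for the script. The function
name dominant_script is hypothetical.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script-like name prefix among the letters of text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER ZE" -> "CYRILLIC"
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("hello"))
print(dominant_script("здравствуйте"))
```

Because this needs only a few characters to decide, it can work on short
queries where statistical language detection tends to be unreliable.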


If you have German, a filter length of 25 might be too low (because of
compounding). You might want to analyze a sample of your German text to
find a good length.
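
One possible way to run that analysis is sketched below in Python. The
regex-based tokenization and the two-compound sample text are assumptions for
illustration; a real analysis would use a representative sample of your own
content and, ideally, the same tokenizer as the index.

```python
import re

def token_lengths(text):
    """Lengths of all word tokens in text (simple \\w+ tokenization)."""
    return [len(tok) for tok in re.findall(r"\w+", text)]

def share_longer_than(lengths, limit):
    """Fraction of tokens that a length filter with this maximum would drop."""
    return sum(1 for n in lengths if n > limit) / len(lengths)

# Hypothetical sample containing two long German compounds.
sample = (
    "Die Donaudampfschifffahrtsgesellschaft und die "
    "Rechtsschutzversicherungsgesellschaften"
)

lengths = token_lengths(sample)
for limit in (25, 50, 75):
    print(limit, share_longer_than(lengths, limit))
```

Comparing the dropped share at several candidate limits shows how much
legitimate vocabulary a given maximum length would remove.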

Tom

http://www.hathitrust.org/blogs/Large-scale-Search


On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran <rishi.easwaran@aol.com>
wrote:

> Hi Alex,
>
> Thanks for the suggestions. These steps will definitely help out with our
> use case.
> Thanks for the idea about the lengthFilter to protect our system.
>
> Thanks,
> Rishi.
>
> -----Original Message-----
> From: Alexandre Rafalovitch <arafalov@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Tue, Feb 24, 2015 8:50 am
> Subject: Re: Basic Multilingual search capability
>
>
> Given the limited needs, I would probably do something like this:
>
> 1) Put a language identifier in the UpdateRequestProcessor chain
> during indexing and route out at least known problematic languages,
> such as Chinese, Japanese, Arabic into individual fields
> 2) Put everything else together into one field with ICUTokenizer,
> maybe also ICUFoldingFilter
> 3) At the very end of that joint filter, stick in LengthFilter with
> some high number, e.g. 25 characters max. This will ensure that
> super-long words from non-space languages and edge conditions do not
> break the rest of your system.
>
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 23 February 2015 at 23:14, Walter Underwood <wunder@wunderwood.org>
> wrote:
> >> I understand relevancy, stemming etc becomes extremely complicated with
> >> multilingual support, but our first goal is to be able to tokenize and
> >> provide basic search capability for any language. Ex: When the document
> >> contains hello or здравствуйте, the analyzer creates tokens and provides
> >> exact match search results.