lucene-solr-user mailing list archives

From Susheel Kumar <susheel.ku...@thedigitalgroup.net>
Subject RE: How to implement multilingual word components fields schema?
Date Fri, 05 Sep 2014 20:53:46 GMT
I agree with the approach Jack suggested: index the same source text in multiple fields, one
per language, and then do a dismax query. Would love to hear whether it works for you.

Thanks,
Susheel

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Friday, September 05, 2014 10:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to implement multilingual word components fields schema?

It comes down to how you personally want to weigh the compromises between conflicting
requirements, such as the relative cost of false positives and false negatives. Provide a few
use cases that illustrate the boundary cases you care most about - for example, field values
that have snippets in one language embedded within larger values in a different language. Also
consider whether your fields are always long or sometimes short: the former can work well for
language detection, but the latter cannot, unless all fields of a given document are always in
the same language.

Otherwise simply index the same source text in multiple fields, one for each language. You
can then do a dismax query on that set of fields.
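A minimal sketch of this multi-field approach (the field names, field type names, and the set
of languages here are illustrative, not taken from the thread; text_en, text_fr, and text_de
are the analyzer chains shipped in the Solr example schema):

```xml
<!-- One indexed field per language, all populated from the same source text. -->
<field name="content"    type="string"  indexed="false" stored="true"/>
<field name="content_en" type="text_en" indexed="true"  stored="false"/>
<field name="content_fr" type="text_fr" indexed="true"  stored="false"/>
<field name="content_de" type="text_de" indexed="true"  stored="false"/>

<!-- Copy the raw source text into every per-language field at index time. -->
<copyField source="content" dest="content_en"/>
<copyField source="content" dest="content_fr"/>
<copyField source="content" dest="content_de"/>
```

At query time, a dismax/edismax request would then search across the whole set of
per-language fields, e.g.:

/select?defType=edismax&q=Løgismose&qf=content_en content_fr content_de

so whichever language's analysis chain best matches the query term scores the document.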

-- Jack Krupansky

-----Original Message-----
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents with multilingual words whose components come from different languages, and
search queries of the same complexity. It is a worldwide online application, so users generate
content in all the possible world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.
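A language-neutral field type built on those two ICU components might look like the following
sketch (the factories are the ones shipped in the lucene/solr analysis-icu module; the field
type name text_icu is my own choice):

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware segmentation: picks word-break rules per script. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding, accent folding, and Unicode normalization in one filter. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Note this gives tokenization and folding only; stemming is exactly the part that is still
missing, as described below.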

But then it requires stemming and lemmatization.

How can one implement a schema with universal stemming/lemmatization, presumably one that
utilizes the script attribute the ICU tokenizer attaches to each token?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the schema that Basistech ships with their commercial
plugins, and it defines the tokenizer/filter language per field type, which is not a universal
solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.
