lucene-solr-user mailing list archives

From Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
Subject Re: How to implement multilingual word components fields schema?
Date Mon, 08 Sep 2014 14:31:37 GMT
In one of his talks, Trey Grainger (author of Solr in Action) touches on how CareerBuilder
handles multilingual content with payloads. It is a little more work, but I think it would
pay off.
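
I don't have their exact code at hand, but the core idea is to tag every token with a
language payload at index time. A minimal, untested sketch of such a Lucene TokenFilter;
the class name and the per-field language code are my own illustration, not CareerBuilder's
actual implementation:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Sketch: stamp every token with a fixed language-code payload.
// The code would come from whatever language detection ran on the document.
public final class LanguagePayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final BytesRef langPayload;

  public LanguagePayloadFilter(TokenStream input, String languageCode) {
    super(input);
    this.langPayload = new BytesRef(languageCode); // e.g. "en", "de"
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    payloadAtt.setPayload(langPayload);
    return true;
  }
}

At query time you can then filter or boost on the payload instead of maintaining a separate
field per language.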

On Sep 8, 2014, at 7:58 AM, Jack Krupansky <jack@basetechnology.com> wrote:

> You also need to take a stance as to whether you wish to auto-detect the language at
> query time vs. have a UI selection of language vs. attempt to perform the same query for
> each available language and then "determine" which has the best "relevancy". The latter two
> options are very sensitive to short queries. Keep in mind that auto-detection for indexing
> full documents is a different problem than auto-detection for very short queries.
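
For the auto-detect-at-query-time option, a rough sketch of the kind of routing one could
do; I am reaching for Tika's LanguageIdentifier here as one detector that ships alongside
Solr (my choice, not something Jack prescribed), and QueryLanguageRouter is a made-up name:

import org.apache.tika.language.LanguageIdentifier;

// Sketch: detect the query language up front; fall back to querying
// every configured language when detection is not confident enough.
public class QueryLanguageRouter {
  public static String detect(String queryText) {
    LanguageIdentifier identifier = new LanguageIdentifier(queryText);
    // Short queries often produce low-confidence guesses, as Jack notes.
    if (identifier.isReasonablyCertain()) {
      return identifier.getLanguage(); // ISO 639-1 code such as "en"
    }
    return null; // caller falls back to query-per-language plus a relevancy vote
  }
}
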
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Ilia Sretenskii
> Sent: Sunday, September 7, 2014 10:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to implement multilingual word components fields schema?
> 
> Thank you for the replies, guys!
> 
> Using a field-per-language approach for multilingual content is the last
> thing I would try, since my actual task is to implement search
> functionality that offers roughly the same capabilities for every known
> world language.
> The closest references are the popular web search engines, which seem to
> serve worldwide users in all their different languages and even handle
> cross-language queries as well.
> Thus, a field-per-language approach would be a sure waste of storage
> resources due to the high number of duplicates, since there are over 200
> known languages.
> I really would like to keep a single field for cross-language searchable
> text content, without splitting it into language-specific fields or
> language-specific cores.
> 
> So my current choice is to stay with just the ICUTokenizer and
> ICUFoldingFilter as they are, without any language-specific
> stemmers/lemmatizers at all for now.
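
For reference, that ICU-only chain is easy to express programmatically. A minimal fragment,
assuming a Lucene version (5.x+) where createComponents takes only the field name:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

// Script-aware tokenization plus Unicode folding; no language-specific stemming.
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new ICUTokenizer();
    TokenStream sink = new ICUFoldingFilter(source);
    return new TokenStreamComponents(source, sink);
  }
};
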
> 
> Probably I will put stop-word filters and stemmers for the most popular
> languages into that same single searchable text field, to give it a try and
> see if they work correctly as a stack.
> Does stacking language-specific filters work correctly in one field?
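
For what it's worth, here is roughly what such a serial stack looks like in code, with
English and French Snowball stemmers picked purely for illustration. The order dependency
is visible right in the chain, and it is exactly the risk: each stemmer sees tokens the
previous one has already rewritten.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.tartarus.snowball.ext.EnglishStemmer;
import org.tartarus.snowball.ext.FrenchStemmer;

Analyzer stacked = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new ICUTokenizer();
    TokenStream sink = new ICUFoldingFilter(source);
    // Filters run in series: the French stemmer only ever sees
    // tokens that the English stemmer has already rewritten.
    sink = new SnowballFilter(sink, new EnglishStemmer());
    sink = new SnowballFilter(sink, new FrenchStemmer());
    return new TokenStreamComponents(source, sink);
  }
};
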
> 
> Further development will most likely involve advanced custom analyzers
> like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU-generated
> ScriptAttribute.
> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
> https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
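
As I read the linked class, the pattern is a TokenFilter that consults the ScriptAttribute
that ICUTokenizer sets on each token and applies per-script stemming. A stripped-down,
untested sketch of that pattern (not the linked class itself; I have collapsed it to a
single Latin-script stemmer):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import com.ibm.icu.lang.UScript;
import org.tartarus.snowball.ext.EnglishStemmer;

public final class ScriptAwareStemFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final EnglishStemmer stemmer = new EnglishStemmer();

  public ScriptAwareStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    // Only stem Latin-script tokens; other scripts pass through untouched.
    if (scriptAtt.getCode() == UScript.LATIN) {
      stemmer.setCurrent(termAtt.toString());
      if (stemmer.stem()) {
        termAtt.setEmpty().append(stemmer.getCurrent());
      }
    }
    return true;
  }
}
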
> 
> So I would like to know more about those "academic papers on this issue of
> how best to deal with mixed language/mixed script queries and documents".
> Tom, could you please share them? 

