lucene-solr-user mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Indexing and searching documents in different languages
Date Tue, 09 Apr 2013 18:35:30 GMT
Hi,

Typically people try to determine the query language somehow.
Queries are short, so language identification (LID) on them is hard.
But a user profile could indicate a language, or users can simply be
asked.
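
A minimal sketch of that idea, assuming the language comes from a user
profile and that the query goes through Solr's edismax parser with a
`qf` parameter. The field names follow the `langid.fl` and
`langid.whitelist` values in the quoted config; the helper names
`query_fields` and `build_params` are hypothetical and only illustrate
the approach:

```python
from urllib.parse import urlencode

# Fields listed in langid.fl in the quoted solrconfig.xml.
BASE_FIELDS = ("title", "subtitle", "content")
# Languages listed in langid.whitelist in the quoted solrconfig.xml.
LANGUAGES = ("en", "fr", "de", "it", "es")


def query_fields(user_lang=None):
    """Return the language-mapped fields to search (hypothetical helper).

    If the user's language is known and whitelisted, only that language's
    fields are returned; otherwise the fields of every whitelisted
    language are returned.
    """
    langs = (user_lang,) if user_lang in LANGUAGES else LANGUAGES
    return [f"{field}_{lang}" for lang in langs for field in BASE_FIELDS]


def build_params(user_query, user_lang=None):
    """Build a URL-encoded edismax query string (hypothetical helper)."""
    return urlencode({
        "defType": "edismax",
        "q": user_query,
        "qf": " ".join(query_fields(user_lang)),
    })


if __name__ == "__main__":
    # Known language: search only the German fields.
    print(build_params("haus", "de"))
    # Unknown language: search the fields of all whitelisted languages.
    print(build_params("haus"))
```

The user's query string stays short either way, since the field list
lives in `qf` rather than in `q`. Documents that fell back to
`langid.fallback` would need their own fields added to the list.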

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Tue, Apr 9, 2013 at 2:32 PM,  <dev@geschan.de> wrote:
>
> Hello,
>
> I'm trying to index a large number of documents in different languages.
> I don't know the language of the document, so I'm using
> TikaLanguageIdentifierUpdateProcessorFactory to identify it.
>
> So, this is my configuration in solrconfig.xml
>
>  <updateRequestProcessorChain name="langid">
>    <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
>          <bool name="langid">true</bool>
>          <str name="langid.fl">title,subtitle,content</str>
>          <str name="langid.langField">language_s</str>
>          <str name="langid.threshold">0.3</str>
>          <str name="langid.fallback">general</str>
>          <str name="langid.whitelist">en,fr,de,it,es</str>
>          <bool name="langid.map">true</bool>
>          <bool name="langid.map.keepOrig">true</bool>
>    </processor>
>    <processor class="solr.LogUpdateProcessorFactory" />
>    <processor class="solr.RunUpdateProcessorFactory" />
>  </updateRequestProcessorChain>
>
> So, the detection works fine and I put some dynamic fields in schema.xml to
> store the results:
>   <dynamicField name="*_en"  type="text_en"    indexed="true"  stored="true" multiValued="true"/>
>   <dynamicField name="*_fr"  type="text_fr"    indexed="true"  stored="true" multiValued="true"/>
>   <dynamicField name="*_de"  type="text_de"    indexed="true"  stored="true" multiValued="true"/>
>   <dynamicField name="*_it"  type="text_it"    indexed="true"  stored="true" multiValued="true"/>
>   <dynamicField name="*_es"  type="text_es"    indexed="true"  stored="true" multiValued="true"/>
>
> My main problem now is how to search the document without knowing the
> language of the searched document.
> I don't want to have a huge querystring like
> ?q=title_en:+term+subtitle_en:+term+title_de:+term...
> I could use copyField to copy all fields into the "text" field, but "text"
> has the type text_general, so the language-specific analysis is not applied.
> I could at least use one combined field per language (like text_en,
> text_fr...), but my querystring would still get very long, and adding new
> languages would be very cumbersome.
>
> So, what can I do? Is there a better solution to index and search documents
> in many languages without knowing the language of the document and the query
> in advance?
>
> - Geschan
>
