lucene-solr-user mailing list archives

From d..@geschan.de
Subject Indexing and searching documents in different languages
Date Tue, 09 Apr 2013 18:32:26 GMT

Hello,

I'm trying to index a large number of documents in different languages.
I don't know the language of each document in advance, so I'm using
TikaLanguageIdentifierUpdateProcessorFactory to identify it.

This is my configuration in solrconfig.xml:

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
      <bool name="langid">true</bool>
      <str name="langid.fl">title,subtitle,content</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.threshold">0.3</str>
      <str name="langid.fallback">general</str>
      <str name="langid.whitelist">en,fr,de,it,es</str>
      <bool name="langid.map">true</bool>
      <bool name="langid.map.keepOrig">true</bool>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

The detection works fine, and I added some dynamic fields to
schema.xml to store the results:
   <dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>

My main problem now is how to search the documents without knowing
which language a matching document is in.
I don't want to build a huge query string like
?q=title_en:+term+subtitle_en:+term+title_de:+term...
I could use copyField to copy all fields into the "text" field, but
"text" has the type text_general, so the language-specific analysis
would be lost. I could at least use one combined field per language
(like text_en, text_fr...), but the query string would still get very
long, and adding new languages would be very cumbersome.
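
To make the per-language combined field idea concrete, here is a rough
sketch for schema.xml, shown for English only. It assumes the default
langid.map naming, so that title, subtitle and content end up as
title_en, subtitle_en and content_en. The explicit text_en declaration
and stored="false" are my assumptions, not something I have tested:

```xml
<!-- Sketch only: one combined search field per language (English shown). -->
<!-- Declared explicitly here; the name would otherwise fall under the *_en dynamic field. -->
<field name="text_en" type="text_en" indexed="true" stored="false" multiValued="true"/>

<!-- Assumes langid.map=true has produced title_en, subtitle_en and content_en. -->
<copyField source="title_en"    dest="text_en"/>
<copyField source="subtitle_en" dest="text_en"/>
<copyField source="content_en"  dest="text_en"/>
```

The same three copyField lines would have to be repeated for fr, de, it
and es, which is exactly the per-language maintenance I would like to
avoid.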

So, what can I do? Is there a better way to index and search
documents in many languages without knowing the language of the
document or the query in advance?

- Geschan

