lucene-solr-user mailing list archives

From Bess Sadler
Subject Re: Internationalization
Date Tue, 16 Jan 2007 16:48:38 GMT
Hi, Jörg.

At the Tibetan Himalayan Digital Library, we are working with XML
files whose fields might be in Tibetan, Chinese, Nepali, or English.
The relevant part of our Solr schema.xml file looks like this:

    <dynamicField name="*_eng" type="string" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_chi" type="string" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_tib" type="string" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_nep" type="string" indexed="true" stored="true" multiValued="true"/>

I run all of our XML data through an XSL transformation that puts it
into Solr-indexable form, works out which language each field is in,
and names the field accordingly, e.g., "location_eng" or
"formalname_tib". So far this is working very well for us.
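
The naming step could be sketched in XSLT 1.0 roughly as follows. This is only an illustration, not our actual stylesheet: it assumes a hypothetical input of the form <records><record>...</record></records> in which each source element carries an xml:lang attribute with an ISO 639-1 code, and it leaves out the unique key field that a real Solr document would normally also need.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Hypothetical input root: <records> containing <record> elements. -->
  <xsl:template match="/records">
    <add>
      <xsl:apply-templates select="record"/>
    </add>
  </xsl:template>

  <!-- One Solr <doc> per record; only language-tagged children are emitted. -->
  <xsl:template match="record">
    <doc>
      <xsl:apply-templates select="*[@xml:lang]"/>
    </doc>
  </xsl:template>

  <!-- Map the (assumed) xml:lang code to the suffix used by the dynamic fields. -->
  <xsl:template match="*[@xml:lang]">
    <xsl:variable name="suffix">
      <xsl:choose>
        <xsl:when test="@xml:lang = 'en'">eng</xsl:when>
        <xsl:when test="@xml:lang = 'zh'">chi</xsl:when>
        <xsl:when test="@xml:lang = 'bo'">tib</xsl:when>
        <xsl:when test="@xml:lang = 'ne'">nep</xsl:when>
      </xsl:choose>
    </xsl:variable>
    <!-- Skip elements whose language has no matching dynamic field. -->
    <xsl:if test="$suffix != ''">
      <field name="{local-name()}_{$suffix}">
        <xsl:value-of select="."/>
      </field>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

With this sketch, an input element such as <location xml:lang="en">Lhasa</location> would come out as <field name="location_eng">Lhasa</field>, which matches the *_eng dynamic field above.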

Currently, we are assigning all fields, no matter what language, to
type string, defined as

<fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>

This does exact string matching very well, but it doesn't do stop
words, stemming, or anything fancy. We are toying with the idea of a
custom Tibetan indexer to better break up the Tibetan into discrete
words, but for this particular project (because it mostly has to do
with proper names, not long passages of text) this hasn't been a
problem yet, and the above solution seems to be doing the trick.
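
For comparison, an analyzed field type for the English fields might look something like the fragment below. We are not using this; it is a sketch under assumptions. The type name text_eng and the stopwords.txt file are hypothetical, and you should check that the tokenizer and filter factories named here are available in your Solr version.

```xml
<!-- Hypothetical analyzed type: tokenizes, lowercases, drops stop words, stems. -->
<fieldtype name="text_eng" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords.txt is an assumed file in the Solr conf directory. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldtype>

<!-- The English dynamic field would then point at the analyzed type. -->
<dynamicField name="*_eng" type="text_eng" indexed="true" stored="true" multiValued="true"/>
```

The trade-off is that an analyzed field matches on individual stemmed words rather than on the whole value, which suits running text better than proper names.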

I hope this helps.

Good luck!


On Jan 16, 2007, at 10:23 AM, Jörg Pfründer wrote:

> Hello,
> is there anyone who has experience on internationalization  
> (internationalisation) with SOLR?
> How do you setup a multi language data index?  Should we use a  
> dynamic field like text_en, text_fr, text_es?
> Is there a GermanPorterFilterFactory or FrenchPorterFilterFactory?
> Thank you very much.
> Jörg Pfründer

Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904
(434) 243-2305
