lucene-solr-user mailing list archives

From sunnyfr <johanna...@gmail.com>
Subject Re: Multi-language solr1.3 what would you reckon?
Date Sun, 19 Oct 2008 19:59:01 GMT

Hi,

Just a question: I thought that if I write <analyzer> without a type
attribute (index or query), it would apply to both. Is that right?
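
To make the question concrete, here is a minimal sketch of the two forms I
mean. As far as I understand the schema loading, an <analyzer> with no type
attribute is used at both index and query time, so I would expect these two
field types to behave the same. The name text_it_explicit is made up for
this illustration.

```xml
<!-- Form 1: one analyzer, no type attribute (assumed to cover index and query) -->
<fieldtype name="text_it" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldtype>

<!-- Form 2 (hypothetical name): the same chain declared once per side -->
<fieldtype name="text_it_explicit" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldtype>
```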

thanks,


John E. McBride wrote:
> 
> In your schema you define each field as follows:
> 
> <fieldtype name="text_it" class="solr.TextField">
> <analyzer>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StandardFilterFactory"/>
> <filter class="solr.ISOLatin1AccentFilterFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
> </analyzer>
> </fieldtype>
> 
> etc
> 
> However, you have not defined the query filters - if you do not do this 
> then you will not get any matches for searches in different languages.
> 
> For example, in English, if you index the sentence "the joyful boy played 
> tennis", this would typically get stored as "joy boy play tennis" due to 
> the analysis filters. If you then made a query for "joyful" without 
> applying the same filters on the query side, you would get no matches.
> 
> You will also want to get some multilingual stop word lists from the
> Snowball website, e.g.
> http://snowball.tartarus.org/algorithms/german/stop.txt.
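> 
> As a rough sketch of how such a list could be wired into the Italian
> field type: this assumes the Snowball list has been saved as one word per
> line in a file I am calling stopwords_it.txt (a hypothetical name; I
> believe the raw Snowball files carry "|" comments that would need to be
> stripped first). Placing the stop filter before the accent filter is also
> an assumption, so that accented stop words match as they are listed.
> 
> ```xml
> <fieldtype name="text_it" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StandardFilterFactory"/>
>     <!-- stopwords_it.txt: hypothetical file, one stop word per line -->
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords_it.txt"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
>   </analyzer>
> </fieldtype>
> ```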
> 
> sunnyfr wrote:
>> What is the problem with the way I've done it?
>> Does it mean that some languages won't be handled by the search?
>> There are too many languages. The application is for video: we will
>> support around 10 languages in search, but our database holds around 25.
>> Should I create a core "text" and others like text_en, text_fr, text_es,
>> and store in "text" all the videos whose language is not handled by the
>> search engine?
>>
>> Because even on the English website, users who enter a French word such
>> as "chien" ("dog") should be able to find French videos.
>> I don't know if I'm being clear?
>>
>> And should "text" then handle all the other languages that are not
>> handled by the other cores?
>>
>> thanks
>>
>>
>> John E. McBride wrote:
>>   
>>> Well, it's this section shown below, which would change from geography 
>>> to geography.
>>> Parameterise the EnglishPorterFilterFactory and protwords.
>>>
>>> You could introduce logic in the front end which checks whether the
>>> number of results is zero and then makes a call to the English core,
>>> but does that make logical sense? Why would a search in the Italian
>>> language bring up anything in the English index?
>>>
>>> I think you need to explain your application in a little more detail.
>>>
>>>
>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <!-- in this example, we will only use synonyms at query time
>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>>             ignoreCase="true" expand="false"/>
>>>     -->
>>>     <!-- Case insensitive stop word removal.
>>>          enablePositionIncrements=true ensures that a 'gap' is left to
>>>          allow for accurate phrase queries.
>>>     -->
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt"/>
>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> sunnyfr wrote:
>>>     
>>>> Hi,
>>>>
>>>> Thanks guys for your answers, but I don't think I can use one core per
>>>> language, because, for example, if somebody connects from Italy and
>>>> there are not many Italian books, by default I would show the few
>>>> Italian books but all the English ones as well.
>>>>
>>>> Do you have an example?
>>>> I'm quite lost about it.
>>>>
>>>>
>>>> John E. McBride wrote:
>>>>   
>>>>       
>>>>> Fairly nebulous requirements, but I recently was involved in a 
>>>>> multilingual search platform.
>>>>>
>>>>> The approach, translated to Solr 1.3, would be to use multicore - one
>>>>> core per geography.  Then a schema.xml per core, each with a different
>>>>> language in the Porter algorithm, stopwords etc - taken from Snowball.
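>>>>>
>>>>> A minimal sketch of what the multicore layout could look like in
>>>>> solr.xml, assuming one instance directory per core, each holding its
>>>>> own conf/schema.xml. The core names en, de and it are only examples,
>>>>> not taken from any real setup.
>>>>>
>>>>> ```xml
>>>>> <solr persistent="true">
>>>>>   <cores adminPath="/admin/cores">
>>>>>     <!-- each instanceDir is assumed to contain conf/schema.xml with
>>>>>          that language's stemmer and stop word list -->
>>>>>     <core name="en" instanceDir="en"/>
>>>>>     <core name="de" instanceDir="de"/>
>>>>>     <core name="it" instanceDir="it"/>
>>>>>   </cores>
>>>>> </solr>
>>>>> ```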
>>>>>
>>>>> Then on the German front end you make requests to the de core, on the
>>>>> English front end make requests to the English core.
>>>>>
>>>>> This is much simpler than storing every language in the one index; for
>>>>> example, German queries will need to be run through the German query
>>>>> filters etc.  If you have all languages in one schema, then you will
>>>>> have to do some front end logic to map the query to the correct field.
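>>>>>
>>>>> For illustration only, the single-schema alternative might look
>>>>> roughly like this, with one field per language. The field and type
>>>>> names are hypothetical, and the text_en / text_de types are assumed to
>>>>> be defined elsewhere in the schema with their own analyzers. The front
>>>>> end would then have to pick the field itself, e.g. query title_de on
>>>>> the German site.
>>>>>
>>>>> ```xml
>>>>> <fields>
>>>>>   <!-- one copy of the title per supported language (names assumed) -->
>>>>>   <field name="title_en" type="text_en" indexed="true" stored="true"/>
>>>>>   <field name="title_de" type="text_de" indexed="true" stored="true"/>
>>>>>   <!-- catch-all for languages without their own analysis chain -->
>>>>>   <field name="title" type="text" indexed="true" stored="true"/>
>>>>> </fields>
>>>>> ```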
>>>>>
>>>>> You have failed to consider internationalisation of the query side of
>>>>> the process - your field types merely have analysis filters.
>>>>>
>>>>> Additionally, if the data source for each different geography is
>>>>> different, it makes sense to separate the indexes and subsequently the
>>>>> ingestion mechanisms and schedules.
>>>>>
>>>>> Just a few thoughts.
>>>>>
>>>>> John
>>>>>
>>>>> sunnyfr wrote:
>>>>>     
>>>>>         
>>>>>> Hi,
>>>>>>
>>>>>> I would like to set up a multi-language search engine properly,
>>>>>> and I would like your advice on what I have done so far.
>>>>>>
>>>>>> Solr1.3
>>>>>> tomcat55
>>>>>>
>>>>>> http://www.nabble.com/file/p19954805/schema.xml schema.xml 
>>>>>>
>>>>>> Thanks a lot,
>>>>>>

-- 
View this message in context: http://www.nabble.com/Multi-language-solr1.3-what-would-you-reckon--tp19954805p20059666.html
Sent from the Solr - User mailing list archive at Nabble.com.

