lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Proposition of a new feature: Dynamic Field Types
Date Sun, 02 Mar 2008 02:38:04 GMT
I don't quite follow everything here (examples?), but I believe IDF of a term is not a per-field
value, but "index-wide".  Does that change the arguments for this proposal then?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "nicolas.dessaigne@arisem.com" <nicolas.dessaigne@arisem.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, February 29, 2008 11:52:07 AM
> Subject: RE: Proposition of a new feature: Dynamic Field Types
> 
> Thanks for your response Grant.
> 
> You are right: depending on the language, we could index the text in a
> language-specific field. At request time, we would then query all of these
> fields.
> 
> However, I see a few possible problems with this approach. In order of
> decreasing importance:
> 
> - Influence on relevance
> 
> I assume the idf is calculated on a field-by-field basis? With one field per
> language, documents whose language is the least represented in the index
> will receive an unusual boost for cross-lingual tokens. This situation can
> be quite frequent, as the distribution of languages in an index is usually
> heterogeneous. Even if it were homogeneous, we would have the same problem
> with rare text in one language citing words from another.
> 
> On the other hand, you are right that the idf of language-specific words
> is also altered. With one field for all languages, the idf of a word could
> be very low if it is a common word in another language. For example, the
> word "thé" in French is quite rare, but its idf would be greatly altered by
> the word "the" in English.
> 
> We have a dilemma here...
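
To make the two effects concrete, here is a minimal sketch in Python. It assumes the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)), with docFreq counted per field and numDocs taken over the whole index. The corpus sizes, the document frequencies, and the idea that the analyzers map "thé" and "the" to the same token are all made-up assumptions for illustration.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf as in Lucene's classic TF-IDF similarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index: 9,000 English and 1,000 French documents.
NUM_DOCS = 10_000

# Case 1: one field per language. Assumption: docFreq is counted per field,
# while numDocs is the size of the whole index. The cross-lingual token
# "paris" occurs in 10% of the documents of each language.
idf_paris_en = classic_idf(doc_freq=900, num_docs=NUM_DOCS)
idf_paris_fr = classic_idf(doc_freq=100, num_docs=NUM_DOCS)

# Case 2: one field for all languages. Assumption: the analyzers map
# French "thé" and English "the" to the same token.
idf_the_fr_only = classic_idf(doc_freq=10, num_docs=NUM_DOCS)         # "thé" alone
idf_the_shared = classic_idf(doc_freq=10 + 8_900, num_docs=NUM_DOCS)  # merged with "the"

print(f"paris: en={idf_paris_en:.2f} fr={idf_paris_fr:.2f}")
print(f"the:   fr-only={idf_the_fr_only:.2f} shared={idf_the_shared:.2f}")
```

Under these assumptions the minority-language field gets the higher idf for the same token (case 1), while the shared field collapses the idf of the rare French word (case 2).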
> 
> - Performance
> 
> Queries are in O(log n) if I'm not mistaken? Then a disjunction query on x
> language fields would be nearly x times slower, no?
> 
> - Verbose configuration
> 
> Not an important point, but with the dynamic field type, you configure all
> the languages only once. Otherwise, you must do so for each text field.
> 
> The query handler configuration would also be much more verbose. We usually
> use the dismax handler and the qf could become very long.
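
As a rough illustration of how quickly qf grows, the sketch below builds dismax request parameters for one field per language. The field naming scheme (title_en, body_fr, ...) and the language list are hypothetical; selecting the parser with defType=dismax follows later Solr releases and is an assumption here.

```python
from urllib.parse import urlencode

# Hypothetical schema: every text field exists once per language.
LANGUAGES = ["en", "fr", "de"]
TEXT_FIELDS = ["title", "body"]

def build_dismax_params(user_query, languages=LANGUAGES, text_fields=TEXT_FIELDS):
    """Return dismax parameters whose qf lists every (field, language) pair."""
    qf = " ".join(f"{field}_{lang}" for field in text_fields for lang in languages)
    return {"q": user_query, "defType": "dismax", "qf": qf}

params = build_dismax_params("paris")
query_string = urlencode(params)
print(params["qf"])
print(query_string)
```

With 2 text fields and 3 languages, qf already lists 6 fields; it grows as the product of the two.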
> 
> - Highlight
> 
> Not an important point either, but a bit of work needs to be done to
> aggregate the results.
> 
> In conclusion, the choice is not so clear to me. Your remark on relevance
> made me think a bit more about multilingual problems. Might there be a way
> to tune the idf of some fields depending on others?
> 
> Another idea would be to boost documents in the language of the request.
> This may actually be much simpler.
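
One possible way to express such a boost is the dismax bq (boost query) parameter. This is only a sketch: the "language" field is the one mentioned in the original proposal, while the single "text" field, the language codes, and the boost value of 2.0 are assumptions.

```python
def language_boost_params(user_query, request_language, boost=2.0):
    """Return dismax parameters that boost documents written in request_language."""
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": "text",  # hypothetical single field shared by all languages
        "bq": f"language:{request_language}^{boost}",
    }

boosted = language_boost_params("paris", "fr")
print(boosted["bq"])
```

Because bq only adds to the score, documents in other languages still match; they simply rank lower when everything else is equal.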
> 
> If you have any ideas on the subject, I'm very interested!
> 
> Nicolas
> 
> 
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Friday, February 29, 2008 14:06
> To: solr-user@lucene.apache.org
> Subject: Re: Proposition of a new feature: Dynamic Field Types
> 
> Why can't you choose the proper field in your application and keep  
> separate fields per language?  Putting them all in the same field,  
> regardless of language, is not a good idea in my opinion because it is  
> more than likely going to skew your statistics and lower your relevance.
> 
> That being said, the dynamic field type is still an interesting idea.
> 
> -Grant
> 
> On Feb 29, 2008, at 5:56 AM, nicolas.dessaigne@arisem.com wrote:
> 
> > Dynamic field types are field types that act as proxies to other field
> > types. The field type to use is chosen on a per-document basis and
> > depends on the values of the document's fields.
> >
> > The use case that led us to this feature is the indexing of documents in
> > different languages. We use a specific analyzer for each language but
> > want to index semantic information that is not specific to the language.
> >
> > For example, we would add to the index the semantic tag {co:Paris} for
> > the expressions "Paris", "capital city of France", "the city of lights"
> > in English and "Paris", "capitale de la France", "la ville lumière" in
> > French. This allows us to provide advanced functionalities such as
> > semantic and cross-lingual search.
> >
> > To do so in SOLR, we chose to index texts written in different languages
> > in the same field, while analyzing them with different analyzers. Hence
> > the proposition of a new feature that responds to this need: Dynamic
> > Field Types.
> >
> > The idea of this new field type is to act as a proxy to other field
> > types. Depending on the values of some fields of the document to index,
> > it chooses the correct field type to use. In our situation, we use it to
> > choose the correct language-dependent field type based on the value of
> > the field named "language". It is configured with a config similar to the
> > following:
> >
> >     [XML configuration omitted: the markup was stripped by the archive.
> >     Only two attribute fragments survive: name="french_ft" and
> >     name="english_ft".]
> >
> > The last condition is used as a catch-all if the preceding conditions
> > are not met.
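
The selection logic described above can be sketched in a few lines of Python. This is not the proposed Solr implementation, only an illustration of the proxy idea: the rule format, the language codes "fr" and "en", and the catch-all name "default_ft" are hypothetical; only french_ft, english_ft, and the "language" field come from the proposal.

```python
# Each rule pairs an optional (field, expected value) condition with the name
# of the field type to delegate to. A condition of None is the catch-all.
RULES = [
    (("language", "fr"), "french_ft"),
    (("language", "en"), "english_ft"),
    (None, "default_ft"),  # hypothetical catch-all field type
]

def select_field_type(doc, rules=RULES):
    """Return the field type name of the first rule whose condition matches doc."""
    for condition, field_type in rules:
        if condition is None:
            return field_type
        field, expected = condition
        if doc.get(field) == expected:
            return field_type
    raise ValueError("no rule matched and no catch-all is configured")

print(select_field_type({"language": "fr", "text": "la ville lumière"}))
print(select_field_type({"language": "es", "text": "la ciudad de la luz"}))
```

Rules are evaluated in order, so the catch-all must come last; without one, a document in an unlisted language raises an error instead of silently picking an analyzer.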
> >
> > What do you think of this feature?
> >
> > Best regards,
> > Nicolas Dessaigne