From: Otis Gospodnetic
To: solr-user@lucene.apache.org
Subject: Re: Proposition of a new feature: Dynamic Field Types
Date: Sat, 1 Mar 2008 18:38:04 -0800 (PST)
Message-ID: <68538.99431.qm@web50303.mail.re2.yahoo.com>

I don't quite follow everything here (examples?), but I believe the IDF of a
term is not a per-field value, but "index-wide". Does that change the
arguments for this proposal then?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "nicolas.dessaigne@arisem.com"
> To: solr-user@lucene.apache.org
> Sent: Friday, February 29, 2008 11:52:07 AM
> Subject: RE: Proposition of a new feature: Dynamic Field Types
>
> Thanks for your response Grant.
>
> You are right: depending on the language, we could index the text in a
> specific field. At request time, we would then query all the fields.
>
> I see however a few possible problems with this approach. In order of
> decreasing importance:
>
> - Influence on relevance
>
> I assume the idf is calculated on a field-by-field basis? In the context of
> one field per language, the documents whose language is the least represented
> in the index will receive an unusual boost for cross-lingual tokens. This
> situation can be quite frequent as the distribution of languages in the
> index is usually heterogeneous. Even if it were homogeneous, we would have
> the problem with rare text in one language citing words in another.
>
> On the other hand, you are right in the sense that the idf of
> language-specific words is also altered. In the context of one field for all
> languages, the idf could be very low for a word if it is a common word in
> another language.
> For example, the word "thé" in French is quite rare, but
> its idf would be greatly altered by the word "the" in English.
>
> We have a dilemma here...
>
> - Performance
>
> Queries are in O(log n) if I'm not mistaken? Then a disjunction query on x
> language fields would be nearly x times slower, no?
>
> - Verbose configuration
>
> Not an important point, but with the dynamic field type, you configure all
> the languages only once. Otherwise, you must do so for each text field.
>
> The query handler configuration would also be much more verbose. We usually
> use the dismax handler and the qf could become very long.
>
> - Highlighting
>
> Not an important point either, but a bit of work needs to be done to
> aggregate the results.
>
> In conclusion, the choice is not so clear to me. Your remark on
> relevance made me think a bit more about multilingual problems. There may be
> a way to tune the idf of some fields depending on others?
>
> Another idea would be to boost documents in the language of the request.
> This may actually be much simpler.
>
> If you have any ideas on the subject I'm very interested!
>
> Nicolas
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Friday, February 29, 2008 14:06
> To: solr-user@lucene.apache.org
> Subject: Re: Proposition of a new feature: Dynamic Field Types
>
> Why can't you choose the proper field in your application and keep
> separate fields per language?
> Putting them all in the same field, regardless of language, is not a good
> idea in my opinion because it is more than likely going to skew your
> statistics and lower your relevance.
>
> That being said, the dynamic field type is still an interesting idea.
>
> -Grant
>
> On Feb 29, 2008, at 5:56 AM, nicolas.dessaigne@arisem.com wrote:
>
> > Dynamic field types are field types that act as proxies to other field
> > types. The choice of the field type to use is made on a per-document basis
> > and depends on the values of the document's fields.
> >
> > The use case that led us to this feature is the indexing of documents in
> > different languages. We use a specific analyzer for each language but want
> > to index semantic information that is not specific to the language.
> >
> > For example, we would add in the index the semantic tag {co:Paris} for the
> > expressions "Paris", "capital city of France", "the city of lights" in
> > English and "Paris", "capitale de la France", "la ville lumière" in French.
> > This allows us to provide advanced functionalities such as semantic and
> > cross-lingual search.
> >
> > To do so in SOLR, we chose to index texts written in different languages in
> > the same field, while analyzing them with different analyzers. Hence the
> > proposition of a new feature that responds to this need: Dynamic Field
> > Types.
> >
> > The idea of this new field type is to act as a proxy to other field types.
> > Depending on the values of some fields of the document to index, it chooses
> > the correct field type to use. In our situation, we use it to choose the
> > correct language-dependent field type based on the value of the field named
> > "language".
> > It is configured with a config similar to the following:
> >
> >
> > ...
> >
> >
> >
> > ...
> >
> >
> >
> >
> >
> > name="french_ft"/>
> >
> > name="english_ft"/>
> >
> >
> >
> >
> > The last condition is used as a catch-all if preceding conditions are not
> > met.
> >
> > What do you think of this feature?
> >
> > Best regards,
> > Nicolas Dessaigne
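
To make the relevance concern in this thread concrete, here is a toy calculation that is not taken from the original messages. It assumes a classic Lucene-style formula, idf = 1 + ln(numDocs / (docFreq + 1)), that document frequency is counted per field (the premise Nicolas asks about, which Otis questions), that numDocs is index-wide, and that accent folding maps "thé" onto "the". Every count below is invented for illustration.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene-style idf (assumed here): 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: 9,000 English documents and 1,000 French documents.
NUM_DOCS = 10_000

# Case 1: a language-specific word. Assumes accent folding turns "thé" into "the".
df_the_en = 8_500  # invented: "the" occurs in most English documents
df_the_fr = 20     # invented: "thé" is rare in French documents

# One shared field: both languages feed the same document frequency.
idf_the_shared = idf(NUM_DOCS, df_the_en + df_the_fr)
# One field per language: only French documents count for the French field.
idf_the_fr_field = idf(NUM_DOCS, df_the_fr)

# Case 2: a cross-lingual token such as the semantic tag {co:Paris},
# present in the same proportion (5%) of each language's documents.
df_tag_en = 450  # invented
df_tag_fr = 50   # invented
idf_tag_en_field = idf(NUM_DOCS, df_tag_en)
idf_tag_fr_field = idf(NUM_DOCS, df_tag_fr)

print(f"'the' in shared field:   {idf_the_shared:.2f}")
print(f"'the' in French field:   {idf_the_fr_field:.2f}")
print(f"tag in English field:    {idf_tag_en_field:.2f}")
print(f"tag in French field:     {idf_tag_fr_field:.2f}")
```

Under these assumptions, the shared field flattens the idf of the rare French word, while separate fields give the cross-lingual tag a higher idf in the minority-language field. That is the dilemma described above; if idf were truly index-wide, neither effect would appear.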
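
The proxy idea from the original proposal can be sketched as follows. This is only an illustration in Python, not Solr code and not the proposed implementation: the class DynamicFieldType, the toy analyzers, the stopword lists, and the language codes "fr" and "en" are all hypothetical. Only the field name "language", the delegate names french_ft and english_ft, and the catch-all behaviour come from the thread.

```python
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]


def make_analyzer(stopwords: set) -> Analyzer:
    """Build a toy analyzer: lowercase, split on whitespace, drop stopwords."""
    def analyze(text: str) -> List[str]:
        return [tok for tok in text.lower().split() if tok not in stopwords]
    return analyze


# Hypothetical stand-ins for the language-specific field types.
french_ft = make_analyzer({"la", "le", "de"})
english_ft = make_analyzer({"the", "of"})
default_ft = make_analyzer(set())  # catch-all when no condition matches


class DynamicFieldType:
    """Proxy that picks a delegate analyzer from another field of the document."""

    def __init__(self, selector_field: str,
                 delegates: Dict[str, Analyzer], fallback: Analyzer):
        self.selector_field = selector_field
        self.delegates = delegates
        self.fallback = fallback

    def analyze(self, doc: Dict[str, str], field: str) -> List[str]:
        # Look up the delegate by the selector value; use the catch-all otherwise.
        key = doc.get(self.selector_field)
        analyzer = self.delegates.get(key, self.fallback)
        return analyzer(doc[field])


# The language codes "fr" and "en" are assumed values of the "language" field.
dynamic_text = DynamicFieldType(
    selector_field="language",
    delegates={"fr": french_ft, "en": english_ft},
    fallback=default_ft,
)

fr_tokens = dynamic_text.analyze({"language": "fr", "text": "la ville lumière"}, "text")
en_tokens = dynamic_text.analyze({"language": "en", "text": "the city of lights"}, "text")
other_tokens = dynamic_text.analyze({"language": "de", "text": "Die Stadt"}, "text")
```

All three documents are analyzed into the same logical field, but each goes through the analyzer selected by its own "language" value, which is the per-document choice the proposal describes.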