Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Date: Sat, 1 Mar 2008 18:38:04 -0800 (PST)
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Subject: Re: Proposition of a new feature: Dynamic Field Types
To: solr-user@lucene.apache.org
Message-ID: <68538.99431.qm@web50303.mail.re2.yahoo.com>

I don't quite follow everything here (examples?), but I believe the IDF of a
term is not a per-field value but an "index-wide" one.  Does that change the
arguments for this proposal then?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "nicolas.dessaigne@arisem.com" <nicolas.dessaigne@arisem.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, February 29, 2008 11:52:07 AM
> Subject: RE: Proposition of a new feature: Dynamic Field Types
>
> Thanks for your response, Grant.
>
> You are right: depending on the language, we could index the text in a
> specific field. At request time, we would then query all of the fields.
>
> I see, however, a few possible problems with this approach. In order of
> decreasing importance:
>
> - Influence on relevance
>
> I assume the idf is calculated on a field-by-field basis? With one field
> per language, documents whose language is the least represented in the
> index will receive an unusual boost for cross-lingual tokens. This
> situation can be quite frequent, as the distribution of languages in the
> index is usually heterogeneous. Even if it were homogeneous, we would have
> the problem with rare text in one language citing words in another.
>
> On the other hand, you are right in the sense that the idf of
> language-specific words is also altered. With one field for all
> languages, the idf could be very low for a word if it is a common word in
> another language. For example, the word "thé" in French is quite rare, but
> its idf would be greatly altered by the word "the" in English.
>
> We have a dilemma here...
>
> - Performance
>
> Queries are in O(log n), if I'm not mistaken? Then a disjunction query on x
> language fields would be nearly x times slower, no?
>
> - Verbose configuration
>
> Not an important point, but with the dynamic field type, you configure all
> the languages only once. Otherwise, you must do so for each text field.
>
> The query handler configuration would also be much more verbose. We usually
> use the dismax handler, and the qf could become very long.
>
> - Highlight
>
> Not an important point either, but a bit of work needs to be done to
> aggregate the results.
>
> In conclusion, the choice is not so clear to me. Your remark on
> relevance made me think a bit more about multilingual problems. There may
> be a way to tune the idf of some fields depending on others?
>
> Another idea would be to boost documents in the language of the request.
> This may actually be much simpler.
>
> If you have any ideas on the subject, I'm very interested!
>
> Nicolas
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Friday, February 29, 2008 14:06
> To: solr-user@lucene.apache.org
> Subject: Re: Proposition of a new feature: Dynamic Field Types
>
> Why can't you choose the proper field in your application and keep
> separate fields per language?  Putting them all in the same field,
> regardless of language, is not a good idea in my opinion because it is
> more than likely going to skew your statistics and lower your relevance.
>
> That being said, the dynamic field type is still an interesting idea.
>
> -Grant
>
> On Feb 29, 2008, at 5:56 AM, nicolas.dessaigne@arisem.com wrote:
>
> > Dynamic field types are field types that act as proxies to other field
> > types. The choice of the field type to use is made on a per-document
> > basis and depends on the values of the document's fields.
> >
> > The use case that led us to this feature is the indexing of documents
> > in different languages. We use a specific analyzer for each language
> > but want to index semantic information that is not specific to the
> > language.
> >
> > For example, we would add to the index the semantic tag {co:Paris} for
> > the expressions "Paris", "capital city of France", "the city of lights"
> > in English and "Paris", "capitale de la France", "la ville lumière" in
> > French. This allows us to provide advanced functionality such as
> > semantic and cross-lingual search.
> >
> > To do so in SOLR, we chose to index texts written in different
> > languages in the same field, while analyzing them with different
> > analyzers. Hence the proposition of a new feature that responds to this
> > need: Dynamic Field Types.
> >
> > The idea of this new field type is to act as a proxy to other field
> > types. Depending on the values of some fields of the document to index,
> > it chooses the correct field type to use. In our situation, we use it
> > to choose the correct language-dependent field type based on the value
> > of the field named "language". It is configured with a config similar
> > to the following:
> >
> >
> >     ...
> >
> >
> >
> >     ...
> >
> >
> >
> >
> >
> > name="french_ft"/>
> >
> > name="english_ft"/>
> >
> >
> >
> >
> > The last condition is used as a catch-all if preceding conditions are
> > not met.
> >
> > What do you think of this feature?
> >
> > Best regards,
> > Nicolas Dessaigne
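The relevance concern discussed in this thread can be made concrete with a rough calculation. The sketch below is only an illustration under stated assumptions: it uses the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)), it assumes docFreq is counted per (field, term) pair while numDocs is index-wide, and every count in it (1000 documents, 900 English, 100 French, a cross-lingual token such as "paris") is hypothetical, not measured on any real index.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: 1000 documents in total (900 English, 100 French).
NUM_DOCS = 1000

# Hypothetical document frequencies for one cross-lingual token.
DF_EN = 90  # English documents containing the token
DF_FR = 10  # French documents containing the token

# Option A: one shared field for all languages (one combined docFreq).
idf_shared = classic_idf(DF_EN + DF_FR, NUM_DOCS)

# Option B: one field per language (docFreq counted separately per field,
# numDocs assumed to stay index-wide).
idf_en = classic_idf(DF_EN, NUM_DOCS)
idf_fr = classic_idf(DF_FR, NUM_DOCS)

print(f"shared field idf : {idf_shared:.4f}")
print(f"English field idf: {idf_en:.4f}")
print(f"French field idf : {idf_fr:.4f}")
```

Under these assumptions the same token gets a noticeably higher idf in the field of the less represented language, which is the "unusual boost" mentioned above. With a single shared field the token has one idf for all documents, but language-specific words are then exposed to the "thé" / "the" kind of collision.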