Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of lists@nabble.com designates
 216.139.236.158 as permitted sender)
Message-ID: <15736946.post@talk.nabble.com>
Date: Thu, 28 Feb 2008 06:00:10 -0800 (PST)
From: Dharmalingam <dganesan@fc-md.umd.edu>
To: java-user@lucene.apache.org
Subject: Re: Vector Space Model: New Similarity Implementation Issues
In-Reply-To: <003437AD-2007-41B7-9A67-B47CD7CE1E62@apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
References: <15696719.post@talk.nabble.com>
 <003437AD-2007-41B7-9A67-B47CD7CE1E62@apache.org>


Thanks for the reply. Sorry if my explanation was not clear. Yes, you are
correct: the model is based on Salton's VSM. However, the calculation of the
term weight and the doc norm is, in my opinion, different from Lucene's. If
you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
document norm based on the weight wi=tfi*idfi. I looked at the methods of
the Similarity and DefaultSimilarity classes. Here is lengthNorm:

public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc depends only on the number of
terms, which is quite different from the norm calculation on that website.
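
To make the difference concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class and method names are hypothetical, the idf values are assumed to be supplied by the caller, and the second method reflects one reading of the website's table (the norm as the Euclidean length of the tf*idf weight vector) rather than anything Lucene provides.

```java
class DocNormSketch {
    // Same formula as the lengthNorm quoted above: it sees only the term count.
    static float luceneLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed website-style norm: sqrt(sum_i (tf_i * idf_i)^2).
    // tf[i] and idf[i] must refer to the same term.
    static double tfIdfNorm(int[] tf, double[] idf) {
        if (tf.length != idf.length) {
            throw new IllegalArgumentException("tf and idf must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double w = tf[i] * idf[i]; // w_i = tf_i * idf_i
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical document with two distinct terms.
        int[] tf = {1, 2};
        double[] idf = {1.0, 0.5};
        System.out.println("count-based norm factor: " + luceneLengthNorm(tf.length));
        System.out.println("tf*idf vector length:    " + tfIdfNorm(tf, idf));
    }
}
```

The first value needs nothing but the term count, while the second needs the idf of every term in the document, which is the gap described in this message.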

Similarly, the queryNorm method of the DefaultSimilarity class is:

 /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
  public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
  }

This again differs from the website's model.

I also have difficulties with the tf method of DefaultSimilarity:
/** Implemented as <code>sqrt(freq)</code>. */
  public float tf(float freq) {
    return (float)Math.sqrt(freq);
  }

In that website's model, tf refers to the raw frequency of a term within a doc.
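
For reference, the scoring that the website's table seems to describe can be sketched in plain Java as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, idf is assumed to be log10(numDocs / docFreq), tf is the raw count, and nothing here uses or mirrors Lucene's API.

```java
import java.util.HashMap;
import java.util.Map;

class TfIdfCosineSketch {
    // Assumed idf form: log10(numDocs / docFreq).
    static double idf(int docFreq, int numDocs) {
        return Math.log10((double) numDocs / docFreq);
    }

    // Builds the weight vector w_i = tf_i * idf_i using raw term counts.
    // df must contain an entry for every term that appears in tf.
    static Map<String, Double> weights(Map<String, Integer> tf,
                                       Map<String, Integer> df,
                                       int numDocs) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            w.put(e.getKey(), e.getValue() * idf(df.get(e.getKey()), numDocs));
        }
        return w;
    }

    // Euclidean length of a weight vector.
    static double norm(Map<String, Double> w) {
        double sumOfSquares = 0.0;
        for (double v : w.values()) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two weight vectors; 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }
}
```

A query would be turned into a weight vector with the same weights method and compared against each document vector with cosine. Note that the document norm here needs the idf of every term in the document, whereas the lengthNorm signature quoted earlier receives only the field name and the term count.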

I hope I have explained it better. Please let me know if it is unclear. I am
looking for an easy way to implement that table, and of course I want to
integrate it with my Lucene setup (i.e., myIndexWriter.setSimilarity(new
mySimilarity());). Will this be possible just by inheriting from the base
classes of Lucene?

Thanks for your advice.

Grant Ingersoll-6 wrote:
>
> Not sure I am understanding what you are asking, but I will give it a
> shot.   See below
>
>
> On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:
>
>>
>> Hi List,
>>
>> I am pretty new to Lucene. Certainly, it is very exciting. I need to
>> implement a new Similarity class based on the Term Vector Space Model given
>> in http://www.miislita.com/term-vector/term-vector-3.html
>>
>> Although that model is similar to Lucene’s model
>> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
>> I am having hard time to extend the Similarity class to calculate that
>> model.
>>
>> In that model, “tf” is multiplied with Idf for all terms in the index, but
>> in Lucene “tf” is calculated only for terms in the given Query. Because of
>> that effect, the norm calculation should also include “idf” for all terms.
>> Lucene calculates the norm, during indexing, by “just” counting the number
>> of terms per document. In the web formula (in miislita.com), a document norm
>> is calculated after multiplying “tf” and “idf”.
>
> Are you wondering if there is a way to score all documents regardless
> of whether the document has the term or not?  I don't quite get your
> statement: "In that model, “tf” is multiplied with Idf for all terms
> in the index, but in Lucene “tf” is calculated only for terms in the
> given Query."
>
> Isn't the result for those documents that don't have query terms just
> going to be 0 or am I not fully understanding?  I briefly skimmed the
> paper you cite and it doesn't seem that different, it's just
> describing the Salton's VSM right?
>
>>
>>
>> FYI: I could implement “idf” according to miislita.com formula, but not the
>> “tf” and “norm”
>>
>> Could you please comment me how I can implement a new Similarity class that
>> will fit in the Lucene’s architecture, but still implement the vector space
>> model given in miislita.com
>
> In the end, you may need to implement some lower level Query classes,
> but I still don't fully understand what you are trying to do, so I
> wouldn't head down that path just yet.
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15736946.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org