Message-ID: <15736946.post@talk.nabble.com>
Date: Thu, 28 Feb 2008 06:00:10 -0800 (PST)
From: Dharmalingam
To: java-user@lucene.apache.org
Subject: Re: Vector Space Model: New Similarity Implementation Issues
In-Reply-To: <003437AD-2007-41B7-9A67-B47CD7CE1E62@apache.org>
References: <15696719.post@talk.nabble.com>
 <003437AD-2007-41B7-9A67-B47CD7CE1E62@apache.org>

Thanks for the reply. Sorry if my explanation was not clear.

Yes, you are correct: the model is based on Salton's VSM. However, the
calculation of the term weight and the doc norm is, in my opinion,
different from Lucene's. If you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
document norm based on the weight wi = tfi*idfi. I looked at the methods of
the Similarity and DefaultSimilarity classes. Here is lengthNorm from
DefaultSimilarity:

    public float lengthNorm(String fieldName, int numTerms) {
      return (float)(1.0 / Math.sqrt(numTerms));
    }

You can see that this lengthNorm for a doc is quite different from the norm
calculation on that website.

Similarly, the queryNorm method of the DefaultSimilarity class is:

    /** Implemented as 1/sqrt(sumOfSquaredWeights). */
    public float queryNorm(float sumOfSquaredWeights) {
      return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }

This is again different from the website model.

I also have difficulties with the tf method of DefaultSimilarity:

    /** Implemented as sqrt(freq). */
    public float tf(float freq) {
      return (float)Math.sqrt(freq);
    }

In that website model, tf is simply the frequency of a term within a doc.

I hope I have explained it better. Please let me know if it is still
unclear.

I am looking for an easy way to implement that table, and of course I want
to integrate it with my Lucene setup (i.e.,
myIndexWriter.setSimilarity(new mySimilarity());).

Will this be possible just by somehow inheriting from the base classes of
Lucene?

Thanks for your advice.


Grant Ingersoll-6 wrote:
>
> Not sure I am understanding what you are asking, but I will give it a
> shot. See below
>
> On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:
>
>> Hi List,
>>
>> I am pretty new to Lucene. Certainly, it is very exciting.
>> I need to implement a new Similarity class based on the Term Vector Space
>> Model given in http://www.miislita.com/term-vector/term-vector-3.html
>>
>> Although that model is similar to Lucene’s model
>> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
>> I am having hard time to extend the Similarity class to calculate that
>> model.
>>
>> In that model, “tf” is multiplied with Idf for all terms in the index, but
>> in Lucene “tf” is calculated only for terms in the given Query. Because of
>> that effect, the norm calculation should also include “idf” for all terms.
>> Lucene calculates the norm, during indexing, by “just” counting the number
>> of terms per document. In the web formula (in miislita.com), a document norm
>> is calculated after multiplying “tf” and “idf”.
>
> Are you wondering if there is a way to score all documents regardless
> of whether the document has the term or not? I don't quite get your
> statement: "In that model, “tf” is multiplied with Idf for all terms
> in the index, but in Lucene “tf” is calculated only for terms in the
> given Query."
>
> Isn't the result for those documents that don't have query terms just
> going to be 0 or am I not fully understanding? I briefly skimmed the
> paper you cite and it doesn't seem that different, it's just
> describing the Salton's VSM right?
>
>> FYI: I could implement “idf” according to miislita.com formula, but not the
>> “tf” and “norm”
>>
>> Could you please comment me how I can implement a new Similarity class that
>> will fit in the Lucene’s architecture, but still implement the vector space
>> model given in miislita.com
>
> In the end, you may need to implement some lower level Query classes,
> but I still don't fully understand what you are trying to do, so I
> wouldn't head down that path just yet.
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15736946.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
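
For reference, the norm calculation discussed in this message can be
sketched in plain Java, independent of Lucene. This is only an
illustration under stated assumptions: the class TfIdfNormSketch and all
of its method names are hypothetical, the idf form log10(D/df) is an
assumed reading of the miislita.com tutorial, and the numbers in main are
made up. It contrasts a norm built from tf*idf weights over all terms with
the 1/sqrt(numTerms) formula of DefaultSimilarity.lengthNorm quoted in
this message.

```java
/**
 * Hypothetical helper, not part of Lucene: tf*idf term weights, the
 * Euclidean document norm built from them, and cosine similarity.
 */
final class TfIdfNormSketch {

    /** Assumed idf form: log10(numDocs / docFreq); docFreq must be >= 1. */
    static double idf(int numDocs, int docFreq) {
        return Math.log10((double) numDocs / docFreq);
    }

    /** w_i = tf_i * idf_i for every vocabulary term, not only query terms. */
    static double[] weights(int[] termFreqs, int[] docFreqs, int numDocs) {
        double[] w = new double[termFreqs.length];
        for (int i = 0; i < w.length; i++) {
            w[i] = termFreqs[i] * idf(numDocs, docFreqs[i]);
        }
        return w;
    }

    /** Vector length: square root of the sum of squared weights. */
    static double norm(double[] w) {
        double sumOfSquares = 0.0;
        for (double x : w) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine between two equal-length weight vectors; 0 if either is all zeros. */
    static double cosine(double[] q, double[] d) {
        double dot = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
        }
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    /** Same formula as the DefaultSimilarity.lengthNorm quoted in this message. */
    static double luceneStyleLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int numDocs = 10;             // made-up collection size
        int[] docFreqs = {1, 1, 10};  // made-up document frequencies
        int[] docTf = {3, 4, 2};      // made-up term counts in one document
        int[] queryTf = {1, 0, 0};    // made-up query term counts

        double[] d = weights(docTf, docFreqs, numDocs);
        double[] q = weights(queryTf, docFreqs, numDocs);
        System.out.println("tf*idf norm:       " + norm(d));
        System.out.println("Lucene-style norm: " + luceneStyleLengthNorm(9));
        System.out.println("cosine(q, d):      " + cosine(q, d));
    }
}
```

The tf*idf norm needs the document frequency of every term, which is a
collection-wide statistic, while the 1/sqrt(numTerms) formula needs only
the term count of the one field being indexed. How to wire such a norm
into a Similarity subclass is left open here.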