Date: Sat, 8 Oct 2005 13:00:41 +0100
From: adasal
To: java-user@lucene.apache.org, Lorenzo Viscanti
Subject: Re: Regarding Lucene and LSI

Very interesting.

Actually, Meet Lucene Part 2 by Otis Gospodnetic and Erik Hatcher mentions
Egothor and MG4J. What is not mentioned is Carrot2, which, I think, was
originally used with Egothor but has also been submitted to Lucene CVS.
There is a helpful discussion, with links, in the lucene-user thread
"Re: Term Weights and Clustering" from Feb 2005.

MG4J also looks very interesting. There are good resources about it on
Sebastiano Vigna's web site, vigna.dsi.unimi.it. They mention the use of
Bloom filters, about which an article has been written by Maciej Ceglowski,
who is (largely) responsible for the LSI/ContextGraph implementation at
NITLE: "Using Bloom Filters" on Perl.com.
Bloom filters look a bit like the random index from Sahlgren that I have
mentioned.

Much to do!

Adam

On 10/7/05, Lorenzo Viscanti wrote:
>
> I use my own LSI implementation based on Lucene for text clustering.
> I've done some tests, but I do believe that integrating LSI into the
> Lucene search subsystem (i.e. creating something like LSISimilarity) is
> not an easy task.
>
> I start by analyzing the documents using Lucene, and then extract tf-idf
> values (with Lucene again) in order to build a documents/terms matrix.
> Then I use an implementation of LSI/SVD to analyze it.
> At this point I think that reassigning the scores back to Lucene documents
> is very difficult, but I'm trying to grab the modified scores from the
> matrix in my LSISimilarity.
> Clustering search results this way, by contrast, is not too difficult: I
> just apply the algorithm (mostly HAC-like) to the modified matrix.
> To search using LSI you must choose a small subset of the collection,
> apply LSI/SVD to it, and then extend the matrix by 'folding in' new
> documents. But how to choose the initial subset? Maybe just by searching
> the index and then using the first n documents retrieved.
> Any ideas?
> Lorenzo
>
> On 10/7/05, Paul Libbrecht wrote:
> >
> > I've met other people with such needs and we would also be interested.
> >
> > Unfortunately, this seems not to be available.
> > A clear issue might be that LSI, in its original form at least, is
> > covered by a US patent. But maybe someone will find another form which
> > is not.
> >
> > paul
> >
> >
> > On 5 Oct. 05, at 14:59, wrote:
> > > I am looking for an LSI implementation in Lucene. Is it available? I
> > > couldn't find it on the website. I searched the archives, but no
> > > help. Could someone tell me whether it is available or not?
> > >
> > > Could you also tell me where I can look to find out whether there
> > > are any language processing tools for indexing and retrieval
> > > available?
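
The pipeline Lorenzo describes above (build a terms/documents matrix, run
a truncated SVD on it, then 'fold in' further documents) can be sketched
roughly as follows. This is only an illustration under stated assumptions,
not his implementation: the class and method names are hypothetical, raw
term counts stand in for the tf-idf values he extracts from Lucene, and the
SVD is a naive power iteration that is only reasonable for toy matrices
whose leading singular values are distinct. Folding in uses the usual
projection d_hat = inverse(Sigma_k) * transpose(U_k) * d.

```java
import java.util.Arrays;

// Hypothetical sketch: rank-k LSI on a tiny terms x docs matrix, plus fold-in.
final class LsiFoldInSketch {

    static double dot(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    static double norm(double[] x) {
        return Math.sqrt(dot(x, x));
    }

    /**
     * Returns the top-k left singular vectors of a (terms x docs), one per row,
     * and writes the matching singular values into sigma. Uses power iteration
     * on A*A^T with deflation; assumes rank(a) >= k and distinct singular values.
     */
    static double[][] topLeftSingularVectors(double[][] a, int k, double[] sigma) {
        int m = a.length;
        double[][] b = new double[m][m];          // b = A * A^T
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                b[i][j] = dot(a[i], a[j]);

        double[][] u = new double[k][];
        for (int r = 0; r < k; r++) {
            double[] v = new double[m];
            for (int i = 0; i < m; i++) v[i] = 1.0 / (i + r + 1); // fixed start vector
            double lambda = 0.0;                  // eigenvalue estimate of A*A^T
            for (int it = 0; it < 500; it++) {
                double[] w = new double[m];
                for (int i = 0; i < m; i++) w[i] = dot(b[i], v);
                lambda = norm(w);
                if (lambda == 0.0) break;         // nothing left to extract
                for (int i = 0; i < m; i++) v[i] = w[i] / lambda;
            }
            u[r] = v;
            sigma[r] = Math.sqrt(lambda);         // singular value = sqrt(eigenvalue)
            for (int i = 0; i < m; i++)           // deflate the component just found
                for (int j = 0; j < m; j++)
                    b[i][j] -= lambda * v[i] * v[j];
        }
        return u;
    }

    /** Projects a new term vector into the k-dimensional LSI space. */
    static double[] foldIn(double[][] u, double[] sigma, double[] termVector) {
        double[] reduced = new double[u.length];
        for (int r = 0; r < u.length; r++)
            reduced[r] = dot(u[r], termVector) / sigma[r];
        return reduced;
    }

    static double cosine(double[] x, double[] y) {
        return dot(x, y) / (norm(x) * norm(y));
    }

    public static void main(String[] args) {
        // Rows are terms, columns are the documents of the initial subset.
        double[][] a = { {1, 1, 0}, {1, 1, 0}, {0, 0, 1} };
        double[] sigma = new double[2];
        double[][] u = topLeftSingularVectors(a, 2, sigma);

        double[] newDoc = foldIn(u, sigma, new double[] {1, 1, 0}); // uses terms 0 and 1
        double[] query  = foldIn(u, sigma, new double[] {1, 0, 0}); // uses term 0 only
        System.out.println("singular values: " + Arrays.toString(sigma));
        System.out.println("folded-in doc:   " + Arrays.toString(newDoc));
        System.out.println("cosine(doc, query) = " + cosine(newDoc, query));
    }
}
```

In this toy matrix terms 0 and 1 always co-occur, so a query on term 0
alone should land very close to a folded-in document containing both terms.
For a real collection a proper SVD library would be the sensible choice,
and the question of which initial subset to decompose remains open.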