From: Ted Dunning <ted.dunning@gmail.com>
To: general@lucene.apache.org
Subject: Re: Cluster Retrieval in Lucene
Date: Sat, 27 Nov 2010 11:54:31 -0800

So, yes. You can do this kind of retrieval using Lucene.

Avoiding the details of the Liu and Croft method, the basic idea is that the observed words in a document can be augmented by means of a hierarchical language model. This means that there is a corpus language model describing the gross characteristics of the language in question. Below that are clusters, each with its own model that is informed by both the corpus model and the details of the documents in the cluster. Finally, there are document language models informed by the contents of the document, the contents of the cluster, and possibly also by the corpus model. The document language model can be used to derive words that the author might well have said if they had written at greater length.
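To make the smoothing idea concrete, here is a rough sketch (not Liu and Croft's exact estimator): interpolate the document's maximum-likelihood model with the cluster and corpus models, then keep high-probability words that never appeared in the document as the "derived" words. The class name, helper names, and interpolation weights below are made up purely for illustration.

    // Sketch only: rank unseen words by an interpolated (document, cluster,
    // corpus) language model and keep the top k as "derived" words.
    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;
    import java.util.Set;

    public class DerivedWords {

      private static double prob(Map<String, Double> model, String w) {
        Double p = model.get(w);
        return p == null ? 0.0 : p;
      }

      // p(w|d) = lDoc * pML(w|doc) + lCluster * p(w|cluster) + lCorpus * p(w|corpus)
      static double smoothed(String w, Map<String, Double> doc,
                             Map<String, Double> cluster, Map<String, Double> corpus,
                             double lDoc, double lCluster, double lCorpus) {
        return lDoc * prob(doc, w) + lCluster * prob(cluster, w) + lCorpus * prob(corpus, w);
      }

      // Words the author "might well have said": high smoothed probability,
      // but absent from the observed document.
      static List<String> derivedWords(Map<String, Double> doc, Map<String, Double> cluster,
                                       Map<String, Double> corpus, int k) {
        PriorityQueue<SimpleEntry<String, Double>> best =
            new PriorityQueue<SimpleEntry<String, Double>>(k + 1,
                new Comparator<SimpleEntry<String, Double>>() {
                  public int compare(SimpleEntry<String, Double> a, SimpleEntry<String, Double> b) {
                    return Double.compare(a.getValue(), b.getValue()); // min-heap on probability
                  }
                });
        Set<String> vocabulary = new HashSet<String>(cluster.keySet());
        vocabulary.addAll(corpus.keySet());
        for (String w : vocabulary) {
          if (doc.containsKey(w)) continue;                            // already observed
          double p = smoothed(w, doc, cluster, corpus, 0.6, 0.3, 0.1); // weights are arbitrary here
          best.add(new SimpleEntry<String, Double>(w, p));
          if (best.size() > k) best.poll();                            // drop the weakest candidate
        }
        List<String> words = new ArrayList<String>();
        for (SimpleEntry<String, Double> e : best) words.add(e.getKey());
        return words;
      }
    }

In practice you would estimate the cluster and corpus models from your clustering output and tune the interpolation weights on held-out data.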
You can index these derived words just as easily as the words that actually appear. You may be able to get away with Lucene's native ability to modify weights on words, or you might need to use the flexibility of the scoring system to build your own scoring (there is a rough indexing sketch at the bottom of this message, after the quoted text).

Also, depending on the scale of your corpus, I would suggest that you might benefit from the Apache Mahout project's k-means clustering.

If you allow probabilistic membership in multiple clusters, then this language model probably reduces to either PLSI or LDA (I can't say which without detailed analysis). You could also start with those models and do the augmented-indexing trick. Mahout has a reasonably nice implementation of LDA as well as k-means.

My guess is that the hierarchical nature of the language model provides a gain even with LDA, so you might want to do conventional clustering, then LDA on the clusters, then LDA on the contents of each cluster.

So the answer is yes.

On Sat, Nov 27, 2010 at 1:19 AM, vermansi wrote:

> Cluster-Based Retrieval Using Language Models
> <http://www.google.co.in/url?sa=t&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.83.4177%26rep%3Drep1%26type%3Dpdf&ei=8szwTPaOMJHIuAO3lIH6DQ&usg=AFQjCNEiQCxvKNZMfGKk6pRtdLaqIY847g&sig2=GE_yyn_ow9KQojgwnZ2ACw>
> by X Liu - 2004
>
> I hope this helps .. sorry for constantly giving incomplete information
>
> Regards
> Manisha
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Cluster-Retrieval-in-Lucene-tp1968500p1976646.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
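P.S. Here is the rough indexing sketch mentioned above. It uses the Lucene 3.x-era API (field-level index-time boosts); the field names, the index directory, and the 0.25 boost are made up for illustration, and true per-term weights would need payloads or a custom Similarity rather than a field boost.

    // Sketch only: index the observed text and the derived words side by side,
    // with the derived words in a separate, down-weighted field.
    import java.io.File;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class AugmentedIndexer {

      public static void index(String observedText, List<String> derivedWords,
                               File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        try {
          Document doc = new Document();

          // The words that actually appear in the document.
          doc.add(new Field("text", observedText, Field.Store.NO, Field.Index.ANALYZED));

          // The words the language model says the author might have used,
          // collected into one space-separated field and down-weighted at
          // index time relative to the observed text.
          StringBuilder derived = new StringBuilder();
          for (String w : derivedWords) derived.append(w).append(' ');
          Field derivedField = new Field("derived_text", derived.toString(),
                                         Field.Store.NO, Field.Index.ANALYZED);
          derivedField.setBoost(0.25f); // arbitrary down-weight; tune for your corpus
          doc.add(derivedField);

          writer.addDocument(doc);
        } finally {
          writer.close();
        }
      }
    }

At query time you would then search both fields, for example with MultiFieldQueryParser over "text" and "derived_text", so that a document can match on words it never literally contained.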