From: Ted Dunning <ted.dunning@gmail.com>
To: general@lucene.apache.org
Subject: Re: Cluster Retrieval in Lucene
Date: Sat, 27 Nov 2010 11:54:31 -0800

So, yes. You can do this kind of retrieval using Lucene.

Avoiding the details of the Liu and Croft method, the basic idea is that the observed words in a document can be augmented by means of a hierarchical language model. This means that there is a corpus language model describing the gross characteristics of the language in question. Below that are clusters, each with its own model that is informed by both the corpus model and the details of the documents in the cluster. Finally, there are document language models informed by the contents of the document, the contents of the cluster, and possibly also by the corpus model. The document language model can be used to derive words that the author might well have said if they had written at greater length.
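To make the smoothing idea concrete, here is a rough sketch (not Liu and Croft's exact estimator): interpolate the document's maximum-likelihood model with the cluster and corpus models, then keep high-probability words that never appeared in the document as the "derived" words. The class name, helper names, and interpolation weights below are made up purely for illustration.

    // Sketch only: rank unseen words by an interpolated (document, cluster,
    // corpus) language model and keep the top k as "derived" words.
    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;
    import java.util.Set;

    public class DerivedWords {

      private static double prob(Map<String, Double> model, String w) {
        Double p = model.get(w);
        return p == null ? 0.0 : p;
      }

      // p(w|d) = lDoc * pML(w|doc) + lCluster * p(w|cluster) + lCorpus * p(w|corpus)
      static double smoothed(String w, Map<String, Double> doc,
                             Map<String, Double> cluster, Map<String, Double> corpus,
                             double lDoc, double lCluster, double lCorpus) {
        return lDoc * prob(doc, w) + lCluster * prob(cluster, w) + lCorpus * prob(corpus, w);
      }

      // Words the author "might well have said": high smoothed probability,
      // but absent from the observed document.
      static List<String> derivedWords(Map<String, Double> doc, Map<String, Double> cluster,
                                       Map<String, Double> corpus, int k) {
        PriorityQueue<SimpleEntry<String, Double>> best =
            new PriorityQueue<SimpleEntry<String, Double>>(k + 1,
                new Comparator<SimpleEntry<String, Double>>() {
                  public int compare(SimpleEntry<String, Double> a, SimpleEntry<String, Double> b) {
                    return Double.compare(a.getValue(), b.getValue()); // min-heap on probability
                  }
                });
        Set<String> vocabulary = new HashSet<String>(cluster.keySet());
        vocabulary.addAll(corpus.keySet());
        for (String w : vocabulary) {
          if (doc.containsKey(w)) continue;                            // already observed
          double p = smoothed(w, doc, cluster, corpus, 0.6, 0.3, 0.1); // weights are arbitrary here
          best.add(new SimpleEntry<String, Double>(w, p));
          if (best.size() > k) best.poll();                            // drop the weakest candidate
        }
        List<String> words = new ArrayList<String>();
        for (SimpleEntry<String, Double> e : best) words.add(e.getKey());
        return words;
      }
    }

In practice you would estimate the cluster and corpus models from your clustering output and tune the interpolation weights on held-out data.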
You can index these derived words just as easily as the words that actually appear. You may be able to get away with Lucene's native ability to modify weights on words, or you might need to use the flexibility of the scoring system to build your own scoring (there is a rough indexing sketch at the bottom of this message, after the quoted text).

Also, depending on the scale of your corpus, I would suggest that you might benefit from the Apache Mahout project's k-means clustering.

If you allow probabilistic membership in multiple clusters, then this language model probably reduces to either PLSI or LDA (I can't say which without detailed analysis). You could also start with those models and do the augmented-indexing trick. Mahout has a reasonably nice implementation of LDA as well as k-means.

My guess is that the hierarchical nature of the language model provides a gain even with LDA, so you might want to do conventional clustering, then LDA on the clusters, then LDA on the contents of each cluster.

So the answer is yes.

On Sat, Nov 27, 2010 at 1:19 AM, vermansi wrote:

> Cluster-Based Retrieval Using Language Models
> <http://www.google.co.in/url?sa=t&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.83.4177%26rep%3Drep1%26type%3Dpdf&ei=8szwTPaOMJHIuAO3lIH6DQ&usg=AFQjCNEiQCxvKNZMfGKk6pRtdLaqIY847g&sig2=GE_yyn_ow9KQojgwnZ2ACw>
> by X Liu - 2004
>
> I hope this helps .. sorry for constantly giving incomplete information
>
> Regards
> Manisha
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Cluster-Retrieval-in-Lucene-tp1968500p1976646.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
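P.S. Here is the rough indexing sketch mentioned above. It uses the Lucene 3.x-era API (field-level index-time boosts); the field names, the index directory, and the 0.25 boost are made up for illustration, and true per-term weights would need payloads or a custom Similarity rather than a field boost.

    // Sketch only: index the observed text and the derived words side by side,
    // with the derived words in a separate, down-weighted field.
    import java.io.File;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class AugmentedIndexer {

      public static void index(String observedText, List<String> derivedWords,
                               File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        try {
          Document doc = new Document();

          // The words that actually appear in the document.
          doc.add(new Field("text", observedText, Field.Store.NO, Field.Index.ANALYZED));

          // The words the language model says the author might have used,
          // collected into one space-separated field and down-weighted at
          // index time relative to the observed text.
          StringBuilder derived = new StringBuilder();
          for (String w : derivedWords) derived.append(w).append(' ');
          Field derivedField = new Field("derived_text", derived.toString(),
                                         Field.Store.NO, Field.Index.ANALYZED);
          derivedField.setBoost(0.25f); // arbitrary down-weight; tune for your corpus
          doc.add(derivedField);

          writer.addDocument(doc);
        } finally {
          writer.close();
        }
      }
    }

At query time you would then search both fields, for example with MultiFieldQueryParser over "text" and "derived_text", so that a document can match on words it never literally contained.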