References: <24931.12945.qm@web50301.mail.re2.yahoo.com> <531769.90099.qm@web113315.mail.gq1.yahoo.com>
From: ahmed algohary
Date: Sun, 22 Aug 2010 15:27:27 +0200
Subject: Re: Calculate Term Co-occurrence Matrix
To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=0016364ecd62689835048e697fe6

--0016364ecd62689835048e697fe6
Content-Type: text/plain; charset=ISO-8859-1

I think I got it. In the CollectionIndexer class, I have added the
co-occurrence score to the index document:

    doc.add(new Field("score", collocation.getScore() + "", Field.Store.YES,
        Field.Index.NOT_ANALYZED));

Then, in the CollectionSearcher, the scores can be retrieved:

    d.get("score")

Is that correct?

On Sun, Aug 22, 2010 at 2:47 PM, ahmed algohary wrote:

> Thanks! It is exactly what I need. But isn't there a way to get the
> matching score?
>
> For example, "damaged" co-occurs with "shipment" with a probability of 0.4?
>
>
> On Sun, Aug 22, 2010 at 5:35 AM, Ivan Provalov wrote:
>
>> Ahmed,
>>
>> FYI, I updated the term collocations package I mentioned earlier with a
>> few fixes and changes which will make it work for Lucene 3.0.2. This may
>> help your task.
>>
>> See:
>> https://issues.apache.org/jira/browse/LUCENE-474
>>
>> Thanks,
>>
>> Ivan Provalov
>>
>>
>> --- On Sat, 8/21/10, Otis Gospodnetic wrote:
>>
>> > From: Otis Gospodnetic
>> > Subject: Re: Calculate Term Co-occurrence Matrix
>> > To: java-user@lucene.apache.org
>> > Date: Saturday, August 21, 2010, 8:05 AM
>> > Ahmed,
>> >
>> > That's what that KPE (link in my previous email, below) will do for
>> > you. It's not open source at this time, but that is exactly one of
>> > the things it does. I think Mahout collocations stuff might work for
>> > you, too.
>> >
>> > Otis
>> > ----
>> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > Lucene ecosystem search :: http://search-lucene.com/
>> >
>> >
>> >
>> > ----- Original Message ----
>> > > From: ahmed algohary
>> > > To: java-user@lucene.apache.org
>> > > Sent: Sat, August 21, 2010 7:20:03 AM
>> > > Subject: Re: Calculate Term Co-occurrence Matrix
>> > >
>> > > Thanks for all your answers!
>> > >
>> > > It seems like I did not make my question clear. I have a text corpus
>> > > and I need to determine the pairs of words that occur together in
>> > > many documents. I need to do that to be able to measure the semantic
>> > > proximity between words. This method is expanded here.
>> > > I hope to find some code that, given a text corpus, generates all the
>> > > word pairs with their probability of occurring together.
>> > >
>> > >
>> > > On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic <
>> > > otis_gospodnetic@yahoo.com> wrote:
>> > >
>> > > > There is also a non-Mahout Key Phrase Extractor for Collocations,
>> > > > SIPs, and a few other things:
>> > > > http://sematext.com/products/key-phrase-extractor/index.html
>> > > >
>> > > > One of the demos that uses news data is at
>> > > > http://sematext.com/demo/kpe/index.html
>> > > >
>> > > > Otis
>> > > > ----
>> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > > > Lucene ecosystem search :: http://search-lucene.com/
>> > > >
>> > > >
>> > > >
>> > > > ----- Original Message ----
>> > > > > From: Grant Ingersoll
>> > > > > To: java-user@lucene.apache.org
>> > > > > Sent: Fri, August 20, 2010 8:52:17 AM
>> > > > > Subject: Re: Calculate Term Co-occurrence Matrix
>> > > > >
>> > > > > You might also be interested in Mahout's collocations package:
>> > > > > http://cwiki.apache.org/confluence/display/MAHOUT/Collocations
>> > > > >
>> > > > > -Grant
>> > > > > On Aug 19, 2010, at 11:39 AM, ahmed algohary wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > I need to know if there is a Lucene plug-in or a Lucene-based API
>> > > > > > for calculating the term co-occurrence matrix for a given text
>> > > > > > corpus.
>> > > > > >
>> > > > > > Thanks!
>> > > > > >
>> > > > > > --
>> > > > > > Ahmed
>> > > > >
>> > > > > --------------------------
>> > > > > Grant Ingersoll
>> > > > > http://www.lucidimagination.com/
>> > > > >
>> > > > > Search the Lucene ecosystem using Solr/Lucene:
>> > > > > http://www.lucidimagination.com/search
>> > > > >
>> > > > > ---------------------------------------------------------------------
>> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
>> > > > >
>> > > >
>> > >
>> >
>>
>
--0016364ecd62689835048e697fe6--
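
The request that runs through this thread (list the word pairs of a corpus together with their probability of occurring together) can be illustrated without Lucene. What follows is only a minimal sketch, assuming that "probability" means the fraction of documents that contain both terms; the thread never pins that definition down. The class name CooccurrenceSketch, the method pairProbabilities, the "a|b" key format, and the five toy documents are all hypothetical and are not part of Lucene, Mahout, or the LUCENE-474 package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of Lucene or LUCENE-474. */
final class CooccurrenceSketch {

    /**
     * Maps each unordered term pair, written "a|b" with a sorted before b,
     * to the fraction of documents that contain both terms (assumed definition).
     */
    static Map<String, Double> pairProbabilities(List<List<String>> docs) {
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (List<String> doc : docs) {
            // Distinct, sorted terms, so each pair counts at most once per document.
            List<String> terms = new ArrayList<>(new TreeSet<>(doc));
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    pairCounts.merge(terms.get(i) + "|" + terms.get(j), 1, Integer::sum);
                }
            }
        }
        // Turn document counts into fractions of the corpus size.
        Map<String, Double> probabilities = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            probabilities.put(e.getKey(), e.getValue() / (double) docs.size());
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Toy corpus (made up): each inner list is one already-tokenized document.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("damaged", "shipment", "arrived"),
            Arrays.asList("shipment", "damaged"),
            Arrays.asList("shipment", "arrived"),
            Arrays.asList("invoice", "sent"),
            Arrays.asList("invoice", "shipment"));
        // "damaged" and "shipment" share 2 of the 5 documents.
        System.out.println(pairProbabilities(docs).get("damaged|shipment"));
    }
}
```

For a real corpus the per-document term lists would come from an analyzer or an existing index rather than hand-written lists, and the quadratic pair loop would probably need pruning (for example by minimum term frequency) on long documents.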