lucene-java-user mailing list archives

From Aida Hota <hota.a...@gmail.com>
Subject Re: Calculate Term Co-occurrence Matrix
Date Mon, 23 Aug 2010 19:07:38 GMT
Hi Ivan!

Sorry for not being clear. I am talking about term n-grams (shingles).
Something like:

poster
online advertising
yellow cab
this is phrase
sunshine
good morning sunshine

with their frequencies. That is, the list returned should contain only the
popular phrases and terms whose frequency exceeds a certain threshold.
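
To make the request concrete, here is a minimal sketch of the kind of output I mean. It is plain Java, not Lucene code: the class name NgramCounter, its two methods, and the naive tokenization are assumptions made up for illustration and are not part of the LUCENE-474 package.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of Lucene or LUCENE-474. */
class NgramCounter {

    /** Counts every term n-gram of length 1..maxN across the given texts. */
    static Map<String, Integer> count(List<String> texts, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            // Naive tokenization (an assumption): lowercase, then split on
            // runs of characters that are neither letters nor digits.
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            // Slide a window of each size n over the token list.
            for (int n = 1; n <= maxN; n++) {
                for (int i = 0; i + n <= tokens.size(); i++) {
                    String gram = String.join(" ", tokens.subList(i, i + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Keeps only n-grams seen at least minCount times, sorted by n-gram text. */
    static Map<String, Integer> frequent(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

Calling NgramCounter.count(texts, 3) and then NgramCounter.frequent(counts, threshold) would give one unified map of unigrams, bigrams and trigrams with their frequencies. The n-grams here do not cross text boundaries, and the counts are raw occurrence counts rather than document frequencies.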

Thanks, Ivan




On Mon, Aug 23, 2010 at 8:41 PM, Ivan Provalov <iprovalo@yahoo.com> wrote:

> Aida,
>
> Are you talking about letter n-grams or term n-grams?
>
> Thanks,
>
> Ivan
>
> --- On Mon, 8/23/10, Aida Hota <hota.aida@gmail.com> wrote:
>
> > From: Aida Hota <hota.aida@gmail.com>
> > Subject: Re: Calculate Term Co-occurrence Matrix
> > To: java-user@lucene.apache.org
> > Date: Monday, August 23, 2010, 1:36 PM
> > Hi Ivan, thanks a lot for this. I just found time to look at it and reply.
> > Sorry for bugging you again; I already appreciate what you uploaded. I would
> > also like to ask one question, if you don't mind: is it possible somehow
> > to get from this a unified list of frequently occurring unigrams, bigrams and
> > trigrams with their frequencies?
> >
> > Thank you very much
> >
> >
> > On Mon, Aug 23, 2010 at 3:22 PM, Ivan Provalov <iprovalo@yahoo.com> wrote:
> >
> > > Ahmed, if you want the raw score, you can do it the way you describe below.
> > >
> > >
> > >
> > > --- On Sun, 8/22/10, ahmed algohary <algoharyalex@gmail.com> wrote:
> > >
> > > > From: ahmed algohary <algoharyalex@gmail.com>
> > > > Subject: Re: Calculate Term Co-occurrence Matrix
> > > > To: java-user@lucene.apache.org
> > > > Date: Sunday, August 22, 2010, 9:27 AM
> > > > I think I got it.
> > > >
> > > > In the CollectionIndexer class, I have added the
> > > > co-occurrence score to the
> > > > index document:
> > > >
> > > >  doc.add(new Field("score", collocation.getScore() + "",
> > > >      Field.Store.YES, Field.Index.NOT_ANALYZED));
> > > >
> > > > then in the CollectionSearcher, the scores can be retrieved:
> > > >
> > > >  d.get("score")
> > > >
> > > > Is that correct ??
> > > >
> > > > On Sun, Aug 22, 2010 at 2:47 PM, ahmed algohary <algoharyalex@gmail.com> wrote:
> > > >
> > > > > Thanks! It is exactly what I need. But isn't there a way to get the
> > > > > matching score? For example, "damaged" co-occurs with "shipment"
> > > > > with a probability of 0.4?
> > > > >
> > > > >
> > > > > On Sun, Aug 22, 2010 at 5:35 AM, Ivan Provalov <iprovalo@yahoo.com> wrote:
> > > > >
> > > > >> Ahmed,
> > > > >>
> > > > >> FYI, I updated the term collocations package I mentioned earlier with a
> > > > >> few fixes and changes which will make it work for Lucene 3.0.2.  This may
> > > > >> help your task.
> > > > >>
> > > > >> See:
> > > > >> https://issues.apache.org/jira/browse/LUCENE-474
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Ivan Provalov
> > > > >>
> > > > >>
> > > > >> --- On Sat, 8/21/10, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> > > > >>
> > > > >> > From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
> > > > >> > Subject: Re: Calculate Term Co-occurrence Matrix
> > > > >> > To: java-user@lucene.apache.org
> > > > >> > Date: Saturday, August 21, 2010, 8:05 AM
> > > > >> > Ahmed,
> > > > >> >
> > > > >> > That's what that KPE (link in my previous email, below) will do for you.
> > > > >> > It's not open source at this time, but that is exactly one of the things
> > > > >> > it does.  I think Mahout collocations stuff might work for you, too.
> > > > >> >
> > > > >> > Otis
> > > > >> > ----
> > > > >> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > >> > Lucene ecosystem search :: http://search-lucene.com/
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > ----- Original Message ----
> > > > >> > > From: ahmed algohary <algoharyalex@gmail.com>
> > > > >> > > To: java-user@lucene.apache.org
> > > > >> > > Sent: Sat, August 21, 2010 7:20:03 AM
> > > > >> > > Subject: Re: Calculate Term Co-occurrence Matrix
> > > > >> > >
> > > > >> > > Thanks for all your answers!
> > > > >> > >
> > > > >> > > It seems like I did not make my question clear. I have a text corpus
> > > > >> > > and I need to determine the pairs of words that occur together in many
> > > > >> > > documents. I need to do that to be able to measure the semantic
> > > > >> > > proximity between words. This method is expanded on here
> > > > >> > > <http://forums.searchenginewatch.com/showthread.php?t=48>.
> > > > >> > > I hope to find some code that, given a text corpus, generates all the
> > > > >> > > word pairs with their probability of occurring together.
> > > > >> > >
> > > > >> > >
> > > > >> > > On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> > > > >> > >
> > > > >> > > > There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs,
> > > > >> > > > and a few other things:
> > > > >> > > > http://sematext.com/products/key-phrase-extractor/index.html
> > > > >> > > >
> > > > >> > > > One of the demos that uses news data is at
> > > > >> > > > http://sematext.com/demo/kpe/index.html
> > > > >> > > >
> > > > >> > > > Otis
> > > > >> > > >  ----
> > > > >> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > >> > > > Lucene ecosystem search :: http://search-lucene.com/
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > ----- Original Message ----
> > > > >> > > > > From: Grant Ingersoll <gsingers@apache.org>
> > > > >> > > > > To: java-user@lucene.apache.org
> > > > >> > > > > Sent: Fri, August 20, 2010 8:52:17 AM
> > > > >> > > > > Subject: Re: Calculate Term Co-occurrence Matrix
> > > > >> > > > >
> > > > >> > > > > You might also be interested in Mahout's collocations package:
> > > > >> > > > > http://cwiki.apache.org/confluence/display/MAHOUT/Collocations
> > > > >> > > >  >
> > > > >> > > > > -Grant
> > > > >> > > > > On Aug 19, 2010, at 11:39 AM, ahmed algohary wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi all,
> > > > >> > > > > >
> > > > >> > > > > > I need to know if there is a Lucene plug-in or a Lucene-based API for
> > > > >> > > > > > calculating the term co-occurrence matrix for a given text corpus.
> > > > >> > > > > >
> > > > >> > > > > > Thanks!
> > > > >> > > >  > >
> > > > >> > > > > > --
> > > > >> > > > > >  Ahmed
> > > > >> > > >  >
> > > > >> > > > > --------------------------
> > > > >> > > > > Grant Ingersoll
> > > > >> > > > > http://www.lucidimagination.com/
> > > > >> > > > >
> > > > >> > > > > Search the Lucene ecosystem using Solr/Lucene:
> > > > >> > > > > http://www.lucidimagination.com/search
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > >  >
> > > > >> > > >  > ---------------------------------------------------------------------
> > > > >> > > >  > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > >> > > >  > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >> > > >  >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
>
>
