mahout-dev mailing list archives

From "Robin Anil" <robin.a...@gmail.com>
Subject Re: LDA [was RE: Taste on Mahout]
Date Sat, 07 Jun 2008 11:36:04 GMT
Hi,
     There are some LDA/CRF implementations available online. They might prove
useful when writing the code.

*GibbsLDA++* <http://gibbslda.sourceforge.net/>: A C/C++ implementation of
Latent Dirichlet Allocation (LDA) using Gibbs sampling for parameter estimation
and inference. GibbsLDA++ is fast and is designed to analyze hidden/latent
topic structures of large-scale (text) data collections.

*CRFTagger* <http://crftagger.sourceforge.net/>: A Java-based Conditional
Random Fields Part-of-Speech (POS) Tagger for English. The model was trained on
sections 01..24 of the WSJ corpus, using section 00 as the development test set
(accuracy of 97.00%). Tagging speed: 500 sentences/second.

*CRFChunker* <http://crfchunker.sourceforge.net/>: A Java-based Conditional
Random Fields Phrase Chunker (phrase chunking tool) for English. The model was
trained on sections 01..24 of the WSJ corpus, using section 00 as the
development test set (F1-score of 95.77). Chunking speed: 700 sentences/second.

*JTextPro* <http://jtextpro.sourceforge.net/>: A Java-based text processing
tool that includes sentence boundary detection (using a maximum entropy
classifier), word tokenization (following the Penn convention), part-of-speech
tagging (using CRFTagger), and phrase chunking (using CRFChunker).

*JWebPro* <http://jwebpro.sourceforge.net/>: A Java-based tool that can
interact with Google search via the Google Web APIs and then process the
returned Web documents in a couple of ways. The outputs of JWebPro can serve
as inputs for natural language processing, information retrieval,
information extraction, Web data mining, online social network
extraction/analysis, and ontology development applications.

*JVnSegmenter* <http://jvnsegmenter.sourceforge.net/>: A Java-based,
open-source Vietnamese word segmentation tool. The segmentation model in this
tool was trained on about 8,000 labeled sentences using FlexCRFs. It would be
useful for the Vietnamese NLP community.

*FlexCRFs* <http://flexcrfs.sourceforge.net/>: Flexible Conditional Random
Fields (including PCRFs, a parallel version of FlexCRFs).

*CRF++* <http://crfpp.sourceforge.net/>: Yet Another CRF toolkit.
Robin
On Thu, Jun 5, 2008 at 9:59 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The Buntine and Jakulin paper is also useful reading.  I would avoid fancy
> stuff like the powell rao-ization to start.
>
> http://citeseer.ist.psu.edu/750239.html
>
> The Gibbs sampling approach is, at its heart, very simple in that most of
> the math devolves into sampling discrete hidden variables from simple
> distributions and then counting the results as if they were observed.
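
As a concrete picture of that "sample, then count" loop, here is a minimal,
self-contained Java sketch of the collapsed Gibbs update for LDA. The class
name, field names, and flat count-array layout are chosen purely for
illustration; this is not code from GibbsLDA++ or from the LdaGibbsSampler.java
class mentioned further down the thread.

import java.util.Random;

/**
 * Minimal collapsed Gibbs sampler for LDA: assign a hidden topic to every token,
 * then repeatedly resample each assignment and re-count it as if it were observed.
 * Illustrative sketch only.
 */
public class LdaGibbsSketch {
  final int K, V;            // number of topics, vocabulary size
  final double alpha, beta;  // symmetric Dirichlet hyperparameters
  final int[][] docs;        // docs[d][i] = word id of token i in document d
  final int[][] z;           // z[d][i]    = current topic of that token (the hidden variable)
  final int[][] nDK;         // nDK[d][k]  = tokens in doc d currently assigned to topic k
  final int[][] nKW;         // nKW[k][w]  = occurrences of word w currently assigned to topic k
  final int[] nK;            // nK[k]      = total tokens currently assigned to topic k
  final Random rng = new Random(42);

  LdaGibbsSketch(int[][] docs, int V, int K, double alpha, double beta) {
    this.docs = docs; this.V = V; this.K = K; this.alpha = alpha; this.beta = beta;
    this.z = new int[docs.length][];
    this.nDK = new int[docs.length][K];
    this.nKW = new int[K][V];
    this.nK = new int[K];
    for (int d = 0; d < docs.length; d++) {        // random initial topic assignments
      z[d] = new int[docs[d].length];
      for (int i = 0; i < docs[d].length; i++) {
        int k = rng.nextInt(K);
        z[d][i] = k; nDK[d][k]++; nKW[k][docs[d][i]]++; nK[k]++;
      }
    }
  }

  /** One full Gibbs sweep: resample every token's topic from its conditional. */
  void sweep() {
    double[] p = new double[K];
    for (int d = 0; d < docs.length; d++) {
      for (int i = 0; i < docs[d].length; i++) {
        int w = docs[d][i], old = z[d][i];
        nDK[d][old]--; nKW[old][w]--; nK[old]--;   // remove the current assignment
        double sum = 0.0;
        for (int k = 0; k < K; k++) {              // p(z = k | everything else), unnormalized
          p[k] = (nKW[k][w] + beta) / (nK[k] + V * beta) * (nDK[d][k] + alpha);
          sum += p[k];
        }
        double u = rng.nextDouble() * sum;         // draw from that discrete distribution
        int k = 0;
        for (double acc = p[0]; acc < u && k < K - 1; acc += p[++k]) { }
        z[d][i] = k; nDK[d][k]++; nKW[k][w]++; nK[k]++;  // count it as if it were observed
      }
    }
  }

  public static void main(String[] args) {
    int[][] toyDocs = { {0, 1, 2, 1}, {3, 4, 3, 5}, {0, 2, 4, 5} };  // tiny toy corpus
    LdaGibbsSketch lda = new LdaGibbsSketch(toyDocs, 6, 2, 0.5, 0.1);
    for (int iter = 0; iter < 200; iter++) lda.sweep();
    System.out.println("topic of doc 0, token 0: " + lda.z[0][0]);
  }
}

Once the chain has mixed, the per-topic word distributions can be read off by
normalizing nKW with the same beta smoothing used in the conditional above.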
>
> On Thu, Jun 5, 2008 at 5:49 AM, Goel, Ankur <Ankur.Goel@corp.aol.com>
> wrote:
>
> > It draws from the reference Java implementation -
> > http://www.arbylon.net/projects/LdaGibbsSampler.java -
> > which is a single-class version of LDA using Gibbs sampling with
> > slightly better code documentation.
> > I am trying to understand the code while reading the paper you suggested
> > -
> > "Distributed Inference for Latent Drichlet Allocation".
> >
> > -----Original Message-----
> > From: Daniel Kluesing [mailto:daniel@ilike-inc.com]
> > Sent: Wednesday, June 04, 2008 8:31 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted may have a better one, but in my quick poking around at things
> > http://gibbslda.sourceforge.net/ looks to be a good implementation of
> > the Gibbs sampling approach.
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > Sent: Wednesday, June 04, 2008 4:58 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted, do you have a sequential LDA implementation that can be
> > used for reference?
> > If yes, can you please post it on Jira? Should we open a new Jira or
> > use MAHOUT-30 for this?
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: Tuesday, May 27, 2008 11:50 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: LDA [was RE: Taste on Mahout]
> >
> > Chris Bishop's book has a very clear exposition of the relationship
> > between the variational techniques and EM.  Very good reading.
> >
> > On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur <Ankur.Goel@corp.aol.com>
> > wrote:
> >
> > > Daniel/Ted,
> > >      Thanks for the interesting pointers to more information on LDA
> > > and EM.
> > > I am going through the docs to visualize and understand how the LDA
> > > approach would work for my specific case.
> > >
> > > Once I have some idea, I can volunteer to work on the Map-Reduce side
> > > of things, as this is something that will benefit both my project and the
> > > community.
> > >
> > > Looking forward to sharing more ideas/information on this :-)
> > >
> > > Regards
> > > -Ankur
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: Tuesday, May 27, 2008 6:59 AM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: LDA [was RE: Taste on Mahout]
> > >
> > > Those are both new to me.  Both look interesting.  My own experience
> > > is that the simplicity of Gibbs sampling makes it very much more
> > > attractive for implementation.  Also, since it is (nearly) trivially
> > > parallelizable, it is more likely we will get a useful implementation
> > > right off the bat.
> > >
> > > On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing
> > > <daniel@ilike-inc.com>
> > > wrote:
> > >
> > > > (Hijacking the thread to discuss ways to implement LDA)
> > > >
> > > > Had you seen
> > > > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf ?
> > > >
> > > > Their hierarchical distributed LDA formulation uses Gibbs sampling and
> > > > fits into mapreduce.
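
The map-reduce fit there is, roughly, a partition-and-merge scheme: each mapper
runs ordinary Gibbs sweeps over its own shard of documents against a stale copy
of the global topic-word counts, and a reduce step folds each shard's count
deltas back into the global table before the next pass. A rough sketch of that
merge step in plain Java follows; the class, method, and variable names are
invented for illustration, and this is neither the paper's code nor the Hadoop
API.

import java.util.List;

/**
 * Sketch of the partition-and-merge step for distributed Gibbs LDA: after a pass,
 * fold every shard's local topic-word count changes back into the global table.
 * Illustrative names only.
 */
class TopicCountMergeSketch {
  /** global += (local - stale) for each shard's topic-word count table. */
  static void merge(int[][] global, int[] globalTotals,
                    int[][] stale, List<int[][]> localPerShard) {
    for (int[][] local : localPerShard) {
      for (int k = 0; k < global.length; k++) {
        for (int w = 0; w < global[k].length; w++) {
          int delta = local[k][w] - stale[k][w];   // what this shard's sampling changed
          global[k][w] += delta;
          globalTotals[k] += delta;
        }
      }
    }
  }
}

The "stale" table here is the snapshot of the global counts that was handed to
every shard at the start of the pass, so the merge simply accumulates what each
shard changed.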
> > > >
> > > > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a mapreduce
> > > > formulation for the variational EM method.
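
For comparison, variational EM splits across map-reduce along its E/M boundary:
the per-document variational E-step reads only the current model, so it can run
in the mappers and emit expected topic-word counts, while the M-step is a
sum-and-normalize in the reducer. Below is a structural sketch in Java, with the
per-document variational updates themselves left out and all names invented for
illustration; it is not the formulation from the paper above and not Hadoop code.

import java.util.HashMap;
import java.util.Map;

/** Structural sketch of map-reduce variational EM for LDA (illustrative only). */
class VariationalEmSketch {
  /** "map": per-document E-step, emitting expected topic-word counts for one doc. */
  static Map<String, Double> eStep(int[] doc, double[][] topicWord) {
    Map<String, Double> expected = new HashMap<String, Double>();
    // The per-document variational updates (the gamma/phi iterations) would go here;
    // they read only the current model, so every document can be mapped independently.
    return expected;                        // keys "k:w", values = expected counts
  }

  /** "reduce" + M-step: sum expected counts over documents, renormalize each topic. */
  static double[][] mStep(Iterable<Map<String, Double>> perDocCounts, int K, int V) {
    double[][] model = new double[K][V];
    for (Map<String, Double> counts : perDocCounts) {
      for (Map.Entry<String, Double> e : counts.entrySet()) {
        String[] kw = e.getKey().split(":");
        model[Integer.parseInt(kw[0])][Integer.parseInt(kw[1])] += e.getValue();
      }
    }
    for (int k = 0; k < K; k++) {           // normalize each topic row to a distribution
      double sum = 0.0;
      for (int w = 0; w < V; w++) sum += model[k][w];
      for (int w = 0; w < V; w++) model[k][w] = sum > 0 ? model[k][w] / sum : 1.0 / V;
    }
    return model;
  }
}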
> > > >
> > > > I'm still chewing on them, but my first impression is that the EM
> > > > approach would give better performance on bigger data sets. Opposing
> > > > views welcome.
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ted
> >
>
>
>
> --
> ted
>



-- 
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself
