mahout-user mailing list archives

From Arian Pasquali <ar...@arianpasquali.com>
Subject Re: word weights using BM25
Date Wed, 01 Oct 2014 12:52:54 GMT
Hi Ted,

My dataset is a collection of documents in German, and I can say that the
scores seem better compared to my TFIDF scores. The results make more sense
now, especially for my bi-grams.
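
For anyone following the thread, here is a minimal sketch of what a BM25
term weight can look like. It is not the code I ran: the class name
BM25Weight, the constructor parameters and the defaults in the usage note
are hypothetical. The calculate(tf, df, length, numDocs) signature is
assumed to mirror org.apache.mahout.vectorizer.Weight, so check it against
your Mahout version. The sketch is plain Okapi BM25 with a non-negative IDF
variant, not BM25+, and it takes the average document length through the
constructor because that value is not one of the calculate arguments.

```java
/**
 * Sketch of an Okapi BM25 term weight (hypothetical class, not part of Mahout).
 * The calculate(...) signature is assumed to match Mahout's Weight interface.
 */
final class BM25Weight {
    private final double k1;           // term-frequency saturation
    private final double b;            // strength of length normalization, 0..1
    private final double avgDocLength; // average document length in the corpus

    BM25Weight(double k1, double b, double avgDocLength) {
        if (avgDocLength <= 0.0) {
            throw new IllegalArgumentException("avgDocLength must be positive");
        }
        this.k1 = k1;
        this.b = b;
        this.avgDocLength = avgDocLength;
    }

    /**
     * @param tf      frequency of the term in the document
     * @param df      number of documents containing the term
     * @param length  length of the document, in terms
     * @param numDocs total number of documents
     */
    double calculate(int tf, int df, int length, int numDocs) {
        if (tf <= 0) {
            return 0.0; // a term that does not occur contributes nothing
        }
        // Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        double idf = Math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5));
        // Length normalization: longer-than-average documents are penalized.
        double norm = k1 * (1.0 - b + b * length / avgDocLength);
        return idf * (tf * (k1 + 1.0)) / (tf + norm);
    }
}
```

With the commonly used values k1 = 1.2 and b = 0.75, a call such as
new BM25Weight(1.2, 0.75, 100.0).calculate(3, 5, 100, 1000) gives the weight
of a term that occurs 3 times in a 100-term document and appears in 5 of
1000 documents.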




Arian Pasquali
http://about.me/arianpasquali

2014-10-01 13:09 GMT+01:00 Ted Dunning <ted.dunning@gmail.com>:

> Thanks so much for the feedback.  Glad to hear it was straightforward.
>
>
> But the important question is ....
>
> how did BM25 work for you?
>
>
>
> On Wed, Oct 1, 2014 at 6:18 AM, Arian Pasquali <arian@arianpasquali.com>
> wrote:
>
> > Hey guys,
> > I think it is fair to give you some feedback.
> > I managed to implement BM25+ <http://en.wikipedia.org/wiki/Okapi_BM25>
> > term scoring in Mahout.
> > It was straightforward using the current TFIDF implementation as an
> > example.
> >
> > Basically what I did was implement the interface
> > org.apache.mahout.vectorizer.Weight, create a BM25Converter and
> > BM25PartialVectorReducer similar to TFIDFConverter
> > <https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.html>
> > and TFIDFPartialVectorReducer
> > <https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.html>
> > respectively.
> >
> > cheers
> > Arian
> >
> > Arian Pasquali
> > http://about.me/arianpasquali
> >
> > 2014-09-24 14:14 GMT+01:00 Arian Pasquali <arian@arianpasquali.com>:
> >
> > > Yes,
> > > I'm studying his work <http://nlp.uned.es/~jperezi/Lucene-BM25/> and
> > > Mahout's current tfidf code.
> > > I'm trying to understand how I would port that to MapReduce.
> > > I'll try to share something if I succeed.
> > >
> > > Arian Pasquali
> > > http://about.me/arianpasquali
> > >
> > > 2014-09-24 5:12 GMT+01:00 Suneel Marthi <suneel.marthi@gmail.com>:
> > >
> > >> Lucene 4.x supports okapi-bm25. So it should be easy to implement.
> > >>
> > >> On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning <ted.dunning@gmail.com>
> > >> wrote:
> > >>
> > >> > Should be pretty easy. I haven't heard of anyone doing it.
> > >> >
> > >> > Sent from my iPhone
> > >> >
> > >> > > On Sep 23, 2014, at 18:53, Arian Pasquali
> > >> > > <arian@arianpasquali.com> wrote:
> > >> > >
> > >> > > Hi,
> > >> > > I was wondering if it would be possible to support bm25 term
> > >> > > weighting by extending Mahout's tf-idf implementation.
> > >> > >
> > >> > > I was curious to know if anyone here has already tried to do so.
> > >> > > If not, what would be your suggestion for such an implementation
> > >> > > on Mahout?
> > >> > >
> > >> > >
> > >> > > Arian Pasquali
> > >> > > http://about.me/arianpasquali
> > >> >
> > >>
> > >
> > >
> >
>
