mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: mahout tf-idf vs lucene tf-idf
Date Mon, 06 Jun 2016 17:02:27 GMT
To add to Ted's reply: Mahout has traditionally offered bigram/trigram
analysis as part of its tf-idf conversion (a step away from the bag-of-words
model, in which statistically stable, order-sensitive combinations of 2 or 3
words are each reduced to their own term). However, this has not been ported
to the Spark/H2O/Flink engines and is available only as a legacy MapReduce
algorithm.
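
For illustration only, here is a minimal Python sketch of the idea of
collapsing statistically stable bigrams into single terms before tf-idf
weighting. It is not Mahout's code. It assumes a log-likelihood ratio (LLR)
test as the stability criterion, which is, as far as I recall, what the
legacy MapReduce collocation step uses. The function names, the "_" joiner
and the min_llr threshold are all hypothetical, and replacing the two words
with the merged term is a simplification.

```python
import math
from collections import Counter


def _x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return x * math.log(x) if x > 0 else 0.0


def _entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: count of (a, b); k12: (a, not b); k21: (not a, b); k22: neither.
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def stable_bigrams(docs, min_llr=1.0):
    """Return the set of ordered word pairs whose LLR is >= min_llr.

    docs is a list of token lists; min_llr is an assumed, tunable threshold.
    """
    pairs = Counter()
    for tokens in docs:
        pairs.update(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    first, second = Counter(), Counter()
    for (a, b), n in pairs.items():
        first[a] += n   # bigrams starting with a
        second[b] += n  # bigrams ending with b
    keep = set()
    for (a, b), k11 in pairs.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = total - k11 - k12 - k21
        if llr(k11, k12, k21, k22) >= min_llr:
            keep.add((a, b))
    return keep


def merge_bigrams(tokens, keep):
    """Replace each kept (a, b) pair with the single term "a_b"."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in keep:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

The merged token lists can then be fed to any ordinary tf-idf weighting, so
that a pair such as ("new", "york") is counted as one term rather than two
unrelated words.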

On Sat, Jun 4, 2016 at 2:14 AM, forme book <forbookmail@gmail.com> wrote:

> Hi,
>
> I'm starting to study text processing, and I see that to evaluate two texts
> it is possible to obtain a vector model through the TF-IDF technique.
>
> With Mahout it is possible to create vectors from text using
> lucene.vector which, if I have understood correctly, takes a Lucene index
> and maps it to tf-idf vectors.
>
> Lucene already has this implementation by default. What I struggle to
> understand is the advantage of having lucene.vector in Mahout when Lucene
> offers that feature out of the box.
>
> Maybe I'm missing something big, but what's the connection between them?
> Could you please explain a possible use case?
>
> Thanks for help
>
> Richard
>
