mahout-user mailing list archives

From Suneel Marthi <>
Subject Re: mahout text mining
Date Fri, 17 Jan 2014 04:08:35 GMT
for classifying twitter messages.

Lucene supports n-grams, stop words, the Porter and Snowball stemmers, language-specific
analyzers, and more.
Mahout uses Lucene for vectorization (part of Mahout's seq2sparse process).
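
To make the terms above concrete, here is a minimal plain-Python sketch of what an analysis chain does to a short message before vectorization: lowercasing, tokenizing, dropping stop words, and emitting n-grams. This is not Lucene's or Mahout's API; the function names and the tiny stop-word list are made up for illustration, and stemming is left out entirely.

```python
import re

# Illustrative stop-word list only (an assumption for this sketch);
# real analyzers ship much larger, language-specific lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, max_n=2):
    """Return all n-grams of size 1..max_n, each joined by a single space."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text, max_n=2):
    """Hypothetical helper: tokenize, filter stop words, then build n-grams."""
    return ngrams(remove_stop_words(tokenize(text)), max_n)


if __name__ == "__main__":
    print(analyze("The movie is great and the acting is superb"))
```

Each resulting term (unigram or bigram) would become one dimension of the document vector; a real pipeline would typically also apply a stemmer before building n-grams.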

On Thursday, January 16, 2014 10:57 PM, qiaoresearcher <> wrote:
Mahout has an example that uses naive Bayes to classify the 20 Newsgroups data set, but
how can I classify individual paragraphs (e.g. Twitter messages, movie reviews) stored in
text files?

The text files have content like:
text paragraph 1        class a
text paragraph 2        class b
text paragraph 3        class a
text paragraph 4        class b
...                     ...

Does it support n-grams, stemming, stop words, etc.?

Thanks for any suggestions.
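
One possible way to reuse the 20 Newsgroups workflow for a file like the one described is to split it into one small text file per paragraph, grouped into one directory per class, since that example reads its input from per-category directories. The sketch below assumes each line holds the text and its label separated by a tab, which the original message does not actually specify; the function name `split_by_class` and the output layout are likewise assumptions for illustration.

```python
import os


def split_by_class(input_path, output_dir, sep="\t"):
    """Write each labeled line to output_dir/<label>/<line_no>.txt.

    Assumes one example per line in the form "<text><sep><label>".
    Returns a dict mapping each label directory name to its document count.
    """
    counts = {}
    with open(input_path, encoding="utf-8") as src:
        for line_no, line in enumerate(src, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the LAST separator so the text itself may contain it.
            text, found, label = line.rpartition(sep)
            text, label = text.strip(), label.strip()
            if not found or not text or not label:
                continue  # skip lines that do not match the assumed format
            # Spaces in labels become underscores in the directory name.
            dir_name = label.replace(" ", "_")
            class_dir = os.path.join(output_dir, dir_name)
            os.makedirs(class_dir, exist_ok=True)
            doc_path = os.path.join(class_dir, f"{line_no}.txt")
            with open(doc_path, "w", encoding="utf-8") as dst:
                dst.write(text + "\n")
            counts[dir_name] = counts.get(dir_name, 0) + 1
    return counts
```

Calling `split_by_class("reviews.tsv", "reviews-by-class")` (both names hypothetical) would leave a directory tree that can be fed to the same steps the 20 Newsgroups example uses, assuming the labels are then taken from the directory names as in that example.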