mahout-user mailing list archives

From Константин Слисенко <kslise...@gmail.com>
Subject Re: categorization on crawl data
Date Tue, 14 Jan 2014 06:51:17 GMT
Hi Vikas!

To categorize any kind of data, you can try clustering algorithms; see this
link: http://mahout.apache.org/users/clustering/clusteringyourdata.html.
In my opinion, the simplest algorithm is k-means:
http://mahout.apache.org/users/clustering/k-means-clustering.html.
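
To give a rough idea of what k-means does, here is a minimal sketch in plain
Python. It is not Mahout code: the function name kmeans and its parameters are
made up for illustration, and seeding the centroids with the first k points is
a simplification (real implementations usually pick better starting points).

```python
def kmeans(points, k, max_iter=100):
    """Cluster points (equal-length tuples of numbers) into k groups."""
    # Simplifying assumption: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop once neither the assignments nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids
```

Calling kmeans(points, 2) on a list of 2-D tuples returns one cluster label per
point plus the final centroids. For text, the points would be the document
vectors produced by the vectorization step.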

What kind of data do you have?

If it is text data, you should first extract the text, then preprocess it
for better quality: remove stop words (is, are, the, ...), convert words to
lower case, and apply the Porter stem filter (
http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html).
This can be done with a custom Lucene Analyzer. The result should be in Mahout
sequence file format. Then you need to vectorize the data (
http://mahout.apache.org/users/basics/creating-vectors-from-text.html).
Finally, run the clustering algorithm and interpret the results.
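
The preprocessing and vectorization steps can be sketched roughly like this
in plain Python. This is only an illustration, not Lucene or Mahout code: the
names STOP_WORDS, crude_stem, preprocess and vectorize are hypothetical, the
stop-word list is tiny, and crude_stem just strips a few suffixes as a
stand-in for a real Porter stemmer. The vectors are simple term counts.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"is", "are", "the", "a", "an", "and", "of", "to", "in"}


def crude_stem(word):
    # Stand-in for a real Porter stemmer: strips a few common suffixes only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Lower-case, split into alphabetic tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def vectorize(documents):
    # Build a sorted vocabulary and one term-count vector per document.
    token_lists = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors
```

For example, vectorize(["The cat sleeps", "Cats and dogs"]) yields the
vocabulary ["cat", "dog", "sleep"] and the vectors [1, 0, 1] and [1, 1, 0].
Vectors of this shape are what a clustering algorithm takes as input.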

You can look at my experiments here
https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout


2014/1/13 Vikas Parashar <vikas.parashar@fosteringlinux.com>

> Hi folks,
>
> Has anyone tried to do categorization on crawl data? If so, how can I
> achieve this? Which algorithm will help me?
>
> --
> Thanks & Regards:-
> Vikas Parashar
> Sr. Linux administrator Cum Developer
> Mobile: +91 958 208 8852
> Email: vikas.parashar@fosteringlinux.com
>
