mahout-user mailing list archives

From Константин Слисенко <>
Subject Re: categorization on crawl data
Date Tue, 14 Jan 2014 06:51:17 GMT
Hi Vikas!

To categorize any kind of data, you can try clustering algorithms; see this
In my opinion, the simplest algorithm is k-means.
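
To give a feel for what k-means does, here is a minimal plain-Python sketch. It is not Mahout's implementation; the function name `kmeans`, its parameters and the sample points are made up for illustration. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means (Lloyd's algorithm) on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
labels, centers = kmeans(points, k=2)
print(labels)
```

On real data you have to pick k yourself and the result depends on the starting centroids, so it is worth running it several times.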

What kind of data do you have?

If it is text data, you should first extract the text, then do some
preprocessing for better quality: remove stop words (is, are, the, ...),
convert words to lower case, and apply a Porter stem filter (
This can be done with a custom Lucene Analyzer. The result should be in Mahout
sequence file format. Then you need to vectorize the data (
Then run the clustering algorithm and interpret the results.
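
The preprocessing and vectorization steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses neither Lucene nor Mahout, the stop-word list is a tiny sample, `crude_stem` is a rough suffix stripper standing in for a real Porter stemmer, and TF-IDF is assumed as the weighting scheme. All names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"is", "are", "the", "a", "an", "and", "of", "to", "in"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, one dense TF-IDF vector per document)."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical sample documents.
docs = ["The cats are sleeping", "Dogs are barking loudly", "The cat is barking"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
```

The resulting vectors are what a clustering algorithm such as k-means takes as input; terms that occur in every document get weight zero and do not influence the clusters.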

You can look at my experiments here

2014/1/13 Vikas Parashar <>

> Hi folks,
> Has anyone tried to do categorization on crawl data? If so, how can I
> achieve this? Which algorithm will help me?
> --
> Thanks & Regards:-
> Vikas Parashar
> Sr. Linux administrator Cum Developer
> Mobile: +91 958 208 8852
> Email:
