Date: Tue, 14 Jan 2014 09:51:17 +0300
Subject: Re: categorization on crawl data
From: Константин Слисенко
To: user@mahout.apache.org

Hi Vikas!

To categorize arbitrary data you can try clustering algorithms; see
http://mahout.apache.org/users/clustering/clusteringyourdata.html.
In my opinion the simplest one to start with is k-means:
http://mahout.apache.org/users/clustering/k-means-clustering.html.

What kind of data do you have? If it is text, first extract the text, then
preprocess it for better quality: remove stop words (is, are, the, ...),
convert words to lower case, and apply the Porter stem filter
(http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html).
This can be done with a custom Lucene Analyzer. The result should be stored in
the Mahout sequence file format.

Then vectorize the data
(http://mahout.apache.org/users/basics/creating-vectors-from-text.html),
run the clustering algorithm, and interpret the results.

You can look at my experiments here:
https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout

2014/1/13 Vikas Parashar

> Hi folks,
>
> Have anyone tried to do categorization on crawl data. If yes then how can i
> achieve this? Which algorithm will help me?
>
> --
> Thanks & Regards:-
> Vikas Parashar
> Sr. Linux administrator Cum Developer
> Mobile: +91 958 208 8852
> Email: vikas.parashar@fosteringlinglinux.com
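For illustration, the steps described in the reply (preprocess, vectorize, cluster with k-means) can be sketched in a few lines of plain Python. This is not Mahout or Lucene code, only a rough model of the same pipeline under stated assumptions: the stop-word list is a tiny made-up one, the suffix stripper is a crude stand-in for real Porter stemming, the function names (preprocess, vectorize, kmeans) are hypothetical, and the four sample "pages" are invented.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real analyzers use longer ones).
STOP_WORDS = {"is", "are", "the", "a", "an", "of", "and", "to", "in", "on", "for"}

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and crudely strip suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Crude stand-in for Porter stemming: strip at most one common suffix.
        for suffix in ("ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems

def vectorize(docs):
    """Turn token lists into L2-normalized term-frequency vectors."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for term, count in Counter(doc).items():
            vec[index[term]] = float(count)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def kmeans(vectors, k, iterations=10):
    """Plain k-means with Euclidean distance; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, vec in enumerate(vectors):
            assignment[i] = min(range(k), key=lambda c: math.dist(vec, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Invented sample documents standing in for extracted page text.
pages = [
    "Installing Linux servers and configuring the kernel",
    "Cooking pasta recipes and baking breads",
    "Linux kernel modules for servers",
    "Baking recipes for pasta and breads",
]
tokens = [preprocess(p) for p in pages]
vocab, vectors = vectorize(tokens)
labels = kmeans(vectors, k=2)
print(labels)  # prints [0, 1, 0, 1]
```

The two Linux pages land in one cluster and the two cooking pages in the other. Interpreting the result still means looking at the top terms of each cluster and naming the category yourself, since k-means only groups similar documents and does not label them. On real crawl data the number of clusters k and the seeding strategy would need tuning.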