From: Bertrand Dechoux
To: user@mahout.apache.org
Date: Tue, 14 Jan 2014 10:44:04 +0100
Subject: Re: categorization on crawl data

It might seem like you would want to do entity extraction, but that is not trivial and Mahout won't directly help in that area.

Bertrand

On Tue, Jan 14, 2014 at 10:05 AM, Константин Слисенко wrote:

> Hi Vikas!
>
> As I understand it, you need to improve the indexing of your data for
> exact search. You can look at classification algorithms
> (http://mahout.apache.org/users/classification/classifyingyourdata.html).
> You can define topics and train a classifier. The classifier will then
> split your data into several groups, and you can index each group.
>
> But I'm not sure that Mahout is a good fit for exact search, e.g. finding
> switches with exactly 24 ports. It may be better to index your data
> another way (using Hadoop): extract the exact parameters of every switch
> on the network, then import that data into a database with indexes. You
> can also integrate Lucene to store the database IDs.
>
> 2014/1/14 Vikas Parashar
>
> > Thanks buddy,
> >
> > Actually, I have crawled data in my system. Let's say "data related to
> > all firewall, switch and router domains". With Nutch I have crawled all
> > the data into my segments (according to depth).
> >
> > Luckily, I have Lucene/Solr on top of HDFS. With the help of this, I
> > can easily search (like a Google search) in my data.
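Konstantin's suggestion above (extract the exact parameters of each device, then index them in a database) can be sketched without Mahout at all. A minimal, hypothetical illustration in plain Java; the regex, class name, and sample text are illustrative, not from the thread:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortCountExtractor {
    // Matches phrases like "24-port", "24 port", or "24 ports" in crawled text.
    private static final Pattern PORTS =
            Pattern.compile("(\\d+)[\\s-]?ports?", Pattern.CASE_INSENSITIVE);

    // Returns the first port count found in the page text, if any.
    static Optional<Integer> portCount(String text) {
        Matcher m = PORTS.matcher(text);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1)))
                        : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "The Catalyst 2960 is a 24-port managed switch.";
        System.out.println(portCount(page).orElse(-1)); // prints 24
    }
}
```

Values extracted this way can then be stored as indexed columns, so "all switches with 24 ports" becomes an exact database query rather than a machine-learning problem.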
> > Now my pain points begin: my client needs attribute-type search. For
> > example, I need to get all switches that have 24 ports. For that type
> > of search, I supposed Mahout would come into play. I don't know whether
> > I am going in the right direction or not, but what I am thinking is
> > that I should be able to train my machine in such a way that it gives
> > us the desired results. We all know the machine will take some time to
> > give us a positive result, because every machine needs some time to
> > become an expert. But that is fine with me.
> >
> > But again, for that we need to categorize my crawled data into at
> > least 3 parts (according to the above example).
> >
> > Any guess how I can achieve this?
> >
> > On Tue, Jan 14, 2014 at 12:21 PM, Константин Слисенко wrote:
> >
> > > Hi Vikas!
> > >
> > > For categorizing any data you can try clustering algorithms; see this
> > > link: http://mahout.apache.org/users/clustering/clusteringyourdata.html.
> > > The simplest algorithm, in my opinion, is k-means:
> > > http://mahout.apache.org/users/clustering/k-means-clustering.html.
> > >
> > > Which data do you have?
> > >
> > > If it is text data, you should first extract the text, then do some
> > > preprocessing for better quality: remove stop words (is, are, the, ...),
> > > switch words to lower case, and also use the Porter stem filter
> > > (http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html).
> > > This can be done with a custom Lucene Analyzer. The result should be
> > > in Mahout's sequence file format. Then you need to vectorize the data
> > > (http://mahout.apache.org/users/basics/creating-vectors-from-text.html).
> > > Then run the clustering algorithm and interpret the results.
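The preprocess-then-vectorize pipeline described above is normally built from a custom Lucene Analyzer plus Mahout's vectorization tools. As a toy illustration of the same idea, here is a plain-Java sketch with a made-up stop-word list and no real Porter stemming (assumptions, not the thread's actual code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ToyVectorizer {
    // Tiny stand-in for Lucene's English stop-word set.
    private static final Set<String> STOP_WORDS =
            Set.of("is", "are", "the", "a", "an", "and", "of", "to");

    // Lowercase, split on non-letters, drop stop words.
    static List<String> preprocess(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    // Term-frequency vector: token -> count. Mahout's vectorizer builds
    // TF-IDF vectors from sequence files the same way, just at HDFS scale.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : preprocess(text)) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The switch is a 24-port switch"));
    }
}
```

Clustering then operates on these vectors; in Mahout the real stemming and stop-word handling would come from the Lucene Analyzer mentioned above.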
> > > You can look at my experiments here:
> > > https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout
> > >
> > > 2014/1/13 Vikas Parashar
> > >
> > > > Hi folks,
> > > >
> > > > Has anyone tried to do categorization on crawl data? If yes, how
> > > > can I achieve this? Which algorithm will help me?
> > > >
> > > > --
> > > > Thanks & Regards:-
> > > > Vikas Parashar
> > > > Sr. Linux Administrator cum Developer
> > > > Mobile: +91 958 208 8852
> > > > Email: vikas.parashar@fosteringlinglinux.com
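The k-means algorithm recommended in the thread can be shown in miniature. A deliberately tiny one-dimensional version in plain Java, with hypothetical data; Mahout runs the same assign/update iteration as MapReduce jobs over the vectorized documents:

```java
import java.util.Arrays;

public class TinyKMeans {
    // One-dimensional k-means: returns the centroids after a fixed number
    // of iterations of the classic assignment and update steps.
    static double[] kmeans(double[] points, double[] centroids, int iters) {
        double[] c = centroids.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {            // assignment step
                int best = 0;
                for (int j = 1; j < c.length; j++) {
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                }
                sum[best] += p;
                count[best]++;
            }
            for (int j = 0; j < c.length; j++) { // update step
                if (count[j] > 0) c[j] = sum[j] / count[j];
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] result = kmeans(points, new double[]{0.0, 10.0}, 5);
        System.out.println(Arrays.toString(result)); // two cluster centers
    }
}
```

On real data the points would be the TF-IDF vectors produced by the preprocessing pipeline, distance would be computed over many dimensions, and the number of clusters (here 2; 3 in Vikas's firewall/switch/router example) has to be chosen up front.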