mahout-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: categorization on crawl data
Date Tue, 14 Jan 2014 09:44:04 GMT
It might seem like you want to do entity extraction, but that's not
trivial and Mahout won't directly help in that area.

Bertrand
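
To make "entity extraction" concrete for the "switches with 24 ports" case discussed in the quoted thread, here is a minimal sketch in plain Python. It is only an illustration: the two regular expressions and the function name `extract_attributes` are assumptions invented for this example, not part of Mahout, Nutch, or Solr, and real crawled pages would need far more patterns than this.

```python
import re

# Hypothetical patterns for illustration; real product pages vary much more.
PORTS_RE = re.compile(r"\b(\d{1,3})[\s-]*ports?\b", re.IGNORECASE)
DEVICE_RE = re.compile(r"\b(switch|router|firewall)(?:es|s)?\b", re.IGNORECASE)


def extract_attributes(text):
    """Return the first device type and port count found in free text.

    Either value is None when no match is found.
    """
    device = DEVICE_RE.search(text)
    ports = PORTS_RE.search(text)
    return {
        "device": device.group(1).lower() if device else None,
        "ports": int(ports.group(1)) if ports else None,
    }
```

Calling `extract_attributes("Managed Gigabit switch with 24 ports")` yields a small dictionary that can be stored as structured fields next to the crawled page. Even this toy version shows why the task is not trivial: every spelling variant ("24-port", "24 x RJ45", and so on) needs its own handling.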

On Tue, Jan 14, 2014 at 10:05 AM, Константин Слисенко
<kslisenko@gmail.com> wrote:

> Hi Vikas!
>
> As I understand it, you need to improve the indexing of your data for exact
> search. You can look at classification algorithms (
> http://mahout.apache.org/users/classification/classifyingyourdata.html).
> You can define topics and train a classifier. The classifier will then split
> your data into several groups, which you can then index.
>
> But I'm not sure Mahout is good for exact search, such as finding
> switches with exactly 24 ports. I think it would be better to index your data
> another way (using Hadoop): extract the exact parameters of every switch in
> the network, then import this data into a database with indexes. You can also
> integrate Lucene to store the database IDs.
>
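The "database with indexes" idea above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions: the table name `devices`, its columns, the index name, and the example.com URLs are all hypothetical, and a production setup would load rows produced by a Hadoop job rather than a hard-coded list.

```python
import sqlite3

# In-memory database for illustration; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE devices ("
    "id INTEGER PRIMARY KEY, url TEXT, device TEXT, ports INTEGER)"
)
# A composite index lets exact lookups on (device, ports) avoid a full scan.
conn.execute("CREATE INDEX idx_device_ports ON devices (device, ports)")

rows = [
    ("http://example.com/sw-a", "switch", 24),
    ("http://example.com/sw-b", "switch", 48),
    ("http://example.com/rt-a", "router", 4),
    ("http://example.com/sw-c", "switch", 24),
]
conn.executemany(
    "INSERT INTO devices (url, device, ports) VALUES (?, ?, ?)", rows
)


def find_devices(conn, device, ports):
    """Return URLs of devices matching the exact type and port count."""
    cur = conn.execute(
        "SELECT url FROM devices WHERE device = ? AND ports = ? ORDER BY url",
        (device, ports),
    )
    return [row[0] for row in cur]
```

A query such as `find_devices(conn, "switch", 24)` is an exact match on structured fields, which is a different problem from the ranked full-text search that Solr provides over the raw pages.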
>
> 2014/1/14 Vikas Parashar <vikas.parashar@fosteringlinux.com>
>
> > Thanks buddy,
> >
> > Actually, I have crawled data on my system, let's say "data related to all
> > firewall, switch and router domains". With Nutch I have crawled all the
> > data into my segments (according to depth).
> >
> > Luckily, I have Lucene Solr on top of HDFS. With its help, I
> > can easily search my data (like a Google search).
> >
> > Now my pain points begin: my client needs attribute-based search. For
> > example, I need to get all switches that have 24 ports. I supposed Mahout
> > would come into play for that type of search. I don't know whether I am
> > going in the right direction or not. But what I am thinking is that I
> > should be able to train my machine in such a way that it gives us the
> > desired results. We all know that the machine will take some time to give
> > us positive results, because every machine needs some time to become an
> > expert. But that is fine with me.
> >
> > But again, for that I need to categorize my crawled data into at least 3
> > parts (according to the above example).
> >
> > Any ideas on how I can achieve this?
> >
> >
> >
> >
> >
> >
> > On Tue, Jan 14, 2014 at 12:21 PM, Константин Слисенко
> > <kslisenko@gmail.com> wrote:
> >
> > > Hi Vikas!
> > >
> > > To categorize any data you can try clustering algorithms; see
> > > http://mahout.apache.org/users/clustering/clusteringyourdata.html.
> > > In my opinion, a simple algorithm to start with is k-means:
> > > http://mahout.apache.org/users/clustering/k-means-clustering.html.
> > >
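For readers unfamiliar with k-means, the following is a minimal in-memory sketch of the standard assign-and-update loop (Lloyd's algorithm) in plain Python. It is not Mahout's implementation, which runs as distributed jobs over vector files; the function name `kmeans` and its parameters are assumptions for this example, and the caller supplies the initial centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster equal-length numeric tuples around the given initial centroids.

    Returns (assignments, centroids), where assignments[i] is the index of
    the centroid that points[i] was assigned to.
    """
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    return assignments, centroids
```

With two well-separated groups of points and one initial centroid near each group, the loop converges after two passes. On real text data the points would be the document vectors, and the choice of k and of the starting centroids matters a great deal.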
> > > Which data do you have?
> > >
> > > If it is text data, you should first extract the text, then do some
> > > preprocessing for better quality: remove stop words (is, are, the, ...),
> > > convert words to lower case, and apply the Porter stem filter (
> > > http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html
> > > ).
> > > This can be done with a custom Lucene Analyzer. The result should be in
> > > Mahout's sequence file format. Then you need to vectorize the data (
> > > http://mahout.apache.org/users/basics/creating-vectors-from-text.html).
> > > Then run the clustering algorithm and interpret the results.
> > >
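The preprocessing steps described above (lower-casing, stop-word removal, stemming, then counting terms) can be sketched in plain Python as follows. Note the assumptions: `crude_stem` is a deliberately simple suffix stripper used as a stand-in, it is not the Porter algorithm, and the stop-word list and function names are hypothetical. In the setup described in the thread, a Lucene Analyzer and Mahout's vectorization tools would do this work instead.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"is", "are", "the", "a", "an", "and", "of", "to", "in", "with"}
# Crude stand-in for stemming. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "es", "s")


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenize, drop stop words, and stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def term_frequencies(text):
    """Map each processed term to its count: a simple term-frequency vector."""
    return Counter(preprocess(text))
```

The resulting counts are the simplest form of a document vector. Weighting schemes such as TF-IDF build on the same counts, and a clustering algorithm then operates on those vectors.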
> > > You can look at my experiments here:
> > > https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout
> > >
> > >
> > > 2014/1/13 Vikas Parashar <vikas.parashar@fosteringlinux.com>
> > >
> > > > Hi folks,
> > > >
> > > > Has anyone tried to do categorization on crawl data? If yes, then how
> > > > can I achieve this? Which algorithm will help me?
> > > >
> > > > --
> > > > Thanks & Regards:-
> > > > Vikas Parashar
> > > > Sr. Linux administrator Cum Developer
> > > > Mobile: +91 958 208 8852
> > > > Email: vikas.parashar@fosteringlinglinux.com
> > > >
> > >
> >
> >
> >
> > --
> > Thanks & Regards:-
> > Vikas Parashar
> > Sr. Linux administrator Cum Developer
> > Mobile: +91 958 208 8852
> > Email: vikas.parashar@fosteringlinglinux.com
> >
>
