mahout-user mailing list archives

From: Robin Anil <robin.a...@gmail.com>
Subject: Re: Re : Good starting instance for AMI
Date: Tue, 12 Jan 2010 03:33:05 GMT
Since I don't have a personal Linux box these days, I code in Eclipse on
Windows, then fire up an instance, attach the EBS volume, and patch and test
my code. Yes, I have only tried a single node so far.
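
For reference, the attach step can be scripted with boto as well; this is
only a minimal sketch (the volume id, instance id, and device name are
placeholders):

    import boto

    # Attach an existing EBS volume (holding the source tree / test data)
    # to a freshly started instance; the ids below are placeholders.
    conn = boto.connect_ec2('<aws-access-key>', '<aws-secret-key>')
    conn.attach_volume('vol-xxxxxxxx', 'i-xxxxxxxx', '/dev/sdh')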


On Tue, Jan 12, 2010 at 8:55 AM, Liang Chenmin <liangchenmin04@gmail.com> wrote:

> I first followed the tutorial about running Mahout on EMR; the command
> lines in it need some revision, though.
>
> On Mon, Jan 11, 2010 at 6:44 PM, deneche abdelhakim <a_deneche@yahoo.fr>
> wrote:
>
> > I used Cloudera's with Mahout to test the Decision Forest implementation.
> >
> > --- On Mon, 11 Jan 2010, Grant Ingersoll <gsingers@apache.org>
> > wrote:
> >
> > > From: Grant Ingersoll <gsingers@apache.org>
> > > Subject: Re: Re : Good starting instance for AMI
> > > To: mahout-user@lucene.apache.org
> > > Date: Monday, January 11, 2010, 8:51 PM
> > > One quick question for all who
> > > responded:
> > > How many have tried Mahout with the setup they
> > > recommended?
> > >
> > > -Grant
> > >
> > > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> > >
> > > > Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> > > >
> > > > I have used both to get Hadoop jobs up and running (although my EMR use
> > > > has mostly been limited to running batch Pig scripts weekly). Deciding
> > > > on which one to use really depends on what kind of job/data you're
> > > > working with.
> > > >
> > > > EMR is most useful if you're already storing the dataset you're using
> > > > on S3 and plan on running a one-off job. My understanding is that it's
> > > > configured to use jets3t to stream data from S3 rather than copying it
> > > > to the cluster, which is fine for a single pass over a small to
> > > > medium-sized dataset, but obviously slower for multiple passes or
> > > > larger datasets. The API is also useful if you have a set workflow that
> > > > you plan to run on a regular basis, and I often prototype quick and
> > > > dirty jobs on very small EMR clusters to test how some things run in
> > > > the wild (obviously not the most cost-effective solution, but I've
> > > > found that pseudo-distributed mode doesn't catch everything).
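> > > >
> > > > As a rough illustration, a one-off job flow like that can be kicked off
> > > > with boto's EMR support; a minimal sketch only (the bucket, jar, and
> > > > class names are placeholders, not an actual Mahout invocation):
> > > >
> > > >     from boto.emr.connection import EmrConnection
> > > >     from boto.emr.step import JarStep
> > > >
> > > >     # One step: run a job jar already sitting in S3 against input in S3.
> > > >     step = JarStep(name='example-step',
> > > >                    jar='s3n://my-bucket/my-job.jar',
> > > >                    main_class='com.example.MyDriver',
> > > >                    step_args=['s3n://my-bucket/input',
> > > >                               's3n://my-bucket/output'])
> > > >
> > > >     conn = EmrConnection('<aws-access-key>', '<aws-secret-key>')
> > > >     jobid = conn.run_jobflow(name='quick-n-dirty-test',
> > > >                              log_uri='s3n://my-bucket/logs',
> > > >                              steps=[step],
> > > >                              num_instances=2,
> > > >                              master_instance_type='m1.small',
> > > >                              slave_instance_type='m1.small')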
> > > >
> > > > CDH gives you greater control over the initial setup and configuration
> > > > of your cluster. From my understanding, it's not really an AMI. Rather,
> > > > it's a set of Python scripts modified from the ec2 scripts in
> > > > hadoop/contrib, with some nifty additions like being able to specify
> > > > and set up EBS volumes, proxy on the cluster, and some others. The
> > > > scripts use the boto Python module (a very useful Python module for
> > > > working with EC2) to ask EC2 to set up a cluster of the specified size
> > > > with whatever vanilla AMI is specified. They set up the security
> > > > groups, open the relevant ports, and then pass the init script to each
> > > > of the instances once they've booted (the same user-data mechanism,
> > > > which is limited to 16K I believe). The init script tells each node to
> > > > download Hadoop (from Cloudera's OS-specific repos) and any other
> > > > user-specified packages and set them up. The hadoop config XML is
> > > > hardcoded into the init script (although you can pass a modified config
> > > > beforehand). The master is started first, and then the slaves are
> > > > started so that the slaves can be given info about which NN and JT to
> > > > connect to (the config uses the public DNS, I believe, to make things
> > > > easier to set up). You can use either 0.18.3 (CDH) or 0.20 (CDH2) when
> > > > it comes to Hadoop versions, although I've had mixed results with the
> > > > latter.
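> > > >
> > > > The heart of what those scripts do can be sketched with boto directly;
> > > > this is only a rough outline (the AMI id, key pair, and security group
> > > > are placeholders), not the actual Cloudera launch code:
> > > >
> > > >     import boto
> > > >
> > > >     # Read the init script each node will run on boot; as noted above,
> > > >     # user-data has to stay under roughly 16K.
> > > >     user_data = open('hadoop-ec2-init-remote.sh').read()
> > > >
> > > >     conn = boto.connect_ec2('<aws-access-key>', '<aws-secret-key>')
> > > >
> > > >     # Start the master first from a vanilla AMI, passing the init
> > > >     # script as user-data.
> > > >     reservation = conn.run_instances(
> > > >         image_id='ami-xxxxxxxx',
> > > >         min_count=1, max_count=1,
> > > >         key_name='my-keypair',
> > > >         security_groups=['hadoop-master'],
> > > >         instance_type='m1.large',
> > > >         user_data=user_data)
> > > >
> > > >     master = reservation.instances[0]
> > > >     # ...wait for it to boot, then launch the slaves the same way with
> > > >     # the master's public DNS substituted into their user-data so they
> > > >     # know which NN and JT to connect to.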
> > > >
> > > > Personally, I'd still like some kind of facade or something similar to
> > > > further abstract things and make it easier for others to quickly set up
> > > > ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like
> > > > Crane have been released recently, but given the language of choice
> > > > (Clojure), I haven't yet had a chance to really investigate.
> > > >
> > > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <ted.dunning@gmail.com>
> > > > wrote:
> > > >
> > > >> Just use several of these files.
> > > >>
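> > > >> For example, splitting one big text input into several smaller S3
> > > >> objects is straightforward with boto; a minimal sketch (the bucket,
> > > >> key names, and chunk size are placeholders):
> > > >>
> > > >>     import boto
> > > >>
> > > >>     CHUNK = 512 * 1024 * 1024    # ~512 MB per piece, well under 5 GB
> > > >>
> > > >>     conn = boto.connect_s3('<aws-access-key>', '<aws-secret-key>')
> > > >>     bucket = conn.get_bucket('my-input-bucket')
> > > >>
> > > >>     part, buf, size = 0, [], 0
> > > >>     for line in open('bigdata.txt'):
> > > >>         buf.append(line)
> > > >>         size += len(line)
> > > >>         if size >= CHUNK:
> > > >>             # Flush the buffered lines as their own S3 object.
> > > >>             bucket.new_key('input/part-%05d' % part) \
> > > >>                   .set_contents_from_string(''.join(buf))
> > > >>             part, buf, size = part + 1, [], 0
> > > >>     if buf:
> > > >>         bucket.new_key('input/part-%05d' % part) \
> > > >>               .set_contents_from_string(''.join(buf))
> > > >>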
> > > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> > > >> <liangchenmin04@gmail.com> wrote:
> > > >>
> > > >>> EMR requires an S3 bucket, but S3 has a file size limit (5 GB), so
> > > >>> some extra care is needed here. Has anyone else run into the file
> > > >>> size problem on S3? I kind of think that it's unreasonable to have a
> > > >>> 5 GB size limit when we want to use the system to deal with large
> > > >>> data sets.
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Ted Dunning, CTO
> > > >> DeepDyve
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Zaki Rahaman
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem using Solr/Lucene:
> > > http://www.lucidimagination.com/search
> > >
> > >
> >
> >
> >
> >
>
>
> --
> Chenmin Liang
> Language Technologies Institute, School of Computer Science
> Carnegie Mellon University
>
