mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Re : Good starting instance for AMI
Date Mon, 18 Jan 2010 15:20:40 GMT
Perfect! We can have two AMIs: one for Mahout trunk and one for the Mahout release version.


On Mon, Jan 18, 2010 at 8:24 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little upgrade
> pain, both on an individual node and in a cluster.
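
For concreteness, a minimal sketch of what a bootstrap script for such an image might do, written as a small Python script. The package names, mirror URLs, version numbers, and install paths below are assumptions, not project requirements, and would need to match whatever base AMI is chosen:

    import subprocess

    def run(cmd):
        # Fail fast so a broken step shows up in the console log.
        subprocess.check_call(cmd, shell=True)

    # Assumed package names for a Fedora/CentOS-style base image.
    run("yum -y install java-1.6.0-openjdk-devel subversion")

    # Maven and Hadoop versions are placeholders; pin them to whatever
    # trunk currently expects.
    run("wget -q http://archive.apache.org/dist/maven/binaries/apache-maven-2.2.1-bin.tar.gz"
        " -O /tmp/maven.tar.gz && tar xzf /tmp/maven.tar.gz -C /opt")
    run("wget -q http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz"
        " -O /tmp/hadoop.tar.gz && tar xzf /tmp/hadoop.tar.gz -C /opt")

    # Check out Mahout trunk so the image can later be refreshed with 'svn up'.
    run("svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk /opt/mahout")
    run("cd /opt/mahout && /opt/apache-maven-2.2.1/bin/mvn -q -DskipTests install")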
>
> Is this the shortest path?  I don't have much experience w/ creating AMIs,
> but I want my work to be reusable by the community (remember, committers can
> get credits from Amazon for testing Mahout).
>
> After that, I want to convert some of the public datasets to vector format
> and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
> > I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs
> > Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is
> > wget Mahout's job files and the data from S3, and launch my job.
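
A minimal sketch of that fetch-and-launch step, assuming boto for the S3 download; the bucket, key names, and driver class are made up for illustration:

    import subprocess
    import boto

    # Hypothetical bucket and key layout.
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-mahout-bucket')
    bucket.get_key('jobs/mahout-examples-job.jar').get_contents_to_filename('mahout-job.jar')
    bucket.get_key('data/input.seq').get_contents_to_filename('input.seq')

    # Put the input on HDFS and launch the job; the driver class is a placeholder.
    subprocess.check_call(['hadoop', 'fs', '-put', 'input.seq', 'input/input.seq'])
    subprocess.check_call(['hadoop', 'jar', 'mahout-job.jar',
                           'org.apache.mahout.examples.SomeDriver',
                           '--input', 'input', '--output', 'output'])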
> >
> > --- On Tue, 12.1.10, deneche abdelhakim <a_deneche@yahoo.fr> wrote:
> >
> >> From: deneche abdelhakim <a_deneche@yahoo.fr>
> >> Subject: Re: Re : Good starting instance for AMI
> >> To: mahout-user@lucene.apache.org
> >> Date: Tuesday, January 12, 2010, 3:44 AM
> >> I used Cloudera's with Mahout to test
> >> the Decision Forest implementation.
> >>
> >> --- On Mon, 11.1.10, Grant Ingersoll <gsingers@apache.org> wrote:
> >>
> >>> From: Grant Ingersoll <gsingers@apache.org>
> >>> Subject: Re: Re : Good starting instance for AMI
> >>> To: mahout-user@lucene.apache.org
> >>> Date: Monday, January 11, 2010, 8:51 PM
> >>> One quick question for all who responded:
> >>> How many have tried Mahout with the setup they recommended?
> >>>
> >>> -Grant
> >>>
> >>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> >>>
> >>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> >>>>
> >>>> I have used both to get hadoop jobs up and running (although my EMR
> >>>> use has mostly been limited to running batch Pig scripts weekly).
> >>>> Deciding on which one to use really depends on what kind of job/data
> >>>> you're working with.
> >>>>
> >>>> EMR is most useful if you're already storing the dataset you're using
> >>>> on S3 and plan on running a one-off job. My understanding is that it's
> >>>> configured to use jets3t to stream data from S3 rather than copying it
> >>>> to the cluster, which is fine for a single pass over a small to medium
> >>>> sized dataset, but obviously slower for multiple passes or larger
> >>>> datasets. The API is also useful if you have a set workflow that you
> >>>> plan to run on a regular basis, and I often prototype quick and dirty
> >>>> jobs on very small EMR clusters to test how some things run in the wild
> >>>> (obviously not the most cost effective solution, but I've found
> >>>> pseudo-distributed mode doesn't catch everything).
> >>>>
> >>>> CDH gives you greater control over the initial setup and configuration
> >>>> of your cluster. From my understanding, it's not really an AMI. Rather,
> >>>> it's a set of Python scripts that's been modified from the ec2 scripts
> >>>> in hadoop/contrib, with some nifty additions like being able to specify
> >>>> and set up EBS volumes, proxy on the cluster, and some others. The
> >>>> scripts use the boto Python module (a very useful Python module for
> >>>> working with EC2) to make a request to EC2 to set up a cluster of the
> >>>> specified size with whatever vanilla AMI is specified. It sets up the
> >>>> security groups and opens up the relevant ports, and it then passes the
> >>>> init script to each of the instances once they've booted (the same
> >>>> user-data file setup, which is limited to 16K I believe). The init
> >>>> script tells each node to download hadoop (from Cloudera's OS-specific
> >>>> repos) and any other user-specified packages and set them up. The
> >>>> hadoop config xml is hardcoded into the init script (although you can
> >>>> pass a modified config beforehand). The master is started first, and
> >>>> then the slaves are started so that the slaves can be given info about
> >>>> what NN and JT to connect to (the config uses the public DNS, I
> >>>> believe, to make things easier to set up). You can use either 0.18.3
> >>>> (CDH) or 0.20 (CDH2) when it comes to Hadoop versions, although I've
> >>>> had mixed results with the latter.
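
Stripped of the EBS and proxy extras, the core of what those scripts do maps onto a handful of boto calls. This is only a sketch of that flow; the AMI id, key pair, group name, instance counts, and init-script filename are all placeholders:

    import boto

    conn = boto.connect_ec2()

    # Security group for the cluster; the real scripts also open the NN/JT/DFS ports.
    group = conn.create_security_group('hadoop-cluster', 'ad-hoc hadoop cluster')
    group.authorize('tcp', 22, 22, '0.0.0.0/0')

    # Init script passed as user-data (subject to the ~16K limit mentioned above).
    user_data = open('hadoop-init-remote.sh').read()

    # Master first; in practice you would poll until it is running and has a
    # public DNS name before starting the slaves.
    master = conn.run_instances('ami-xxxxxxxx', key_name='my-key',
                                security_groups=['hadoop-cluster'],
                                instance_type='m1.large',
                                user_data=user_data).instances[0]

    # Slaves get the master's address appended to their init script so they can
    # find the NameNode/JobTracker (simplified; the real scripts template this in).
    slave_data = user_data + '\nMASTER_HOST=%s\n' % master.public_dns_name
    conn.run_instances('ami-xxxxxxxx', min_count=5, max_count=5,
                       key_name='my-key', security_groups=['hadoop-cluster'],
                       instance_type='m1.large', user_data=slave_data)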
> >>>>
> >>>> Personally, I'd still like some kind of facade or something similar to
> >>>> further abstract things and make it easier for others to quickly set up
> >>>> ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like
> >>>> Crane have been released recently, but given the language of choice
> >>>> (Clojure), I haven't yet had a chance to really investigate.
> >>>>
> >>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <ted.dunning@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Just use several of these files.
> >>>>>
> >>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> >>>>> <liangchenmin04@gmail.com> wrote:
> >>>>>
> >>>>>> EMR requires an S3 bucket, but S3 objects have a file size limit
> >>>>>> (5 GB), so some extra care is needed here. Has anyone else
> >>>>>> encountered the file size problem on S3? I kind of think that it's
> >>>>>> unreasonable to have a 5 GB size limit when we want to use the
> >>>>>> system to deal with large data sets.
> >>>>>>
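
A sketch of Ted's suggestion above: write the dataset as several part files, each comfortably under the per-object limit, and upload each one as its own key under a common prefix. The bucket name and local layout are made up for illustration, and Hadoop/EMR will read the whole prefix as a single input:

    import glob
    import os
    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-mahout-bucket')   # placeholder bucket name

    # 'parts/' holds the pre-split files, e.g. produced with 'split -l' for
    # text input or by having the generating job write multiple part files.
    for path in sorted(glob.glob('parts/part-*')):
        key = bucket.new_key('data/big-dataset/%s' % os.path.basename(path))
        key.set_contents_from_filename(path)

    # Point the job at s3n://my-mahout-bucket/data/big-dataset/ and all the
    # part files are treated as one input.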
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Ted Dunning, CTO
> >>>>> DeepDyve
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Zaki Rahaman
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com/
> >>>
> >>> Search the Lucene ecosystem using Solr/Lucene:
> >>> http://www.lucidimagination.com/search
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> >
>
>
