mahout-user mailing list archives

From: Drew Farris <drew.far...@gmail.com>
Subject: Re: Re : Good starting instance for AMI
Date: Mon, 18 Jan 2010 15:07:31 GMT
Sounds great.

It might also be handy to include a local Maven repository on the AMI,
pre-populated with the build dependencies, to shorten the build time.
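Something along those lines could be done with a tiny script run while baking the
image. Just a sketch; the checkout URL and paths are guesses on my part:

#!/usr/bin/env python
# Sketch: pre-populate ~/.m2 while baking the AMI so the first "mvn install"
# on a fresh instance downloads far less. URL and paths are assumptions.
import subprocess

MAHOUT_SVN = "http://svn.apache.org/repos/asf/lucene/mahout/trunk"  # assumed trunk location
CHECKOUT_DIR = "/mnt/mahout-trunk"

subprocess.check_call(["svn", "checkout", MAHOUT_SVN, CHECKOUT_DIR])
# Resolve all plugins and dependencies into the local repo without building.
subprocess.check_call(["mvn", "-q", "dependency:go-offline"], cwd=CHECKOUT_DIR)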

I wonder if the CDH2 AMIs could be used as a starting point? I'm not sure
whether you're allowed to unbundle and modify public AMIs. Starting from
scratch would certainly be more difficult.

Amazon hosts some public datasets for free:
http://aws.amazon.com/publicdatasets/
Perhaps the Mahout test data in vector form could be bundled up into a
snapshot that could be reused by anyone.
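(If such a snapshot existed, pulling it onto a node would only take a couple of
boto calls; the snapshot id, instance id, and zone below are made up.)

#!/usr/bin/env python
# Sketch: create a volume from a (hypothetical) public EBS snapshot of the
# vectorized test data and attach it to a running instance.
from boto.ec2.connection import EC2Connection

conn = EC2Connection()  # credentials come from the environment

vol = conn.create_volume(size=50, zone="us-east-1a", snapshot="snap-12345678")
conn.attach_volume(vol.id, "i-12345678", "/dev/sdf")
# ...then mount /dev/sdf on the instance and point the Mahout jobs at it.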

On Mon, Jan 18, 2010 at 9:54 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little upgrade pain, both on
> an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience with creating AMIs, but I want
> my work to be reusable by the community (remember, committers can get credits from Amazon
> for testing Mahout).
>
> After that, I want to convert some of the public datasets to vector format and run some
> performance benchmarks.
>
> Thoughts?
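For the dataset-to-vectors step at the end, the on-AMI script might look roughly
like this. The paths are placeholders, and the seqdirectory/seq2sparse driver
names and flags should be checked against trunk before relying on them:

#!/usr/bin/env python
# Rough sketch of the conversion step: plain text -> SequenceFile -> sparse
# vectors. Every path here is a placeholder.
import subprocess

MAHOUT = "/mnt/mahout-trunk/bin/mahout"     # assumed checkout location
RAW    = "/mnt/data/public-dataset"         # hypothetical input directory
SEQ    = "/mnt/data/public-dataset-seq"
VECS   = "/mnt/data/public-dataset-vectors"

subprocess.check_call([MAHOUT, "seqdirectory", "-i", RAW, "-o", SEQ])
subprocess.check_call([MAHOUT, "seq2sparse", "-i", SEQ, "-o", VECS])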
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
>> I'm using Cloudera's with a 5-node cluster (plus 1 master node) that runs Hadoop 0.20+.
>> Hadoop is pre-installed and configured; all I have to do is wget Mahout's job files and
>> the data from S3, and launch my job.
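(That per-run workflow can be as small as the sketch below; the S3 URLs, HDFS
paths, and driver class are invented, not real locations.)

#!/usr/bin/env python
# Sketch of the fetch-and-launch step described above. URLs, paths, and the
# job driver class are placeholders only.
import subprocess

JOB_JAR = "http://my-bucket.s3.amazonaws.com/mahout-examples.job"  # hypothetical
DATA    = "http://my-bucket.s3.amazonaws.com/input/data.seq"       # hypothetical

subprocess.check_call(["wget", "-q", JOB_JAR, "-O", "mahout-examples.job"])
subprocess.check_call(["wget", "-q", DATA, "-O", "data.seq"])
subprocess.call(["hadoop", "fs", "-mkdir", "input"])   # ignore error if it exists
subprocess.check_call(["hadoop", "fs", "-put", "data.seq", "input/data.seq"])
subprocess.check_call(["hadoop", "jar", "mahout-examples.job",
                       "org.example.SomeMahoutDriver",  # placeholder driver class
                       "input/data.seq", "output"])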
>>
>> --- On Tue, 1/12/10, deneche abdelhakim <a_deneche@yahoo.fr> wrote:
>>
>>> From: deneche abdelhakim <a_deneche@yahoo.fr>
>>> Subject: Re: Re : Good starting instance for AMI
>>> To: mahout-user@lucene.apache.org
>>> Date: Tuesday, January 12, 2010, 3:44 AM
>>> I used Cloudera's with Mahout to test
>>> the Decision Forest implementation.
>>>
>>> --- On Mon, 1/11/10, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>>> From: Grant Ingersoll <gsingers@apache.org>
>>>> Subject: Re: Re : Good starting instance for AMI
>>>> To: mahout-user@lucene.apache.org
>>>> Date: Monday, January 11, 2010, 8:51 PM
>>>> One quick question for all who
>>>> responded:
>>>> How many have tried Mahout with the setup they
>>>> recommended?
>>>>
>>>> -Grant
>>>>
>>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>>
>>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>>>>>
>>>>> I have used both to get hadoop jobs up and running (although my EMR use has
>>>>> mostly been limited to running batch Pig scripts weekly). Deciding on which
>>>>> one to use really depends on what kind of job/data you're working with.
>>>>>
>>>>> EMR is most useful if you're already storing the dataset you're using on S3
>>>>> and plan on running a one-off job. My understanding is that it's configured
>>>>> to use jets3t to stream data from S3 rather than copying it to the cluster,
>>>>> which is fine for a single pass over a small to medium sized dataset, but
>>>>> obviously slower for multiple passes or larger datasets. The API is also
>>>>> useful if you have a set workflow that you plan to run on a regular basis,
>>>>> and I often prototype quick and dirty jobs on very small EMR clusters to
>>>>> test how some things run in the wild (obviously not the most cost effective
>>>>> solution, but I've found pseudo-distributed mode doesn't catch everything).
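(For the multiple-pass case, a common workaround is to stage the S3 data into
HDFS once with distcp and run the iterative passes against HDFS. The bucket and
paths below are made up.)

#!/usr/bin/env python
# Sketch: copy S3 input into HDFS once so iterative jobs (e.g. repeated k-means
# passes) don't re-stream it from S3 on every iteration.
import subprocess

S3_INPUT  = "s3n://my-mahout-bucket/vectors"   # hypothetical bucket
HDFS_PATH = "hdfs:///user/hadoop/vectors"

subprocess.check_call(["hadoop", "distcp", S3_INPUT, HDFS_PATH])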
>>>>>
>>>>> CDH gives you greater control over the initial setup and configuration of
>>>>> your cluster. From my understanding, it's not really an AMI. Rather, it's a
>>>>> set of Python scripts that's been modified from the ec2 scripts from
>>>>> hadoop/contrib with some nifty additions like being able to specify and set
>>>>> up EBS volumes, proxy on the cluster, and some others. The scripts use the
>>>>> boto Python module (a very useful Python module for working with EC2) to
>>>>> make a request to EC2 to set up a cluster of the specified size with whatever
>>>>> vanilla AMI is specified. It sets up the security groups and opens up
>>>>> the relevant ports, and it then passes the init script to each of the
>>>>> instances once they've booted (the same user-data file setup, which is
>>>>> limited to 16K I believe). The init script tells each node to download
>>>>> Hadoop (from Cloudera's OS-specific repos) and any other user-specified
>>>>> packages and set them up. The Hadoop config XML is hardcoded into the init
>>>>> script (although you can pass a modified config beforehand). The master is
>>>>> started first, and then the slaves are started so that the slaves can be
>>>>> given info about which NN and JT to connect to (the config uses the public
>>>>> DNS, I believe, to make things easier to set up). You can use either 0.18.3
>>>>> (CDH) or 0.20 (CDH2) when it comes to Hadoop versions, although I've had
>>>>> mixed results with the latter.
>>>>>
>>>>> Personally, I'd still like some kind of facade or something similar to
>>>>> further abstract things and make it easier for others to quickly set up
>>>>> ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like Crane
>>>>> have been released recently, but given the language of choice (Clojure), I
>>>>> haven't yet had a chance to really investigate.
>>>>>
>>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>
>>>>>> Just use several of these files.
>>>>>>
>>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <liangchenmin04@gmail.com> wrote:
>>>>>>
>>>>>>> EMR requires an S3 bucket, but S3 objects have a file size limit (5 GB),
>>>>>>> so some extra care is needed here. Has anyone else encountered the file
>>>>>>> size problem on S3? I kind of think that it's unreasonable to have a 5 GB
>>>>>>> size limit when we want to use the system to deal with large data sets.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>>
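(A minimal sketch of Ted's suggestion: write the input as several files that
each stay under the 5 GB object limit, then upload each part as its own S3 key.
The chunk size and paths are placeholders; for binary formats like SequenceFiles
you'd write several files from the start instead of splitting bytes afterwards.)

#!/usr/bin/env python
# Sketch: split a large line-oriented input into parts below the 5 GB S3
# object limit, keeping records intact by splitting on line boundaries.
CHUNK = 4 * 1024 ** 3              # ~4 GB per part, comfortably under 5 GB
SRC = "/mnt/data/big-input.txt"    # hypothetical local file

def split(src, chunk=CHUNK):
    n, out, written = 0, None, 0
    with open(src, "rb") as f:
        for line in f:
            if out is None or written + len(line) > chunk:
                if out:
                    out.close()
                out = open("%s.part-%04d" % (src, n), "wb")
                n, written = n + 1, 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return n

if __name__ == "__main__":
    print("wrote %d parts" % split(SRC))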
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
