From: Drew Farris <drew.farris@gmail.com>
To: mahout-user@lucene.apache.org
Date: Mon, 18 Jan 2010 10:07:31 -0500
Subject: Re: Re : Good starting instance for AMI

Sounds great. It might be handy to include with the AMI a local maven
repo pre-populated with build dependencies, to shorten the build time
as well.

I wonder if the CDH2 AMIs could be used as a starting point? I'm not
sure whether you're allowed to unbundle and modify public AMIs, but
starting from scratch would certainly be more difficult.

Amazon hosts some public datasets for free:
http://aws.amazon.com/publicdatasets/

Perhaps the Mahout test data in vector form could be bundled up into a
snapshot that could be re-used by anyone.
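Something like this could warm the local repo at image-build time (a
rough, untested sketch; the paths are placeholders and the trunk URL
is from memory):

    #!/usr/bin/env python
    # Check out Mahout and pre-fetch its build dependencies into a
    # local Maven repo that gets baked into the AMI.
    import subprocess

    CHECKOUT = "/mnt/mahout-trunk"  # hypothetical locations
    REPO = "/mnt/m2-repo"

    subprocess.check_call(
        ["svn", "checkout",
         "http://svn.apache.org/repos/asf/lucene/mahout/trunk", CHECKOUT])

    # dependency:go-offline pulls everything the build needs into REPO,
    # so later builds on a running instance can use 'mvn -o' (offline).
    subprocess.check_call(
        ["mvn", "-Dmaven.repo.local=" + REPO, "dependency:go-offline"],
        cwd=CHECKOUT)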
On Mon, Jan 18, 2010 at 9:54 AM, Grant Ingersoll wrote:
> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little
> upgrade pain, both on an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience w/ creating
> AMIs, but I want my work to be reusable by the community (remember,
> committers can get credits from Amazon for testing Mahout).
>
> After that, I want to convert some of the public datasets to vector
> format and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
>> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that
>> runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I
>> have to do is wget the Mahout job files and the data from S3, and
>> launch my job.
>>
>> --- On Tue 1/12/10, deneche abdelhakim wrote:
>>
>>> From: deneche abdelhakim
>>> Subject: Re: Re : Good starting instance for AMI
>>> To: mahout-user@lucene.apache.org
>>> Date: Tuesday, 12 January 2010, 3:44 AM
>>> I used Cloudera's with Mahout to test the Decision Forest
>>> implementation.
>>>
>>> --- On Mon 1/11/10, Grant Ingersoll wrote:
>>>
>>>> From: Grant Ingersoll
>>>> Subject: Re: Re : Good starting instance for AMI
>>>> To: mahout-user@lucene.apache.org
>>>> Date: Monday, 11 January 2010, 8:51 PM
>>>> One quick question for all who responded:
>>>> How many have tried Mahout with the setup they recommended?
>>>>
>>>> -Grant
>>>>
>>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>>
>>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce
>>>>> (EMR).
>>>>>
>>>>> I have used both to get hadoop jobs up and running (although my
>>>>> EMR use has mostly been limited to running batch Pig scripts
>>>>> weekly). Deciding on which one to use really depends on what
>>>>> kind of job/data you're working with.
>>>>>
>>>>> EMR is most useful if you're already storing the dataset you're
>>>>> using on S3 and plan on running a one-off job. My understanding
>>>>> is that it's configured to use jets3t to stream data from S3
>>>>> rather than copying it to the cluster, which is fine for a
>>>>> single pass over a small to medium sized dataset, but obviously
>>>>> slower for multiple passes or larger datasets. The API is also
>>>>> useful if you have a set workflow that you plan to run on a
>>>>> regular basis, and I often prototype quick and dirty jobs on
>>>>> very small EMR clusters to test how some things run in the wild
>>>>> (obviously not the most cost-effective solution, but I've found
>>>>> pseudo-distributed mode doesn't catch everything).
>>>>>
>>>>> CDH gives you greater control over the initial setup and
>>>>> configuration of your cluster. From my understanding, it's not
>>>>> really an AMI. Rather, it's a set of Python scripts modified
>>>>> from the ec2 scripts in hadoop/contrib, with some nifty
>>>>> additions like being able to specify and set up EBS volumes,
>>>>> proxy on the cluster, and some others. The scripts use the boto
>>>>> Python module (a very useful module for working with EC2) to
>>>>> ask EC2 for a cluster of the specified size built from whatever
>>>>> vanilla AMI is specified. It sets up the security groups, opens
>>>>> the relevant ports, and then passes the init script to each of
>>>>> the instances once they've booted (the same user-data file
>>>>> setup, which is limited to 16K, I believe). The init script
>>>>> tells each node to download Hadoop (from Cloudera's OS-specific
>>>>> repos) and any other user-specified packages and set them up.
>>>>> The hadoop config XML is hardcoded into the init script
>>>>> (although you can pass a modified config beforehand). The
>>>>> master is started first, and then the slaves, so that the
>>>>> slaves can be given info about which NN and JT to connect to
>>>>> (the config uses the public DNS, I believe, to make things
>>>>> easier to set up). You can use either 0.18.3 (CDH) or 0.20
>>>>> (CDH2) when it comes to Hadoop versions, although I've had
>>>>> mixed results with the latter.
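>>>>>
>>>>> The kernel of what those scripts do looks roughly like this (a
>>>>> compressed, untested sketch; the boto calls are from memory,
>>>>> and the group name, ports, keypair, and AMI id are made up):
>>>>>
>>>>>     import boto
>>>>>
>>>>>     conn = boto.connect_ec2()  # credentials from the environment
>>>>>
>>>>>     # Security group for the cluster: ssh, the JobTracker web
>>>>>     # UI, and unrestricted traffic between cluster members.
>>>>>     group = conn.create_security_group('hadoop-cluster',
>>>>>                                        'ad-hoc hadoop cluster')
>>>>>     group.authorize('tcp', 22, 22, '0.0.0.0/0')
>>>>>     group.authorize('tcp', 50030, 50030, '0.0.0.0/0')
>>>>>     group.authorize(src_group=group)
>>>>>
>>>>>     # The init script rides along as user-data (this is where
>>>>>     # the ~16K cap applies) and runs when the instance boots.
>>>>>     user_data = open('hadoop-init.sh').read()
>>>>>
>>>>>     # Master first, so its address can be substituted into the
>>>>>     # user-data handed to the slaves launched afterwards.
>>>>>     reservation = conn.run_instances('ami-12345678',
>>>>>                                      key_name='my-keypair',
>>>>>                                      security_groups=['hadoop-cluster'],
>>>>>                                      instance_type='m1.large',
>>>>>                                      user_data=user_data)
>>>>>     master = reservation.instances[0]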
>>>>>
>>>>> Personally, I'd still like some kind of facade or something
>>>>> similar to further abstract things and make it easier for
>>>>> others to quickly set up ad-hoc clusters for 'quick n dirty'
>>>>> jobs. I know other libraries like Crane have been released
>>>>> recently, but given the language of choice (Clojure), I haven't
>>>>> yet had a chance to really investigate.
>>>>>
>>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning wrote:
>>>>>
>>>>>> Just use several of these files.
>>>>>>
>>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin wrote:
>>>>>>
>>>>>>> EMR requires an S3 bucket, but S3 objects have a size limit
>>>>>>> (5GB), so some extra care is needed here. Has anyone else
>>>>>>> encountered the file-size problem on S3? I kind of think it's
>>>>>>> unreasonable to have a 5GB size limit when we want to use the
>>>>>>> system to deal with large datasets.
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
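P.S. On the 5GB S3 object limit that comes up at the end of the
thread: splitting on line boundaries (rather than raw byte offsets)
keeps each part directly usable as TextInputFormat input. A quick,
untested sketch; the 4GB threshold is just an arbitrary safety margin:

    # Split a large text dataset into parts that fit under S3's 5GB
    # object cap, breaking only at line boundaries so no record is
    # torn across two parts.
    LIMIT = 4 * 1024 ** 3  # bytes per part

    def split_by_lines(path):
        part, written, out = 0, 0, None
        for line in open(path, 'rb'):
            if out is None or written >= LIMIT:
                if out:
                    out.close()
                out = open('%s.part-%04d' % (path, part), 'wb')
                part += 1
                written = 0
            out.write(line)
            written += len(line)
        if out:
            out.close()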