From: Drew Farris <drew.farris@gmail.com>
To: mahout-user@lucene.apache.org
Date: Mon, 18 Jan 2010 10:07:31 -0500
Subject: Re: Re : Good starting instance for AMI

Sounds great. It might be handy to include with the AMI a local maven
repo pre-populated with build dependencies, to shorten the build time
as well.

I wonder if the CDH2 AMIs could be used as a starting point? I'm not
sure whether you're allowed to unbundle and modify public AMIs, but
starting from scratch would certainly be more difficult.

Amazon hosts some public datasets for free:
http://aws.amazon.com/publicdatasets/

Perhaps the Mahout test data in vector form could be bundled up into a
snapshot that could be re-used by anyone.
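Something like this could warm the local repo at image-build time (a
rough, untested sketch; the paths are placeholders and the trunk URL
is from memory):

    #!/usr/bin/env python
    # Check out Mahout and pre-fetch its build dependencies into a
    # local Maven repo that gets baked into the AMI.
    import subprocess

    CHECKOUT = "/mnt/mahout-trunk"  # hypothetical locations
    REPO = "/mnt/m2-repo"

    subprocess.check_call(
        ["svn", "checkout",
         "http://svn.apache.org/repos/asf/lucene/mahout/trunk", CHECKOUT])

    # dependency:go-offline pulls everything the build needs into REPO,
    # so later builds on a running instance can use 'mvn -o' (offline).
    subprocess.check_call(
        ["mvn", "-Dmaven.repo.local=" + REPO, "dependency:go-offline"],
        cwd=CHECKOUT)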
On Mon, Jan 18, 2010 at 9:54 AM, Grant Ingersoll wrote:
> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little
> upgrade pain, both on an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience w/ creating
> AMIs, but I want my work to be reusable by the community (remember,
> committers can get credits from Amazon for testing Mahout).
>
> After that, I want to convert some of the public datasets to vector
> format and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
>> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that
>> runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I
>> have to do is wget the Mahout job files and the data from S3, and
>> launch my job.
>>
>> --- On Tue 1/12/10, deneche abdelhakim wrote:
>>
>>> From: deneche abdelhakim
>>> Subject: Re: Re : Good starting instance for AMI
>>> To: mahout-user@lucene.apache.org
>>> Date: Tuesday, 12 January 2010, 3:44 AM
>>> I used Cloudera's with Mahout to test the Decision Forest
>>> implementation.
>>>
>>> --- On Mon 1/11/10, Grant Ingersoll wrote:
>>>
>>>> From: Grant Ingersoll
>>>> Subject: Re: Re : Good starting instance for AMI
>>>> To: mahout-user@lucene.apache.org
>>>> Date: Monday, 11 January 2010, 8:51 PM
>>>> One quick question for all who responded:
>>>> How many have tried Mahout with the setup they recommended?
>>>>
>>>> -Grant
>>>>
>>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>>
>>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce
>>>>> (EMR).
>>>>>
>>>>> I have used both to get hadoop jobs up and running (although my
>>>>> EMR use has mostly been limited to running batch Pig scripts
>>>>> weekly). Deciding on which one to use really depends on what
>>>>> kind of job/data you're working with.
>>>>>
>>>>> EMR is most useful if you're already storing the dataset you're
>>>>> using on S3 and plan on running a one-off job. My understanding
>>>>> is that it's configured to use jets3t to stream data from S3
>>>>> rather than copying it to the cluster, which is fine for a
>>>>> single pass over a small to medium sized dataset, but obviously
>>>>> slower for multiple passes or larger datasets. The API is also
>>>>> useful if you have a set workflow that you plan to run on a
>>>>> regular basis, and I often prototype quick and dirty jobs on
>>>>> very small EMR clusters to test how some things run in the wild
>>>>> (obviously not the most cost-effective solution, but I've found
>>>>> pseudo-distributed mode doesn't catch everything).
>>>>>
>>>>> CDH gives you greater control over the initial setup and
>>>>> configuration of your cluster. From my understanding, it's not
>>>>> really an AMI. Rather, it's a set of Python scripts modified
>>>>> from the ec2 scripts in hadoop/contrib, with some nifty
>>>>> additions like being able to specify and set up EBS volumes,
>>>>> proxy on the cluster, and some others. The scripts use the boto
>>>>> Python module (a very useful module for working with EC2) to
>>>>> ask EC2 for a cluster of the specified size built from whatever
>>>>> vanilla AMI is specified. It sets up the security groups, opens
>>>>> the relevant ports, and then passes the init script to each of
>>>>> the instances once they've booted (the same user-data file
>>>>> setup, which is limited to 16K, I believe). The init script
>>>>> tells each node to download Hadoop (from Cloudera's OS-specific
>>>>> repos) and any other user-specified packages and set them up.
>>>>> The hadoop config XML is hardcoded into the init script
>>>>> (although you can pass a modified config beforehand). The
>>>>> master is started first, and then the slaves, so that the
>>>>> slaves can be given info about which NN and JT to connect to
>>>>> (the config uses the public DNS, I believe, to make things
>>>>> easier to set up). You can use either 0.18.3 (CDH) or 0.20
>>>>> (CDH2) when it comes to Hadoop versions, although I've had
>>>>> mixed results with the latter.
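>>>>>
>>>>> The kernel of what those scripts do looks roughly like this (a
>>>>> compressed, untested sketch; the boto calls are from memory,
>>>>> and the group name, ports, keypair, and AMI id are made up):
>>>>>
>>>>>     import boto
>>>>>
>>>>>     conn = boto.connect_ec2()  # credentials from the environment
>>>>>
>>>>>     # Security group for the cluster: ssh, the JobTracker web
>>>>>     # UI, and unrestricted traffic between cluster members.
>>>>>     group = conn.create_security_group('hadoop-cluster',
>>>>>                                        'ad-hoc hadoop cluster')
>>>>>     group.authorize('tcp', 22, 22, '0.0.0.0/0')
>>>>>     group.authorize('tcp', 50030, 50030, '0.0.0.0/0')
>>>>>     group.authorize(src_group=group)
>>>>>
>>>>>     # The init script rides along as user-data (this is where
>>>>>     # the ~16K cap applies) and runs when the instance boots.
>>>>>     user_data = open('hadoop-init.sh').read()
>>>>>
>>>>>     # Master first, so its address can be substituted into the
>>>>>     # user-data handed to the slaves launched afterwards.
>>>>>     reservation = conn.run_instances('ami-12345678',
>>>>>                                      key_name='my-keypair',
>>>>>                                      security_groups=['hadoop-cluster'],
>>>>>                                      instance_type='m1.large',
>>>>>                                      user_data=user_data)
>>>>>     master = reservation.instances[0]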
>>>>>
>>>>> Personally, I'd still like some kind of facade or something
>>>>> similar to further abstract things and make it easier for
>>>>> others to quickly set up ad-hoc clusters for 'quick n dirty'
>>>>> jobs. I know other libraries like Crane have been released
>>>>> recently, but given the language of choice (Clojure), I haven't
>>>>> yet had a chance to really investigate.
>>>>>
>>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning wrote:
>>>>>
>>>>>> Just use several of these files.
>>>>>>
>>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin wrote:
>>>>>>
>>>>>>> EMR requires an S3 bucket, but S3 objects have a size limit
>>>>>>> (5GB), so some extra care is needed here. Has anyone else
>>>>>>> encountered the file-size problem on S3? I kind of think it's
>>>>>>> unreasonable to have a 5GB size limit when we want to use the
>>>>>>> system to deal with large datasets.
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
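P.S. On the 5GB S3 object limit that comes up at the end of the
thread: splitting on line boundaries (rather than raw byte offsets)
keeps each part directly usable as TextInputFormat input. A quick,
untested sketch; the 4GB threshold is just an arbitrary safety margin:

    # Split a large text dataset into parts that fit under S3's 5GB
    # object cap, breaking only at line boundaries so no record is
    # torn across two parts.
    LIMIT = 4 * 1024 ** 3  # bytes per part

    def split_by_lines(path):
        part, written, out = 0, 0, None
        for line in open(path, 'rb'):
            if out is None or written >= LIMIT:
                if out:
                    out.close()
                out = open('%s.part-%04d' % (path, part), 'wb')
                part += 1
                written = 0
            out.write(line)
            written += len(line)
        if out:
            out.close()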