From: Robin Anil
Date: Tue, 12 Jan 2010 09:03:05 +0530
Subject: Re: Re : Good starting instance for AMI
To: mahout-user@lucene.apache.org

Since I don't have a personal Linux box these days, I code in Eclipse on
Windows, then fire up an instance, attach the EBS volume, and patch and test
my code there. And yes, I have only tried a single node so far.

On Tue, Jan 12, 2010 at 8:55 AM, Liang Chenmin wrote:

> I first followed the tutorial about running Mahout on EMR; the command
> lines in it needed some revision, though.
>
> On Mon, Jan 11, 2010 at 6:44 PM, deneche abdelhakim wrote:
>
> > I used Cloudera's distribution with Mahout to test the Decision Forest
> > implementation.
> >
> > --- On Mon, 11 Jan 2010, Grant Ingersoll wrote:
> >
> > > From: Grant Ingersoll
> > > Subject: Re: Re : Good starting instance for AMI
> > > To: mahout-user@lucene.apache.org
> > > Date: Monday, 11 January 2010, 20:51
> > >
> > > One quick question for all who responded: how many have tried Mahout
> > > with the setup they recommended?
> > >
> > > -Grant
> > >
> > > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> > >
> > > > Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> > > >
> > > > I have used both to get Hadoop jobs up and running (although my EMR
> > > > use has mostly been limited to running batch Pig scripts weekly).
> > > > Deciding which one to use really depends on what kind of job/data
> > > > you're working with.
> > > >
> > > > EMR is most useful if you're already storing your dataset on S3 and
> > > > plan on running a one-off job. My understanding is that it's
> > > > configured to use jets3t to stream data from S3 rather than copying
> > > > it to the cluster, which is fine for a single pass over a small to
> > > > medium sized dataset, but obviously slower for multiple passes or
> > > > larger datasets. The API is also useful if you have a set workflow
> > > > that you plan to run on a regular basis, and I often prototype quick
> > > > and dirty jobs on very small EMR clusters to test how things run in
> > > > the wild (obviously not the most cost-effective solution, but I've
> > > > found that pseudo-distributed mode doesn't catch everything).
> > > >
> > > > CDH gives you greater control over the initial setup and
> > > > configuration of your cluster. From my understanding, it's not
> > > > really an AMI. Rather, it's a set of Python scripts modified from
> > > > the ec2 scripts in hadoop/contrib, with some nifty additions like
> > > > being able to specify and set up EBS volumes, proxy on the cluster,
> > > > and a few others. The scripts use the boto Python module (a very
> > > > useful module for working with EC2) to ask EC2 for a cluster of the
> > > > specified size built from whatever vanilla AMI is specified. They
> > > > set up the security groups, open the relevant ports, and then pass
> > > > the init script to each of the instances once they've booted (the
> > > > same user-data file setup, which is limited to 16K, I believe).
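[Editor's note: the 16K user-data ceiling mentioned above is easy to bump into when the init script embeds a full Hadoop config. A minimal sketch of one common workaround, gzip-plus-base64 packing the script and checking it still fits; the script content and the exact 16K figure are taken as stated in the post, not verified against current EC2 limits.]

```python
import base64
import gzip
import io

# Limit quoted in the post: EC2 user-data is capped around 16K.
USER_DATA_LIMIT = 16 * 1024

def pack_user_data(script_text):
    """Gzip then base64-encode an init script -- one way to shrink
    user-data before handing it to EC2 (the instance must unpack it)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(script_text.encode("utf-8"))
    return base64.b64encode(buf.getvalue())

# Hypothetical init script of the kind the CDH launch scripts pass along.
init_script = "#!/bin/bash\napt-get install -y hadoop-0.20\n" * 50
packed = pack_user_data(init_script)
print(len(packed), len(packed) <= USER_DATA_LIMIT)
```

Repetitive shell scripts compress well, so this usually buys a lot of headroom; the trade-off is that the first thing the instance runs must be a tiny bootstrap that decodes and unzips the payload.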
> > > > The init script tells each node to download Hadoop (from Cloudera's
> > > > OS-specific repos) and any other user-specified packages and set
> > > > them up. The Hadoop config XML is hardcoded into the init script
> > > > (although you can pass a modified config beforehand). The master is
> > > > started first, and then the slaves, so that the slaves can be told
> > > > which NameNode and JobTracker to connect to (the config uses the
> > > > public DNS, I believe, to make things easier to set up). You can use
> > > > either Hadoop 0.18.3 (CDH) or 0.20 (CDH2), although I've had mixed
> > > > results with the latter.
> > > >
> > > > Personally, I'd still like some kind of facade or something similar
> > > > to further abstract things and make it easier for others to quickly
> > > > set up ad-hoc clusters for quick-n-dirty jobs. I know other
> > > > libraries like Crane have been released recently, but given the
> > > > language of choice (Clojure), I haven't yet had a chance to really
> > > > investigate.
> > > >
> > > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning wrote:
> > > >
> > > >> Just use several of these files.
> > > >>
> > > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin wrote:
> > > >>
> > > >>> EMR requires an S3 bucket, but S3 objects have a size limit (5 GB),
> > > >>> so some extra care is needed here. Has anyone else run into the
> > > >>> file size problem on S3? I find it somewhat unreasonable to have a
> > > >>> 5 GB limit when we want to use the system to deal with large
> > > >>> datasets.
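[Editor's note: Ted's "just use several of these files" amounts to chunking the dataset below the per-object limit before uploading to S3. A rough stdlib sketch, with illustrative file names and chunk size:]

```python
import os

def split_file(path, chunk_size, out_dir="."):
    """Split a large file into numbered parts no bigger than chunk_size
    bytes each, so every part stays under S3's per-object limit
    (5 GB at the time of this thread)."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part_path = os.path.join(
                out_dir, "%s.part-%05d" % (os.path.basename(path), index))
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts
```

In practice you would pick a chunk size comfortably under 5 GB; note also that this cuts at raw byte offsets, so for line-oriented MapReduce input you would want to extend it to cut on newline boundaries instead of splitting a record across two objects.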
> > > >>>
> > > >>
> > > >> --
> > > >> Ted Dunning, CTO
> > > >> DeepDyve
> > > >
> > > > --
> > > > Zaki Rahaman
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem using Solr/Lucene:
> > > http://www.lucidimagination.com/search
>
> --
> Chenmin Liang
> Language Technologies Institute, School of Computer Science
> Carnegie Mellon University