From: Saurabh Gokhale
Date: Sun, 2 Jan 2011 09:58:30 -0800 (PST)
Subject: Re: Question regarding a System good candidate for Hadoop?
To: common-user@hadoop.apache.org

Thanks all for responding. Answers to the questions raised are below.

>Can you give more details? How do you do this currently?

This tax-calculation system has a pre-processing step that reads all the data from the database and writes comma-separated flat files.
These files then pass through a workflow of jobs (a series of dependent steps) in which each job's intermediate output is fed to the next job. Finally, all the generated tax data is written back into different tables in the database. Within this predefined workflow, some of the independent processes run in parallel. Currently the system runs on a single machine with 64 cores and does not use a distributed parallel-processing framework, which I am hoping Hadoop will address.

>Are the kinds of tasks you do currently easily changeable into map-reduce jobs?

They are definitely not easily changeable to MapReduce :(. Since the current logic uses multiple jobs in the workflow to produce intermediate outputs for the next job in line, I do not think a single map step can produce the output that is currently produced by, say, 10 processes in the workflow. Can I feed one map's output to the next map's input? Or can one map step have multiple stages to arrive at the right output?

>How much data do you process per day?

Processing-wise this is definitely a huge system; the complete run takes about two days (48 hours). Therefore, if this is moved to a Hadoop-based system, it can definitely be done in less than a day.

>Hadoop should be evaluated if your to-process dataset is large

Yes, the input data is very large and is definitely a good candidate for Hadoop.

> If you're going to stick to C, you have two options:
> - Hadoop Streaming [Flexible, but uses a pipe]
> - Loading a native shared library from the distributed cache. [This ought to be faster than the former]

In either case, will I be able to reuse my existing C job logic? Currently the individual outputs are not in the form that a map step generates.

I would also appreciate links to any case studies of real-world projects that were converted to Hadoop, which I could go over to learn more.
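On the question of feeding one map's output into the next: Hadoop does allow this by running jobs in sequence, with one job's output directory serving as the next job's input path (within a single job, chained mappers such as ChainMapper also exist). A minimal stand-alone sketch of the idea, with hypothetical record formats and a made-up 10% tax rate purely for illustration:

```python
# Two chained "map" stages: stage 1's output lines are stage 2's input
# lines, just as job 1's output directory becomes job 2's input path
# when Hadoop jobs are run in sequence.

def stage1_map(lines):
    # Parse comma-separated records into "id<TAB>amount" lines.
    for line in lines:
        record_id, amount = line.split(",")
        yield f"{record_id}\t{float(amount):.2f}"

def stage2_map(lines):
    # Consume stage 1 output and apply a hypothetical 10% tax rate.
    for line in lines:
        record_id, amount = line.split("\t")
        yield f"{record_id}\t{float(amount) * 0.10:.2f}"

raw = ["A1,100.0", "A2,250.0"]
intermediate = list(stage1_map(raw))    # "job 1" output
final = list(stage2_map(intermediate))  # "job 2" input is job 1's output
print(final)
```

Each stage only has to agree on the line format it reads and writes, which is exactly the coupling between chained Hadoop jobs.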
Thanks,
Saurabh

________________________________
From: Hari Sreekumar
To: common-user@hadoop.apache.org
Sent: Sun, January 2, 2011 2:38:04 AM
Subject: Re: Question regarding a System good candidate for Hadoop?

Can you give more details? How do you do this currently? Are the kinds of tasks you do currently easily changeable into map-reduce jobs? How much data do you process per day?

hari

On Sun, Jan 2, 2011 at 10:01 AM, Harsh J wrote:
> Hi,
>
> Hadoop should be evaluated if your to-process dataset is large (Large
> is relative to the size of the cluster you're going to use --
> basically using at least X amount of data such that all the processing
> power of your cluster is utilized for at least a good Y period).
>
> If you're going to stick to C, you have two options:
> - Hadoop Streaming [Flexible, but uses a pipe]
> - Loading a native shared library from the distributed cache. [This
> ought to be faster than the former]
>
> http://hadoop.apache.org/common/docs/current/native_libraries.html#Native+Shared+Libraries
>
> The best benchmark is always your own application.
>
> --
> Harsh J
> www.harshj.com
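On reusing existing C logic with Streaming: the Streaming contract is only that the mapper and reducer are executables reading lines on stdin and writing "key<TAB>value" lines on stdout, so an existing C program can often be plugged in with little change. A stand-alone sketch that emulates the Streaming pipeline (mapper | sort | reducer) by running two small stand-in scripts as external processes; the word-count example and all names here are illustrative only, and a compiled C binary would be swapped in the same way:

```python
# Emulate Hadoop Streaming's process model locally:
#   cat input | mapper | sort | reducer
# The mapper/reducer are arbitrary external programs speaking the
# line-oriented stdin/stdout contract (here, tiny Python stand-ins).
import subprocess
import sys

MAPPER = [sys.executable, "-c", (
    "import sys\n"
    "for word in sys.stdin.read().split():\n"
    "    print(word.lower() + '\\t1')"
)]

REDUCER = [sys.executable, "-c", (
    "import sys, itertools\n"
    "pairs = (line.rstrip('\\n').split('\\t') for line in sys.stdin)\n"
    "for key, group in itertools.groupby(pairs, key=lambda p: p[0]):\n"
    "    print(key + '\\t' + str(sum(int(v) for _, v in group)))"
)]

def streaming_job(mapper, reducer, text):
    """Run mapper and reducer as external processes with a sort between
    them, mirroring Streaming's map -> shuffle/sort -> reduce flow."""
    mapped = subprocess.run(mapper, input=text, capture_output=True,
                            text=True, check=True).stdout
    shuffled = "\n".join(sorted(mapped.splitlines())) + "\n"
    reduced = subprocess.run(reducer, input=shuffled, capture_output=True,
                             text=True, check=True).stdout
    return reduced.splitlines()

print(streaming_job(MAPPER, REDUCER, "tax data tax forms data tax"))
```

Because the framework only sees stdin and stdout, whether a job's internal logic is C, Python, or anything else is invisible to Hadoop; what matters is emitting the key/value line format the next stage expects.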