From: Saurabh Gokhale
Date: Sun, 2 Jan 2011 09:58:30 -0800 (PST)
Subject: Re: Question regarding a System good candidate for Hadoop?
To: common-user@hadoop.apache.org

Thanks all for responding. Answers to the questions raised are below.

>Can you give more details? How do you do this currently?

This tax-calculation system has a pre-processing step that reads all the data from the database and writes comma-separated flat files.
These files then pass through a workflow of jobs (a series of dependent steps) in which each job's intermediate output is fed to the next job. Finally, all the generated tax data is written back into different tables in the database. Within this predefined workflow, some of the independent processes run in parallel. Currently the system runs on a single machine with 64 cores and does not use a distributed parallel-processing framework, which I am hoping Hadoop will address.

>Are the kinds of tasks you do currently easily changeable into map-reduce jobs?

They are definitely not easily changeable to MapReduce :(. Since the current logic uses multiple jobs in the workflow to produce intermediate outputs for the next job in line, I do not think a single map step can produce the output that is currently produced by, say, 10 processes in the workflow. Can I feed one map's output to the next map's input? Or can one map step have multiple stages to arrive at the right output?

>How much data do you process per day?

Processing-wise this is definitely a huge system; the complete run takes about two days (48 hours). Therefore, if this is moved to a Hadoop-based system, it can definitely be done in less than a day.

>Hadoop should be evaluated if your to-process dataset is large

Yes, the input data is very large and is definitely a good candidate for Hadoop.

> If you're going to stick to C, you have two options:
> - Hadoop Streaming [Flexible, but uses a pipe]
> - Loading a native shared library from the distributed cache. [This ought to be faster than the former]

In either case, will I be able to reuse my existing C job logic? Currently the individual outputs are not in the form that a map step generates.

I would also appreciate links to any case studies of real-world projects that were converted to Hadoop, which I could go over to learn more.
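On the question of feeding one map's output into the next: Hadoop does allow this by running jobs in sequence, with one job's output directory serving as the next job's input path (within a single job, chained mappers such as ChainMapper also exist). A minimal stand-alone sketch of the idea, with hypothetical record formats and a made-up 10% tax rate purely for illustration:

```python
# Two chained "map" stages: stage 1's output lines are stage 2's input
# lines, just as job 1's output directory becomes job 2's input path
# when Hadoop jobs are run in sequence.

def stage1_map(lines):
    # Parse comma-separated records into "id<TAB>amount" lines.
    for line in lines:
        record_id, amount = line.split(",")
        yield f"{record_id}\t{float(amount):.2f}"

def stage2_map(lines):
    # Consume stage 1 output and apply a hypothetical 10% tax rate.
    for line in lines:
        record_id, amount = line.split("\t")
        yield f"{record_id}\t{float(amount) * 0.10:.2f}"

raw = ["A1,100.0", "A2,250.0"]
intermediate = list(stage1_map(raw))    # "job 1" output
final = list(stage2_map(intermediate))  # "job 2" input is job 1's output
print(final)
```

Each stage only has to agree on the line format it reads and writes, which is exactly the coupling between chained Hadoop jobs.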
Thanks,
Saurabh

________________________________
From: Hari Sreekumar
To: common-user@hadoop.apache.org
Sent: Sun, January 2, 2011 2:38:04 AM
Subject: Re: Question regarding a System good candidate for Hadoop?

Can you give more details? How do you do this currently? Are the kinds of tasks you do currently easily changeable into map-reduce jobs? How much data do you process per day?

hari

On Sun, Jan 2, 2011 at 10:01 AM, Harsh J wrote:
> Hi,
>
> Hadoop should be evaluated if your to-process dataset is large (Large
> is relative to the size of the cluster you're going to use --
> basically using at least X amount of data such that all the processing
> power of your cluster is utilized for at least a good Y period).
>
> If you're going to stick to C, you have two options:
> - Hadoop Streaming [Flexible, but uses a pipe]
> - Loading a native shared library from the distributed cache. [This
> ought to be faster than the former]
>
> http://hadoop.apache.org/common/docs/current/native_libraries.html#Native+Shared+Libraries
>
> The best benchmark is always your own application.
>
> --
> Harsh J
> www.harshj.com
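On reusing existing C logic with Streaming: the Streaming contract is only that the mapper and reducer are executables reading lines on stdin and writing "key<TAB>value" lines on stdout, so an existing C program can often be plugged in with little change. A stand-alone sketch that emulates the Streaming pipeline (mapper | sort | reducer) by running two small stand-in scripts as external processes; the word-count example and all names here are illustrative only, and a compiled C binary would be swapped in the same way:

```python
# Emulate Hadoop Streaming's process model locally:
#   cat input | mapper | sort | reducer
# The mapper/reducer are arbitrary external programs speaking the
# line-oriented stdin/stdout contract (here, tiny Python stand-ins).
import subprocess
import sys

MAPPER = [sys.executable, "-c", (
    "import sys\n"
    "for word in sys.stdin.read().split():\n"
    "    print(word.lower() + '\\t1')"
)]

REDUCER = [sys.executable, "-c", (
    "import sys, itertools\n"
    "pairs = (line.rstrip('\\n').split('\\t') for line in sys.stdin)\n"
    "for key, group in itertools.groupby(pairs, key=lambda p: p[0]):\n"
    "    print(key + '\\t' + str(sum(int(v) for _, v in group)))"
)]

def streaming_job(mapper, reducer, text):
    """Run mapper and reducer as external processes with a sort between
    them, mirroring Streaming's map -> shuffle/sort -> reduce flow."""
    mapped = subprocess.run(mapper, input=text, capture_output=True,
                            text=True, check=True).stdout
    shuffled = "\n".join(sorted(mapped.splitlines())) + "\n"
    reduced = subprocess.run(reducer, input=shuffled, capture_output=True,
                             text=True, check=True).stdout
    return reduced.splitlines()

print(streaming_job(MAPPER, REDUCER, "tax data tax forms data tax"))
```

Because the framework only sees stdin and stdout, whether a job's internal logic is C, Python, or anything else is invisible to Hadoop; what matters is emitting the key/value line format the next stage expects.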