From: Mahesh Balija
To: user@hadoop.apache.org
Date: Wed, 28 Nov 2012 12:25:55 +0530
Subject: Re: advice

Hi Jamal,

Please see my answers inline.

On Wed, Nov 28, 2012 at 10:47 AM, jamal sasha <jamalshasha@gmail.com> wrote:
> Hi,
> Lately, I have been writing a lot of algorithms in the map reduce
> abstraction in Python (Hadoop streaming).
> I have got the hang of it (I think)...
> I have a couple of questions:
> 1) By not using Java libraries, what power of Hadoop am I missing?

Though I am not very sure:
-> I believe the streaming API gives you less control over the job than
   the Java API does.
-> With Java, in the reduce phase the values for a given key are
   automatically grouped and handed to you as an Iterator. In streaming
   jobs, the user has to take care of aggregating/processing the values
   by key.
-> In the normal (Java) case the framework calls the map function once
   per line, but in streaming your script reads its input itself, so you
   have better control over processing multiple lines together.

> 2) I know that this is just the tip of the iceberg. Can someone point
> out, from practical usage, what are some of the concepts I should focus
> on next (like maybe practising combiners or HDFS?) which will improve
> my current practical knowledge, and then of course the not so practical
> part as well?
> Sorry for being so vague.

-> It's better to start by learning the basics of the HDFS and MapReduce
   architectures, and then concepts like combiners, partitioners,
   RecordReaders, InputFormats, OutputFormats, etc.

Best,
Mahesh Balija,
Calsoft Labs.
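
To make the two streaming points above concrete (the script iterates over its input lines itself, and the reducer has to group values by key on its own), here is a minimal Python sketch. It is only an illustration, not code from this thread: the word-count task and the names map_lines, reduce_sorted and run are hypothetical, and it assumes tab-separated "key<TAB>value" lines that reach the reducer already sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper sketch: we loop over the input stream ourselves and emit (word, 1)."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(lines, sep="\t"):
    """Reducer sketch: group consecutive 'key<sep>count' lines and sum the counts.

    Assumes the lines arrive sorted by key, so equal keys are adjacent.
    Lines without the separator are skipped.
    """
    pairs = (line.rstrip("\n").split(sep, 1) for line in lines if sep in line)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


def run(role, stdin, stdout):
    """Hypothetical entry point: role is 'map' or 'reduce'."""
    records = map_lines(stdin) if role == "map" else reduce_sorted(stdin)
    for key, value in records:
        stdout.write("%s\t%s\n" % (key, value))


# Only act as a streaming script when called with an explicit role argument.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    run(sys.argv[1], sys.stdin, sys.stdout)
```

With Java the grouping done by groupby here is what the framework hands to reduce() as an Iterator. If the sketch were saved under a hypothetical name such as wc.py, a local dry run could pipe a text file through "python wc.py map", then "sort", then "python wc.py reduce" to mimic the sort between the two phases.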