From: Hemanth Yamijala <hemanty@thoughtworks.com>
To: user@hadoop.apache.org
Date: Tue, 4 Dec 2012 14:06:49 +0530
Subject: Re: Using Hadoop infrastructure with input streams instead of key/value input

Hi,

I have not tried this myself before, but would libhdfs help?
http://hadoop.apache.org/docs/stable/libhdfs.html

Thanks
Hemanth
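A minimal, untested sketch of the read pattern libhdfs gives you (hdfsConnect,
hdfsOpenFile, hdfsRead and hdfsCloseFile are the documented C calls; the path
and buffer size below are just placeholders):

#include <fcntl.h>   /* O_RDONLY */
#include <stdio.h>
#include "hdfs.h"    /* the libhdfs header */

int main(void) {
    /* "default" and port 0 pick up fs.default.name from the Hadoop config. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) {
        fprintf(stderr, "hdfsConnect failed\n");
        return 1;
    }

    /* Placeholder path; a map task would be handed the file it should read. */
    hdfsFile in = hdfsOpenFile(fs, "/data/video/clip1.dat", O_RDONLY, 0, 0, 0);
    if (!in) {
        fprintf(stderr, "hdfsOpenFile failed\n");
        hdfsDisconnect(fs);
        return 1;
    }

    char buf[65536];
    tSize n;
    /* libhdfs fetches local and remote HDFS blocks transparently as you read. */
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
        /* ... parse the bytes however the application needs ... */
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}

hdfsSeek and hdfsPread are also available if the file format needs random
access rather than a straight scan.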

On Mon, Dec 3, 2012 at 9:52 PM, Wheeler, Bill NPO <bill.npo.wheeler@intel.com> wrote:

> I am trying to use Hadoop's partitioning/scheduling/storage infrastructure
> to process many HDFS files of data in parallel (1 HDFS file per map task),
> but in a way that does not naturally fit into the key/value pair input
> framework. Specifically, my application's "map" function equivalent does
> not want to receive formatted data as key/value pairs; instead, I'd like to
> receive a Hadoop input stream object for my map processing so that I can
> read bytes out in many different ways, with much greater flexibility and
> efficiency than what I'd get with the key/value pair input constraint. The
> input stream would handle the complexity of fetching local and remote HDFS
> data blocks as needed on my behalf. The result of the map processing would
> then conform to key/value pair map outputs and be subsequently processed by
> traditional reduce code.
> I'm guessing that I am not the only person who would like to read HDFS file
> input directly, as this capability could open up new types of Hadoop use
> models. Is there any support for acquiring input streams directly into Java
> map code? And is there any support for doing the same into C++ map code a
> la Pipes?
> For added context, my application is in the video analytics space,
> requiring me to read video files. I have implemented a solution, but it is
> a hack with less-than-ideal characteristics: I have RecordReader code which
> simply passes the HDFS filename through in the key field of my key/value
> input. I'm using Pipes to implement the map function in C++ code. The C++
> map code then performs a system call, "hadoop fs -copyToLocal hdfs_filename
> local_filename", to put the entire HDFS file on the datanode's local file
> system, where it is readable by C++ IO calls. I then simply open up this
> file and process it. It would be much better to avoid having to do all the
> extra IO associated with "copyToLocal" and instead somehow receive an input
> stream object from which to read directly from HDFS.
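> In code form, the workaround is roughly the following (an untested sketch;
> the local path and buffer size are illustrative):
>
> #include <stdio.h>
> #include <stdlib.h>
>
> /* Copy the whole HDFS file to local disk, then read it back with plain
>    C IO; the copy is the extra IO I would like to eliminate. */
> static void process_file(const char *hdfs_filename) {
>     char cmd[1024];
>     snprintf(cmd, sizeof(cmd),
>              "hadoop fs -copyToLocal %s /tmp/video_copy", hdfs_filename);
>     if (system(cmd) != 0)
>         return;
>
>     FILE *f = fopen("/tmp/video_copy", "rb");
>     if (!f)
>         return;
>
>     char buf[65536];
>     size_t n;
>     while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
>         /* ... decode the video bytes ... */
>     }
>     fclose(f);
> }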


> Any way of doing this in a more elegant fashion?
> Thanks,
> Bill