From: Fernando Padilla <fern@alum.mit.edu>
Date: Tue, 12 Feb 2008 13:33:56 -0800
To: core-user@hadoop.apache.org
CC: Miles Osborne
Subject: Re: key/value after reduce

Well, I'm no Hadoop expert, but let me brainstorm for a bit.

Aren't there output classes that take a key/value pair as input and then get to decide how and what to actually write? That's how you can direct the output straight to HBase, etc. You could create (and Hadoop should probably include by default) a ValueOutputEncoder that does nothing but output the values, ignoring the key part. Thus you get what you want: output that isn't necessarily a key/value pair. (See the first sketch below.)

You could even have an outputter that takes an InputStream as the value part, so that it could stream the output. Possibly? How far off is this idea?

There's also nothing holding you back from having your Reducer write directly to another datastore. The "output" of the reduce job would then be empty, or, for debugging, maybe the content length of what it stored elsewhere. :) (See the second sketch below.)

But keep in mind, I think the BIG idea behind Hadoop is divide and conquer: arbitrarily cut up the input, transform it once, sort, transform it once more, output. And the idea is that this should hopefully support N different output files.
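Something like this, maybe. I'm writing against the org.apache.hadoop.mapred API; the name ValueOnlyOutputFormat is made up, I haven't compiled or run it, and in older releases FileOutputFormat was called OutputFormatBase, so treat it purely as a sketch:

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.util.Progressable;

  // Writes only the raw value bytes to the output file; the key is
  // dropped entirely, so the result is just a binary stream of values.
  public class ValueOnlyOutputFormat<K> extends FileOutputFormat<K, BytesWritable> {

    public RecordWriter<K, BytesWritable> getRecordWriter(FileSystem ignored,
        JobConf job, String name, Progressable progress) throws IOException {
      Path file = FileOutputFormat.getTaskOutputPath(job, name);
      FileSystem fs = file.getFileSystem(job);
      final FSDataOutputStream out = fs.create(file, progress);
      return new RecordWriter<K, BytesWritable>() {
        public void write(K key, BytesWritable value) throws IOException {
          out.write(value.getBytes(), 0, value.getLength());  // no key, no separator
        }
        public void close(Reporter reporter) throws IOException {
          out.close();
        }
      };
    }
  }

Hooking it up in the job driver would then just be (MyDriver is a placeholder):

  JobConf conf = new JobConf(MyDriver.class);
  conf.setOutputFormat(ValueOnlyOutputFormat.class);
  conf.setOutputValueClass(BytesWritable.class);
  conf.setNumReduceTasks(1);  // optional: one reduce task means one output file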
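And here's roughly what I mean by writing straight to a datastore from the Reducer. ExternalStore and openStore() are completely made-up stand-ins for whatever client your store actually provides; the only point is that reduce() never touches the OutputCollector:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Pushes results into an external store and collects nothing, so the
  // job's HDFS output directory stays empty.
  public class DirectWriteReducer extends MapReduceBase
      implements Reducer<Text, Text, NullWritable, NullWritable> {

    // Hypothetical stand-in for your datastore's client API.
    interface ExternalStore {
      void put(String key, String value) throws IOException;
      void close() throws IOException;
    }

    private ExternalStore store;

    public void configure(JobConf job) {
      store = openStore(job.get("mystore.url"));  // made-up config key
    }

    private ExternalStore openStore(String url) {
      // Wire up your real client here; this is only a placeholder.
      throw new UnsupportedOperationException("no client configured for " + url);
    }

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        store.put(key.toString(), values.next().toString());
      }
      // nothing is written through the OutputCollector
    }

    public void close() throws IOException {
      store.close();
    }
  }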
Anyway, I'm guessing the key/value pair arrangement gives those output files context and meaning; otherwise you wouldn't be able to conceptually put them back together into a coherent collection of data.

I just remembered: you can force it to use only 1 reduce task, and thus get only one output file, but that won't scale perfectly. :) Still, for your purposes you could have M map tasks, 1 reduce task, and a ValueOutputEncoder that ignores the key part and only spits out a binary file. :)

Yuri Pradkin wrote:
> But OTOH, if I wanted my reducer to write binary output, I'd be
> screwed, especially so in the streaming world (where I'd like to stay
> for the moment).
>
> Actually, I don't think I understand your point: if the reducer's
> output is in a key/value format, you can still run another map or
> another reduce over it, can't you? If the output isn't, you can't;
> it's up to the user who coded up the Reducer. What am I missing?
>
> Thanks,
>
>   -Yuri
>
> On Tue 12 2008, Miles Osborne wrote:
>> You may well have another Map operation operate over the Reducer
>> output, in which case you'd want key-value pairs.
>>
>> Miles
>>
>> On 12/02/2008, Yuri Pradkin wrote:
>>> Hi,
>>>
>>> I'm relatively new to Hadoop and I have what I hope is a simple
>>> question:
>>>
>>> I don't understand why the key/value assumption is preserved AFTER
>>> the reduce operation; in other words, why is the output of a reducer
>>> expected as <key, value> pairs instead of arbitrary, possibly binary
>>> bytes? Why can't OutputCollector just give those raw bytes to the
>>> RecordWriter and have it make sense of them as it pleases, or just
>>> dump them to a file?
>>>
>>> This seems like an unnecessary restriction to me, at least at
>>> first glance.
>>>
>>> Thanks,
>>>
>>>   -Yuri