Date: Thu, 22 Jul 2010 13:39:43 -0700
Subject: Re: Is it possible to use NullWritable in combiner? + general question about combining output from many small maps
From: Leo Alekseyev
To: common-user@hadoop.apache.org

Thanks for everybody's responses. I think I've got things sorted out for the time being; some folks asked me to clarify my problem, so let me elaborate, for the list archives if nothing else.

In brief, say I have 1000 mappers, each outputting a 20 MB chunk. My job doesn't require a reduce step, but I'm not happy with the output being partitioned into many files smaller than the DFS block size. When I tried running e.g. 4 reducers so that I'd end up with four 5 GB files, the reduce step took considerably longer than the map step, with most of the time spent on network traffic during the shuffle. That seemed wasteful, given that the reducers' only purpose was to "glue together" the small files. I'm guessing there's no way around this: if you use reducers, you have to ship the chunks to the machines that combine them.
It is possible, however, to tweak the size of each map's output (and thus the number of map tasks) by adjusting the minimum input split size; for some of my jobs that is proving to be a good solution.

--Leo

On Wed, Jul 21, 2010 at 2:57 AM, Himanshu Vashishtha wrote:
> Please see my comments in-line, as per my understanding of Hadoop and your
> problems. See if they are helpful.
>
> Cheers,
> Himanshu
>
> On Wed, Jul 21, 2010 at 2:59 AM, Leo Alekseyev wrote:
>
>> Hi All,
>> I have a job where all processing is done by the mappers, but each
>> mapper produces a small file, which I want to combine into 3-4 large
>> ones. In addition, I only care about the values, not the keys, so a
>> NullWritable key is in order. I tried using the default reducer
>> (which according to the docs is identity) by setting
>> job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
>> using a NullWritable key on the mapper output. However, this seems to
>> concentrate the work on one reducer only.
>
> NullWritable is a singleton class, so all map output keyed on it
> will go to a single reduce node.
>
>> I then tried to output
>> LongWritable as the mapper key, and wrote a combiner that outputs
>> NullWritable (i.e. class GenerateLogLineProtoCombiner extends
>> Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
>> ProtobufLineMsgWritable>), still using the default reducer. This gave
>> me the following error, thrown by the combiner:
>>
>> 10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
>> attempt_201007122205_1058_m_000104_2, Status : FAILED
>> java.io.IOException: wrong key class: class
>> org.apache.hadoop.io.NullWritable is not class
>> org.apache.hadoop.io.LongWritable
>>         at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
>>         .........
>
> A combiner's goal is to lessen the reducer's workload. Its output
> key-value types should be the same as the mapper's output key-value
> types; hence the error.
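Himanshu's "singleton" point is worth spelling out for the archives. Hadoop's default HashPartitioner routes each record to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so a key type with only one possible value (like NullWritable) sends every record to the same reducer. A minimal standalone sketch of that arithmetic (plain Java, no Hadoop dependency; the key values are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionDemo {
    // Mirrors the arithmetic of Hadoop's HashPartitioner#getPartition.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;

        // Distinct LongWritable-style keys spread across all reducers...
        Map<Integer, Integer> load = new HashMap<>();
        for (long k = 0; k < 1000; k++) {
            load.merge(partition(Long.valueOf(k), reducers), 1, Integer::sum);
        }
        System.out.println("distinct keys -> partitions used: " + load.size());

        // ...but a singleton key (a stand-in for NullWritable, whose every
        // instance hashes identically) always lands on the same partition,
        // so one reducer does all the work.
        Object nullKey = new Object() { public int hashCode() { return 0; } };
        int p0 = partition(nullKey, reducers);
        System.out.println("singleton key -> always partition " + p0);
    }
}
```

With 4 reducers the distinct keys cover all 4 partitions, while the constant-hash key maps to a single partition every time, reproducing the "one reducer gets everything" behavior Leo observed.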
>
>> I was able to get things working by explicitly putting in an identity
>> reducer that takes (LongWritable key, value) and outputs
>> (NullWritable, value). However, now most of my processing is in the
>> reduce phase, which seems like a waste -- it's copying and sorting
>> data, but all I really need is to "glue" together the small map
>> outputs.
>>
>> Thus, my questions are: I don't really understand why the combiner is
>> throwing an error here. Does it simply not allow NullWritables on the
>> output?...
>> The second question is -- is there a standard strategy for quickly
>> combining the many small map outputs? Is it worth, perhaps, looking
>> into adjusting the min split size for the mappers?.. (Can this value
>> be adjusted dynamically based on the input file size?..)
>
> I don't know of any such strategy. How about defining a smaller number of
> reducers? I am also not able to understand the problem. It would be great if
> you could be a bit more specific (in terms of map input and output size, and
> reduce output size).
>
>> Thanks to anyone who can give me some pointers :)
>> --Leo
>
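For the archives: the min-split-size tuning Leo settled on is done through job configuration. In the Hadoop 0.20.x line the property is `mapred.min.split.size` (later renamed `mapreduce.input.fileinputformat.split.minsize`); raising it makes each map task consume more input, producing fewer, larger output files. A config sketch, with the 256 MB value chosen purely as an illustration:

```xml
<!-- mapred-site.xml or per-job configuration; the value is illustrative -->
<property>
  <name>mapred.min.split.size</name>
  <!-- 256 MB: lower bound on the bytes handed to each map task -->
  <value>268435456</value>
</property>
```

The same knob can be set programmatically in the new API via `FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024)`. Note it only merges splits within a single input file's blocks; gluing many small *files* into one split would need something like CombineFileInputFormat.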