From: Harsh J
Date: Tue, 20 Aug 2013 11:08:57 +0530
Subject: Re: produce a large sequencefile (1TB)
To: user@hadoop.apache.org

Unfortunately, given the way reducers work today, you wouldn't be able
to do this. They are designed to fetch all of their data before the
merge, sort and pass through the reduce implementation, and for that
to work you need, as you've deduced yourself, at least that much space
available locally.

What you could do, however, is run a map-only job, let it produce the
smaller files, and then run a non-MR Java app that reads them all one
by one and appends each record to a single SequenceFile on HDFS. This
behaves like a reducer, minus the local sort phase.

If the sort is important to you as well, then you will have to go
further: use multiple reducers with total order partitioning, so the
part files come out globally ordered, and then run the same external
Java app over them.
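
In case it helps, here is a rough, untested sketch of what that
concatenator app could look like. I'm assuming Text keys and values
and block compression here; substitute whatever your files actually
carry:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ConcatSequenceFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]); // dir with the map-only part files
    Path outFile = new Path(args[1]);  // the single merged SequenceFile

    // One writer for the merged file; block compression keeps it compact.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outFile, Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    Text key = new Text();
    Text val = new Text();
    try {
      for (FileStatus part : fs.listStatus(inputDir)) {
        // Skip _SUCCESS and other non-part files.
        if (!part.getPath().getName().startsWith("part-")) continue;
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, part.getPath(), conf);
        try {
          // Stream every record straight into the merged file.
          while (reader.next(key, val)) {
            writer.append(key, val);
          }
        } finally {
          reader.close();
        }
      }
    } finally {
      writer.close();
    }
  }
}

If what you're ultimately after is the MapFile you mention below, a
MapFile.Writer could be dropped in place of the SequenceFile.Writer,
but it insists on keys arriving in sorted order - which is exactly
where the total order partitioning comes in.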
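
For the sorted variant, the driver-side additions would look roughly
like this (the reducer count, partition file path and sampling
parameters are placeholders, not recommendations; the usual
TotalOrderPartitioner caveat applies too, i.e. the sampler reads the
map *input* keys, so those must sort the same way as your map output
keys):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// ... inside the driver, after input/output paths and formats are set:
Job job = Job.getInstance(conf, "totally-ordered-parts");
job.setNumReduceTasks(30); // e.g. one reducer per datanode
job.setPartitionerClass(TotalOrderPartitioner.class);

// Placeholder location for the computed split points.
Path partitionFile = new Path("/tmp/_partitions.lst");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

// Sample roughly 1% of the input keys (at most 10000 samples) to pick
// the reducer boundaries; tune these numbers for your data.
InputSampler.writePartitionFile(
    job, new InputSampler.RandomSampler<Text, Text>(0.01, 10000));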
On Tue, Aug 20, 2013 at 8:25 AM, Bing Jiang wrote:
> Hi Jerry,
>
> I wonder whether it would be acceptable to use multiple reducers and
> generate more than one MapFile (IndexFile + DataFile). I would like
> to understand the real difficulty with running multiple reducers plus
> a post-processing step. Or are there constraints in your application?
>
>
> 2013/8/20 Jerry Lam
>>
>> Hi Bing,
>>
>> you are correct. The local storage does not have enough capacity to
>> hold the temporary files generated by the mappers. Since we want a
>> single sequence file at the end, we are forced to use 1 reducer.
>>
>> The use case is that we want to generate an index for the 1TB
>> sequence file so that we can randomly access each row in the
>> sequence file. In practice, this is simply a MapFile.
>>
>> Any idea how to resolve this dilemma is greatly appreciated.
>>
>> Jerry
>>
>>
>> On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang wrote:
>>>
>>> Hi Jerry,
>>> I think you are worried about the volume of the MapReduce local
>>> files, but would you give us more details about your application?
>>>
>>> On Aug 20, 2013 6:09 AM, "Jerry Lam" wrote:
>>>>
>>>> Hi Hadoop users and developers,
>>>>
>>>> I have a use case in which I need to produce a single sequence
>>>> file of 1 TB in size, while each datanode has only 200GB of
>>>> storage (I have 30 datanodes).
>>>>
>>>> The problem is that no single reducer can hold 1TB of data during
>>>> the reduce phase to generate a single sequence file, even if I use
>>>> aggressive compression. Any datanode will run out of space since
>>>> this is a single-reducer job.
>>>>
>>>> Any comment and help is appreciated.
>>>>
>>>> Jerry
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase

--
Harsh J