From: Harsh J
Date: Sat, 21 Aug 2010 02:32:14 +0530
Subject: Re: HDFS efficiently concat&split files
To: common-user@hadoop.apache.org

Appending one file to the other on HDFS itself would be the optimal way
to do this, but append support is not available in Apache Hadoop HDFS
0.20.x and earlier releases; it is only available in trunk / 0.21. A
similar question asked earlier got some replies as well, so you might
want to check it out:
http://www.mentby.com/Group/hadoop-core-user/concatenating-files-on-hdfs.html
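Once you are on a release with append, something along these lines is
what I have in mind -- an untested sketch, with placeholder paths and a
made-up class name, assuming the target file already exists on HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsConcat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder paths: append the bytes of 'source' onto 'target'.
    Path target = new Path("/user/teodor/part-1");
    Path source = new Path("/user/teodor/part-2");

    FSDataOutputStream out = fs.append(target); // throws if append is unsupported
    FSDataInputStream in = fs.open(source);
    try {
      // Stream-copy the source into the open tail of the target.
      IOUtils.copyBytes(in, out, conf, false);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }

    // Optionally drop the source once it has been appended:
    // fs.delete(source, false);
  }
}

Note that this still reads the source file once, but it avoids
rewriting the target, so it should cut the I/O roughly in half compared
with copying both files into a new one.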
2010/8/21 Teodor Macicas:
> Hi again,
>
> Please, can anyone suggest how to stick (concatenate) 2 HDFS files into
> one bigger file with less I/O time?
>
> Thank you.
> Regards,
> Teodor
>
> On 08/20/2010 12:34 PM, Teodor Macicas wrote:
>> Hi,
>>
>> Basically, you are right. But in my case the second input is a
>> combination of previous outputs.
>> As I mentioned, I want only a certain amount of bytes to be the
>> second input. Hence, I need to split some files.
>>
>> Also, I need to concatenate the final reducers' outputs. Does anyone
>> know how to concatenate 2 files faster in HDFS?
>>
>> Thank you.
>> Best,
>> Teodor
>>
>> On 08/20/2010 10:44 AM, xiujin yang wrote:
>>>
>>> Hi
>>>
>>> For MapReduce it is easy to make the first job's output the second
>>> job's input. You just need to point to the path and it will be OK.
>>>
>>> Xiujinyang
>>>
>>>> Date: Thu, 19 Aug 2010 19:11:53 +0200
>>>> From: teodor.macicas@epfl.ch
>>>> To: common-user@hadoop.apache.org
>>>> Subject: Re: HDFS efficiently concat&split files
>>>>
>>>> Hello,
>>>>
>>>> I was expecting this question.
>>>> The reason is that I want to run 2 MR jobs on the same data in the
>>>> following manner: the output of the 1st job is collected, and then I
>>>> want to create bins of a certain amount of bytes which will be the
>>>> input for the next jobs. At the end I want to isolate each processed
>>>> bin's results [the 2nd reducers' outputs].
>>>>
>>>> Anyway, I do have a reason for wanting to do this.
>>>> Any ideas?
>>>>
>>>> Thank you.
>>>> -Tedy
>>>>
>>>> On 08/19/2010 06:57 PM, Harsh J wrote:
>>>>> Hello,
>>>>>
>>>>> Why are you looking to concatenate or split files on the HDFS? I am
>>>>> just curious, because using directories as inputs and outputs works
>>>>> fine with Hadoop MR and HDFS, as the latter uses a block storage
>>>>> concept at its core.
>>>>>
>>>>> On Thu, Aug 19, 2010 at 4:28 PM, Teodor Macicas wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Does anyone know how to efficiently concatenate 2 different files
>>>>>> in HDFS, as well as split a file into 2 different ones?
>>>>>> I did this by reading from one file and writing to another. Of
>>>>>> course, this is very slow; a lot of I/O time was spent. Since it
>>>>>> is only a splitting or putting-together job, I am wondering if I
>>>>>> can do this faster.
>>>>>>
>>>>>> Also, what can I do in order to control a reducer's output file
>>>>>> size? This could be a solution to the previous question. If I were
>>>>>> able to do this, further concats & splits would not be necessary.
>>>>>>
>>>>>> Thank you for your help.
>>>>>> Best,
>>>>>> Teodor

--
Harsh J
www.harshj.com