Subject: Re: Merging files
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@hadoop.apache.org
Date: Sun, 23 Dec 2012 10:30:52 -0500

https://github.com/edwardcapriolo/filecrush

^ Another option

On Sun, Dec 23, 2012 at 1:20 AM, Mohit Anchlia wrote:

> Thanks for the info. I was trying not to use NFS because my data size
> might be 10-20 GB for every merge I perform. I'll use Pig instead.
>
> In distcp I checked and none of the directories are duplicates. Looking at
> the logs, it looks like it's failing because all those directories have
> sub-directories of the same name.
>
> On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning wrote:
>
>> A Pig script should work quite well.
>>
>> I also note that the file paths have maprfs in them. This implies that
>> you are using MapR and could simply use the normal Linux command cat to
>> concatenate the files if you mount them over NFS (depending on volume,
>> of course). For small amounts of data, this would work very well. For
>> large amounts of data, you would be better off with some kind of
>> map-reduce program. Your Pig script is just the sort of thing.
>>
>> Keep in mind that if you write a map-reduce program (or Pig script), you
>> will wind up with as many output files as you have reducers. If you have
>> only a single reducer, you will get one output file, but that means only
>> a single process does all the writing. That would be no faster than the
>> cat + NFS method above. Having multiple reducers gives you write
>> parallelism.
>>
>> The error message that distcp is giving you is a little odd, however,
>> since it implies that some of your input files are repeated. Is that
>> possible?
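As a point of reference, here is a minimal sketch of the single-reducer merge job described above (the same idea as the "identity job with a single reducer" Harsh mentions further down). It assumes the Hadoop 1.x "new" MapReduce API and plain-text input; the class name, argument layout, and comments are illustrative only, not something posted in the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeFiles {

    // Re-key every line under NullWritable so the reducer simply rewrites the lines.
    public static class LineMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), line);
        }
    }

    public static class LineReducer
            extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> lines, Context ctx)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                ctx.write(key, line);   // TextOutputFormat writes only the value for NullWritable keys
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "merge-files");
        job.setJarByClass(MergeFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // One reducer means exactly one output file (part-r-00000), written by a
        // single process; raise this for write parallelism at the cost of getting
        // several output files.
        job.setNumReduceTasks(1);
        // All arguments except the last are source directories; the last argument
        // is a new output directory.
        for (int i = 0; i < args.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that, as both Ted and Harsh point out, a single reducer funnels all the data through one writer and gives no control over the relative order of lines from different inputs.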
>> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia wrote:
>>
>>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>>> write a Pig script to load from multiple paths.
>>>
>>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input,
>>> there are duplicated files in the sources:
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>>     at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>>     at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>>     at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>>     at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>>
>>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning wrote:
>>>
>>>> The technical term for this is "copying". You may have heard of it.
>>>>
>>>> It is a subject of such long technical standing that many do not
>>>> consider it worthy of detailed documentation.
>>>>
>>>> Distcp effects a similar process and can be modified to combine the
>>>> input files into a single file.
>>>>
>>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>>
>>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish wrote:
>>>>
>>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>>
>>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J wrote:
>>>>>
>>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>>> source streams into it. Or, to save code time, an identity job with a
>>>>>> single reducer (you may not get control over ordering this way).
>>>>>>
>>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia wrote:
>>>>>>
>>>>>> > Is it possible to merge files from different HDFS locations
>>>>>> > into one file in an HDFS location?
>>>>>>
>>>>>> --
>>>>>> Harsh J
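For completeness, a rough sketch of the first option Harsh describes, opening one target stream and copying every source stream into it with the Hadoop FileSystem API. The class name and argument layout are again illustrative, and the listing does not recurse into sub-directories.

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamMerge {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Last argument is the target file; everything before it is a source directory.
        Path target = new Path(args[args.length - 1]);
        OutputStream out = fs.create(target);
        try {
            for (int i = 0; i < args.length - 1; i++) {
                for (FileStatus stat : fs.listStatus(new Path(args[i]))) {
                    if (stat.isDir()) {
                        continue;                        // skip sub-directories
                    }
                    InputStream in = fs.open(stat.getPath());
                    try {
                        // false = leave the target stream open for the next source
                        IOUtils.copyBytes(in, out, conf, false);
                    } finally {
                        in.close();
                    }
                }
            }
        } finally {
            out.close();
        }
    }
}

Hadoop 1.x also ships FileUtil.copyMerge(), which does much the same thing for all the files under a single source directory, if one directory is all you need to collapse.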