Subject: Re: Merging files
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@hadoop.apache.org
Date: Sun, 23 Dec 2012 10:30:52 -0500

https://github.com/edwardcapriolo/filecrush

^ Another option

On Sun, Dec 23, 2012 at 1:20 AM, Mohit Anchlia wrote:

> Thanks for the info. I was trying not to use NFS because my data size
> might be 10-20 GB for every merge I perform. I'll use Pig instead.
>
> In distcp I checked and none of the directories are duplicates. Looking at
> the logs, it looks like it's failing because all those directories have
> sub-directories of the same name.
>
> On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning wrote:
>
>> A Pig script should work quite well.
>>
>> I also note that the file paths have maprfs in them. This implies that
>> you are using MapR and could simply use the normal Linux command cat to
>> concatenate the files if you mount them over NFS (depending on volume,
>> of course). For small amounts of data, this would work very well. For
>> large amounts of data, you would be better off with some kind of
>> map-reduce program. Your Pig script is just the sort of thing.
>>
>> Keep in mind that if you write a map-reduce program (or Pig script), you
>> will wind up with as many output files as you have reducers. If you have
>> only a single reducer, you will get one output file, but that means only
>> a single process does all the writing. That would be no faster than the
>> cat + NFS method above. Having multiple reducers gives you write
>> parallelism.
>>
>> The error message that distcp is giving you is a little odd, however,
>> since it implies that some of your input files are repeated. Is that
>> possible?
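As a point of reference, here is a minimal sketch of the single-reducer merge job described above (the same idea as the "identity job with a single reducer" Harsh mentions further down). It assumes the Hadoop 1.x "new" MapReduce API and plain-text input; the class name, argument layout, and comments are illustrative only, not something posted in the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeFiles {

    // Re-key every line under NullWritable so the reducer simply rewrites the lines.
    public static class LineMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), line);
        }
    }

    public static class LineReducer
            extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> lines, Context ctx)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                ctx.write(key, line);   // TextOutputFormat writes only the value for NullWritable keys
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "merge-files");
        job.setJarByClass(MergeFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // One reducer means exactly one output file (part-r-00000), written by a
        // single process; raise this for write parallelism at the cost of getting
        // several output files.
        job.setNumReduceTasks(1);
        // All arguments except the last are source directories; the last argument
        // is a new output directory.
        for (int i = 0; i < args.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that, as both Ted and Harsh point out, a single reducer funnels all the data through one writer and gives no control over the relative order of lines from different inputs.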
>> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia wrote:
>>
>>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>>> write a Pig script to load from multiple paths.
>>>
>>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input,
>>> there are duplicated files in the sources:
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>>     at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>>     at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>>     at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>>     at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>>
>>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning wrote:
>>>
>>>> The technical term for this is "copying". You may have heard of it.
>>>>
>>>> It is a subject of such long technical standing that many do not
>>>> consider it worthy of detailed documentation.
>>>>
>>>> Distcp effects a similar process and can be modified to combine the
>>>> input files into a single file.
>>>>
>>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>>
>>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish wrote:
>>>>
>>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>>
>>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J wrote:
>>>>>
>>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>>> source streams into it. Or, to save code time, an identity job with a
>>>>>> single reducer (you may not get control over ordering this way).
>>>>>>
>>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia wrote:
>>>>>>
>>>>>> > Is it possible to merge files from different HDFS locations
>>>>>> > into one file in an HDFS location?
>>>>>>
>>>>>> --
>>>>>> Harsh J
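For completeness, a rough sketch of the first option Harsh describes, opening one target stream and copying every source stream into it with the Hadoop FileSystem API. The class name and argument layout are again illustrative, and the listing does not recurse into sub-directories.

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamMerge {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Last argument is the target file; everything before it is a source directory.
        Path target = new Path(args[args.length - 1]);
        OutputStream out = fs.create(target);
        try {
            for (int i = 0; i < args.length - 1; i++) {
                for (FileStatus stat : fs.listStatus(new Path(args[i]))) {
                    if (stat.isDir()) {
                        continue;                        // skip sub-directories
                    }
                    InputStream in = fs.open(stat.getPath());
                    try {
                        // false = leave the target stream open for the next source
                        IOUtils.copyBytes(in, out, conf, false);
                    } finally {
                        in.close();
                    }
                }
            }
        } finally {
            out.close();
        }
    }
}

Hadoop 1.x also ships FileUtil.copyMerge(), which does much the same thing for all the files under a single source directory, if one directory is all you need to collapse.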