hadoop-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Merging files
Date Sat, 22 Dec 2012 22:05:51 GMT
A Pig script should work quite well.

I also note that the file paths have maprfs in them.  This implies that you
are using MapR and could simply use the normal Linux command cat to
concatenate the files if you mount the files using NFS (depending on
volume, of course).  For small amounts of data, this would work very well.
For large amounts of data, you would be better off with some kind of
map-reduce program.  Your Pig script is just the sort of thing.
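
For the small-data case, the cat approach amounts to opening one target
stream and copying each source stream into it. A minimal Python sketch of
that idea follows, assuming the files are reachable as ordinary paths
through an NFS mount. The function name concatenate, the file names, and
the temporary directory standing in for the mount point are all
hypothetical and not taken from this thread.

```python
import os
import shutil
import tempfile


def concatenate(sources, target):
    """Write each source file, in the order given, into one target file."""
    with open(target, "wb") as out:
        for src in sources:
            with open(src, "rb") as part:
                # Stream in chunks so large inputs are never held in memory.
                shutil.copyfileobj(part, out)


if __name__ == "__main__":
    # A temporary directory stands in for the NFS mount point here.
    with tempfile.TemporaryDirectory() as mount:
        names = ["part-a", "part-b"]  # hypothetical file names
        for name in names:
            with open(os.path.join(mount, name), "w") as f:
                f.write(name + "\n")
        merged = os.path.join(mount, "merged")
        concatenate([os.path.join(mount, n) for n in names], merged)
        with open(merged) as f:
            print(f.read(), end="")
```

The order of the sources list decides the order of the merged content, so
sort it explicitly if ordering matters.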

Keep in mind that if you write a map-reduce program (or Pig script), you
will wind up with as many output files as you have reducers.  If you have
only a single reducer, you will get one output file, but that will mean
that only a single process will do all the writing.  That would be no
faster than using the cat + NFS method above.  Having multiple reducers
will allow you to have write parallelism.
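
The relationship between reducer count and output count can be pictured
with a toy model. This is plain Python, not Hadoop's API: the function
partition and the sample records are made up for illustration, and the
CRC32 hash is only a stand-in for whatever partitioner a real job uses.

```python
import zlib


def partition(records, num_reducers):
    """Split (key, value) records into one output list per reducer."""
    # One output per reducer, even if a reducer receives no records.
    outputs = [[] for _ in range(num_reducers)]
    for key, value in records:
        # A deterministic hash of the key picks the reducer.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        outputs[index].append((key, value))
    return outputs


if __name__ == "__main__":
    records = [("host%d" % i, i) for i in range(10)]  # hypothetical data
    for n in (1, 4):
        outputs = partition(records, n)
        print(n, "reducer(s) ->", len(outputs), "output(s)")
```

With one reducer every record lands in the same output, which is the
single-writer case described above. With several, the records are spread
across outputs that can be written independently.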

The error message that distcp is giving you is a little odd, however, since
it implies that some of your input files are repeated.  Is that possible?
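
One thing worth ruling out, offered as a guess rather than something
established in this thread: both paths named in the exception end in the
same final component, appinfo. If the duplicate check compares the names
the files would have at the destination rather than their full source
paths, two distinct files would still collide. A small sketch for
spotting such collisions in a list of source paths follows. The function
name find_name_collisions is hypothetical. The two sample paths are the
ones from the error message.

```python
import posixpath
from collections import defaultdict


def find_name_collisions(sources):
    """Group source paths by final path component; keep names seen twice."""
    by_name = defaultdict(list)
    for src in sources:
        # Strip a trailing slash so directories are keyed by their name.
        by_name[posixpath.basename(src.rstrip("/"))].append(src)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    sources = [
        "maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo",
        "maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo",
    ]
    for name, paths in find_name_collisions(sources).items():
        print(name, "appears in", len(paths), "source paths")
```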



On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> Tried distcp but it fails. Is there a way to merge them? Or else I could
> write a pig script to load from multiple paths
>
>
> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
> are duplicated files in the sources:
> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>     at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>     at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>     at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>     at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>
>
> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>
>> The technical term for this is "copying".  You may have heard of it.
>>
>> It is a subject of such long technical standing that many do not consider
>> it worthy of detailed documentation.
>>
>> Distcp effects a similar process and can be modified to combine the input
>> files into a single file.
>>
>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>
>>
>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <barak.yaish@gmail.com> wrote:
>>
>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>
>>>
>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <harsh@cloudera.com> wrote:
>>>
>>>> Yes, via the simple act of opening a target stream and writing all
>>>> source streams into it. Or to save code time, an identity job with a
>>>> single reducer (you may not get control over ordering this way).
>>>>
>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>>>> > Is it possible to merge files from different locations from HDFS location
>>>> > into one file into HDFS location?
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>
