hadoop-user mailing list archives

From Madhav Sharan <msha...@usc.edu>
Subject Re: merging small files in HDFS
Date Thu, 03 Nov 2016 19:14:34 GMT
Would a key/value-based sequence file format work for you? You can use the
name of each small file as the KEY and its content as the VALUE. Sequence
files can be passed as input to other jobs too.

[0] is a code reference that converts many small files into one big
sequence file in MapReduce fashion. [1] is a good blog post about it.
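
To make the KEY/VALUE idea concrete, here is a minimal sketch in plain Python.
It does not write the real Hadoop SequenceFile format; the length-prefixed
record layout and the names pack_records / unpack_records are assumptions made
up for illustration. It only shows why storing (file name, content) pairs lets
you recover each small file from the one big file later.

```python
import io
import struct


def pack_records(files):
    """Pack {name: bytes} into one blob of length-prefixed (key, value) records.

    Illustrative layout only (NOT the Hadoop SequenceFile format):
    4-byte key length, 4-byte value length, key bytes, value bytes.
    """
    buf = io.BytesIO()
    for name, content in files.items():
        key = name.encode("utf-8")
        buf.write(struct.pack(">II", len(key), len(content)))
        buf.write(key)
        buf.write(content)
    return buf.getvalue()


def unpack_records(blob):
    """Yield (name, content) pairs back out of a blob made by pack_records."""
    offset = 0
    while offset < len(blob):
        key_len, val_len = struct.unpack_from(">II", blob, offset)
        offset += 8
        name = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        content = blob[offset:offset + val_len]
        offset += val_len
        yield name, content
```

Because every record carries its own key and lengths, a reader can split the
merged data back into the original files, which plain concatenation cannot do.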

getmerge will work too, except that it merges the files onto the local fs, so
you will have to copy the result back to HDFS. It is best suited to cases
where this is a one-time activity, the file count isn't huge, and you want to
merge the file contents without needing to know where one file ends and the
next starts.
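
As a rough sketch of that round trip, the Python snippet below runs
"hadoop fs -getmerge" and then "hadoop fs -put" to copy the merged file back.
The function names and the example paths are hypothetical, and it assumes the
hadoop command is on the PATH and the local disk has room for the merged file.

```python
import subprocess


def getmerge_commands(hdfs_src_dir, local_file, hdfs_dest):
    """Build the two CLI calls: merge to the local fs, then copy back to HDFS."""
    return [
        ["hadoop", "fs", "-getmerge", hdfs_src_dir, local_file],
        ["hadoop", "fs", "-put", local_file, hdfs_dest],
    ]


def merge_via_local(hdfs_src_dir, local_file, hdfs_dest, run=subprocess.run):
    """Run both commands in order; check=True stops on the first failure."""
    for cmd in getmerge_commands(hdfs_src_dir, local_file, hdfs_dest):
        run(cmd, check=True)
```

For example, merge_via_local("/data/small", "/tmp/merged.txt",
"/data/merged.txt") would merge everything under the hypothetical /data/small
directory and upload the result.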

[0] - Code snippet -
https://github.com/USCDataScience/hadoop-pot/blob/master/hadoop-pot-core/src/main/java/org/pooledtimeseries/seqfile/TextVectorsToSequenceFile.java

[1] - Blog for handling small files -
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

Cheers!

--
Madhav Sharan


On Thu, Nov 3, 2016 at 6:58 AM, kumar, Senthil(AWF) <senthikumar@ebay.com>
wrote:

> Can't we use getmerge here? If your requirement is to merge some files in
> a particular directory into a single file:
>
>
>
> hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
>
>
>
> --Senthil
>
> -----Original Message-----
>
> From: Giovanni Mascari [mailto:giovanni.mascari@polito.it]
>
> Sent: Thursday, November 03, 2016 7:24 PM
>
> To: Piyush Mukati <piyush.mukati@gmail.com>; user@hadoop.apache.org
>
> Subject: Re: merging small files in HDFS
>
>
>
> Hi,
>
> If I understand your request correctly, you only need to merge some data
> resulting from an HDFS write operation.
>
> In this case, I suppose that your best option is to use Hadoop Streaming
> with the 'cat' command.
>
>
>
> Take a look here:
>
> https://hadoop.apache.org/docs/r1.2.1/streaming.html
>
>
>
> Regards
>
>
>
> On 03/11/2016 13:53, Piyush Mukati wrote:
>
> > Hi,
>
> > I want to merge multiple files in one HDFS dir to one file. I am
>
> > planning to write a map only job using input format which will create
>
> > only one inputSplit per dir.
>
> > This way my job doesn't need to do any shuffle/sort (only read and write
>
> > back to disk). Is there any such file format already implemented?
>
> > Or is there any better solution for the problem?
>
> >
>
> > thanks.
>
> >
>
