hadoop-common-user mailing list archives

From sridhar basam <...@basam.org>
Subject Re: What is the most efficient way to copy a large number of .gz files into HDFS?
Date Fri, 12 Aug 2011 17:51:35 GMT
On Fri, Aug 12, 2011 at 1:29 PM, W.P. McNeill <billmcn@gmail.com> wrote:

> I have a large number of gzipped web server logs on NFS that I need to pull
> into HDFS for analysis by MapReduce.  What is the most efficient way to do
> this?
>
> It seems like what I should do is:
>
> hadoop fs -copyFromLocal *.gz /my/HDFS/directory
>
> A couple of questions:
>
>   1. Is this single process, or will the files be copied up in parallel?
>

It will use a single process to do the copy. You could run multiple
-copyFromLocal or -moveFromLocal commands in parallel to improve speed.
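
For example, something along these lines should work (a rough sketch,
assuming GNU xargs is available; the parallelism of 4 and the target
directory are just placeholders):

  ls *.gz | xargs -P 4 -I {} hadoop fs -copyFromLocal {} /my/HDFS/directory

Each slot runs its own hadoop fs client, so several copies are in flight
at once instead of one file at a time.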


>   2. Gzip is not a desirable compression format because it's not
>   splittable. What's the best way to get these files into a better format?
>   Should I run zcat > bzip before calling copyFromLocal or write a Hadoop
> job?
>

If you have lzo working, I would recommend it. Running mapreduce jobs
against lzo input was measurably quicker in my setup. While bzip2 provides
better compression ratios, it is far too cpu intensive compared to
lzo/gzip. If you keep the data as multiple gzip files, you can still get
some parallelism because each file gets its own mapper, but it will never
be more than one mapper per file. I don't specifically recall whether gzip
or bzip2 came out ahead in my case.
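
If you do want to convert on the way in rather than writing a mapreduce
job for it, you can pipe through the recompressor straight into HDFS
without staging a local copy. A rough sketch (assumes bzip2 is installed
on the client; for lzo you would pipe through lzop instead and then run
the hadoop-lzo indexer over the output so the files become splittable):

  for f in *.gz; do
    # decompress, recompress, and stream into HDFS in one pass
    zcat "$f" | bzip2 | hadoop fs -put - /my/HDFS/directory/"${f%.gz}".bz2
  done

Keep in mind the recompression happens on the client, so it is cpu bound
there; a mapreduce job would spread that cost across the cluster.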

 Sridhar
