hadoop-common-user mailing list archives

From Mark Kerzner <markkerz...@gmail.com>
Subject Re: best way to copy all files from a file system to hdfs
Date Mon, 02 Feb 2009 15:46:21 GMT
No, no reason for a single file - just a little simpler to think about. By
the way, can multiple MapReduce workers read the same SequenceFile
simultaneously?
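
For concreteness, here is the kind of read-side job I am picturing (an
untested sketch against the 0.19 mapred API; the paths and job name are
made up). It is a map-only identity job, so the parallelism question is
really whether one SequenceFile can be handed to several map tasks as
separate splits:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Untested sketch: a map-only identity job over the packed SequenceFiles.
// Each map task gets one input split, so if a SequenceFile can be split,
// several workers would be reading the same file at once.
public class ReadPacked {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReadPacked.class);
    conf.setJobName("read-packed");                  // made-up job name
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);              // key = original file name
    conf.setOutputValueClass(BytesWritable.class);   // value = file contents
    conf.setNumReduceTasks(0);                       // map-only; the default
                                                     // mapper is the identity
    FileInputFormat.setInputPaths(conf, new Path("/user/mark/packed"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/mark/unpacked"));
    JobClient.runJob(conf);
  }
}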

On Mon, Feb 2, 2009 at 9:42 AM, Tom White <tom@cloudera.com> wrote:

> Is there any reason why it has to be a single SequenceFile? You could
> write a local program to write several block compressed SequenceFiles
> in parallel (to HDFS), each containing a portion of the files on your
> PC.
>
> Tom
>
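
Just to check that I follow, the local program would be something along
these lines? Untested sketch against the 0.19 API; the source directory,
thread count, and output paths are placeholders:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Untested sketch: each thread packs a slice of the local files into its
// own block-compressed SequenceFile on HDFS, keyed by file name.
public class ParallelPacker {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final File[] files = new File("/data/to/copy").listFiles(); // placeholder
    final int slices = 4;                                       // placeholder
    Thread[] threads = new Thread[slices];
    for (int i = 0; i < slices; i++) {
      final int slice = i;
      threads[i] = new Thread(new Runnable() {
        public void run() {
          try {
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/user/mark/packed/part-" + slice),
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
            for (int j = slice; j < files.length; j += slices) {
              byte[] buf = new byte[(int) files[j].length()];
              FileInputStream in = new FileInputStream(files[j]);
              try {
                int off = 0;                      // read the whole file
                while (off < buf.length) {
                  int n = in.read(buf, off, buf.length - off);
                  if (n < 0) break;
                  off += n;
                }
              } finally {
                in.close();
              }
              writer.append(new Text(files[j].getName()),
                            new BytesWritable(buf));
            }
            writer.close();
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
  }
}

My understanding is that SequenceFiles carry sync markers, which is what
I am hoping makes the parallel reads possible.
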
> On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner <markkerzner@gmail.com>
> wrote:
> > Truly, I do not see any advantage to doing this, as opposed to writing
> > (Java) code that copies the files to HDFS, because then tarring becomes
> > my bottleneck, unless I write code to measure the file sizes and prepare
> > pointers for multiple tarring tasks. That gets pretty complex, though,
> > and I wanted something simple. I might as well accept that copying one
> > hard drive to HDFS is not going to be parallelized.
> > Mark
> >
> > On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
> > <flip@infochimps.org>wrote:
> >
> >> Could you tar.bz2 them up (setting up the tar so that it makes a few
> >> dozen files), toss them onto HDFS, and use
> >> http://stuartsierra.com/2008/04/24/a-million-little-files
> >> to convert them into a SequenceFile?
> >>
> >> This lets you preserve the originals and do the SequenceFile conversion
> >> across the cluster. It's only really helpful, of course, if you also
> >> want to prepare a .tar.bz2 so you can clear out the sprawl.
> >>
> >> flip
> >>
> >> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerzner@gmail.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I am writing an application to copy all the files from a regular PC
> >> > to a SequenceFile. I can surely do this by simply recursing through
> >> > all the directories on my PC, but I wonder if there is any way to
> >> > parallelize this, perhaps even as a MapReduce task. Tom White's book
> >> > seems to imply that it will have to be a custom application.
> >> >
> >> > Thank you,
> >> > Mark
> >> >
> >>
> >>
> >>
> >> --
> >> http://www.infochimps.org
> >> Connected Open Free Data
> >>
> >
>
