hadoop-common-user mailing list archives

From Sasha Dolgy <sdo...@gmail.com>
Subject Re: large files vs many files
Date Wed, 06 May 2009 09:03:53 GMT
Hi Tom,
Thanks for this.  I'll follow that up and see how I get on.  At issue is the
frequency of the data I have streaming in: even if I create a new file with
a name based on milliseconds, I'm still running into the same problems.  My
thought is that although append isn't production ready, it also isn't the
root of my problems.

cheers
-sd

On Wed, May 6, 2009 at 9:40 AM, Tom White <tom@cloudera.com> wrote:

> Hi Sasha,
>
> As you say, HDFS appends are not yet working reliably enough to be
> suitable for production use. On the other hand, having lots of little
> files is bad for the namenode, and inefficient for MapReduce (see
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
> it's best to avoid this too.
>
> I would recommend using SequenceFile as a storage container for lots
> of small pieces of data. Each key-value pair would represent one of
> your little files (you can have a null key, if you only need to store
> the contents of the file). You can also enable compression (use block
> compression), and SequenceFiles are designed to work well with
> MapReduce.
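>
> As a minimal sketch (untested; the output path, key name, and record
> contents are placeholders), writing small records into a block-compressed
> SequenceFile looks roughly like this:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.BytesWritable;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.hadoop.io.Text;
>
>   Configuration conf = new Configuration();
>   FileSystem fs = FileSystem.get(conf);
>
>   // Block compression batches many records per compression block,
>   // which suits lots of small values.
>   SequenceFile.Writer writer = SequenceFile.createWriter(
>       fs, conf, new Path("/foo/packed.seq"),
>       Text.class, BytesWritable.class,
>       SequenceFile.CompressionType.BLOCK);
>   try {
>     // One key-value pair per "little file": key = original file name,
>     // value = its raw contents.
>     byte[] contents = "example record".getBytes("UTF-8");
>     writer.append(new Text("record-0001"), new BytesWritable(contents));
>   } finally {
>     writer.close();
>   }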
>
> Cheers,
>
> Tom
>
> On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.dolgy@gmail.com> wrote:
> > Hi there,
> > I'm working through a concept at the moment and was attempting to write
> > lots of data to a few files, as opposed to writing lots of data to lots
> > of little files.  What are the thoughts on this?
> >
> > When I try to implement outputStream = hdfs.append(path); there doesn't
> > seem to be any locking mechanism in place ... or there is, and it doesn't
> > work well enough for many writes per second?
> >
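> > Roughly what I'm doing is this (the path is illustrative; error
> > handling omitted):
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.fs.FSDataOutputStream;
> >   import org.apache.hadoop.fs.FileSystem;
> >   import org.apache.hadoop.fs.Path;
> >
> >   Configuration conf = new Configuration();
> >   FileSystem hdfs = FileSystem.get(conf);
> >   Path path = new Path("/foo/bar/aaa.bbb.ccc.ddd.xxx");
> >
> >   // HDFS allows a single lease holder (writer) per file, so a second
> >   // concurrent append to the same path is rejected by the namenode.
> >   FSDataOutputStream outputStream = hdfs.append(path);
> >   outputStream.write("event data".getBytes("UTF-8"));
> >   outputStream.close();
> >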
> > I have read and seen that the property "dfs.support.append" is not meant
> > for production use.  Still, if millions of little files are as good as or
> > better than -- or no different from -- a few massive files, then I
> > suppose append isn't something I really need.
> >
> > I do see a lot of stack traces with messages like:
> >
> > org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
> > client 127.0.0.1 because current leaseholder is trying to recreate file.
> >
> > I hope this makes sense.  I'm still a little bit confused.
> >
> > Thanks in advance
> > -sd
> >
> > --
> > Sasha Dolgy
> > sasha.dolgy@gmail.com
>
