hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: large files vs many files
Date Sat, 09 May 2009 07:37:51 GMT
You must create unique file names; I don't believe (though I do not know for
certain) that the append code will allow multiple writers.

Are you writing from within a task, or as an external application writing
into Hadoop?

You may try using UUID,
http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of your
filename.
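For example (a minimal sketch only; the directory comes from your error
message and the file-name prefix is made up), each writer creates its own
file so no two clients ever hold a lease on the same path:

import java.io.OutputStream;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One file per writer: the UUID makes the path unique, so there is
        // never more than one lease holder for a given file.
        Path out = new Path("/foo/bar/events-" + UUID.randomUUID() + ".dat");

        OutputStream stream = fs.create(out);
        try {
            stream.write("one record from this stream\n".getBytes());
        } finally {
            stream.close();
        }
    }
}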
Without knowing more about your goals, environment and constraints it is
hard to offer any more detailed suggestions.
You could also have an application aggregate the streams and write out
chunks, with one or more writers, one per output file.
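A rough sketch of what that aggregator could look like (the class and method
names here are hypothetical, not an existing Hadoop API): receiving threads
hand records to a queue, and a single thread owns the open file, so HDFS
never sees more than one writer per path.

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical aggregator: many producers enqueue records, one thread
// drains the queue into the single open output stream.
public class StreamAggregator implements Runnable {
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<byte[]>();
    private final FSDataOutputStream out;

    public StreamAggregator(FileSystem fs, Path path) throws IOException {
        this.out = fs.create(path);
    }

    // Called by any number of receiving threads.
    public void submit(byte[] record) throws InterruptedException {
        queue.put(record);
    }

    // The single writer loop; real code would also roll to a new file
    // once the current chunk is large enough, and close the stream.
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                out.write(queue.take());
            }
        } catch (Exception e) {
            // log and close the stream here
        }
    }
}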


On Sat, May 9, 2009 at 12:15 AM, Sasha Dolgy <sdolgy@gmail.com> wrote:

> Yes, that is the problem.  Two, or hundreds of them... the data streams in
> very quickly.
>
> On Fri, May 8, 2009 at 8:42 PM, jason hadoop <jason.hadoop@gmail.com>
> wrote:
>
> > Is it possible that two tasks are trying to write to the same file path?
> >
> >
> > On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy <sdolgy@gmail.com> wrote:
> >
> > > Hi Tom (or anyone else),
> > > Will SequenceFile allow me to avoid problems with concurrent writes to
> > > the file?  I still continue to get the following exceptions/errors in
> > > HDFS:
> > >
> > > org.apache.hadoop.ipc.RemoteException:
> > > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> > > failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for
> > > DFSClient_-1821265528 on client 127.0.0.1 because current
> > > leaseholder is trying to recreate file.
> > >
> > > It only happens when two processes are trying to write at the same
> > > time.  Now, ideally I don't want to buffer the data that's coming in; I
> > > want to get it out and into the file as soon as possible to avoid any
> > > data loss... am I missing something here?  Is there some sort of
> > > factory I can implement to help in writing a lot of simultaneous data
> > > streams?
> > >
> > > thanks in advance for any suggestions
> > > -sasha
> > >
> > > On Wed, May 6, 2009 at 9:40 AM, Tom White <tom@cloudera.com> wrote:
> > >
> > > > Hi Sasha,
> > > >
> > > > As you say, HDFS appends are not yet working reliably enough to be
> > > > suitable for production use. On the other hand, having lots of little
> > > > files is bad for the namenode, and inefficient for MapReduce (see
> > > > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/),
> > > > so it's best to avoid this too.
> > > >
> > > > I would recommend using SequenceFile as a storage container for lots
> > > > of small pieces of data. Each key-value pair would represent one of
> > > > your little files (you can have a null key, if you only need to store
> > > > the contents of the file). You can also enable compression (use block
> > > > compression), and SequenceFiles are designed to work well with
> > > > MapReduce.
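
(Inline note: a minimal sketch of what Tom describes, written against the
0.18/0.20-era SequenceFile API; the path, key class and value class below
are just examples, not anything from Tom's mail.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class SmallRecordWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One container file holds many small records; block compression
        // groups several records together before compressing them.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/foo/bar/container.seq"),
                NullWritable.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            // Null key, value holds the contents of one "little file".
            writer.append(NullWritable.get(),
                    new BytesWritable("contents of one small file".getBytes()));
        } finally {
            writer.close();
        }
    }
}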
> > > >
> > > > Cheers,
> > > >
> > > > Tom
> > > >
> > > > On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.dolgy@gmail.com>
> > > > wrote:
> > > > > hi there,
> > > > > I'm working through a concept at the moment and was attempting to
> > > > > write lots of data to a few files, as opposed to writing lots of
> > > > > data to lots of little files.  What are the thoughts on this?
> > > > >
> > > > > When I try and implement outputStream = hdfs.append(path); there
> > > > > doesn't seem to be any locking mechanism in place ... or there is
> > > > > and it doesn't work well enough for many writes per second?
> > > > >
> > > > > I have read and seen that the property "dfs.support.append" is not
> > > > > meant for production use.  Still, if millions of little files are
> > > > > as good as -- or better than, or no different from -- a few massive
> > > > > files, then I suppose append isn't something I really need.
> > > > >
> > > > > I do see a lot of stack traces with messages like:
> > > > >
> > > > > org.apache.hadoop.ipc.RemoteException:
> > > > > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> > > > > failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for
> > > > > DFSClient_-1821265528 on client 127.0.0.1 because current
> > > > > leaseholder is trying to recreate file.
> > > > >
> > > > > I hope this makes sense.  Still a little bit confused.
> > > > >
> > > > > thanks in advance
> > > > > -sd
> > > > >
> > > > > --
> > > > > Sasha Dolgy
> > > > > sasha.dolgy@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Alpha Chapters of my book on Hadoop are available
> > http://www.apress.com/book/view/9781430219422
> > www.prohadoopbook.com a community for Hadoop Professionals
> >
>
>
>
> --
> Sasha Dolgy
> sasha.dolgy@gmail.com
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
