hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)
Date Tue, 03 Feb 2009 03:57:21 GMT
If you have to do a time-based solution for now, simply close the file and
stage it, then open a new file.
Your reads will have to deal with the fact that the file is now in multiple parts.
Warning: datanodes get pokey if they have large numbers of blocks, and the
quickest way to get there is to create a lot of small files.
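A rough sketch of what I mean, against the 0.17/0.18 API (the class name, the
part-file naming scheme, and the LongWritable/Text key/value types are just
placeholders for whatever you actually write):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceWriter {
  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private SequenceFile.Writer writer;

  public RollingSequenceWriter(FileSystem fs, Configuration conf, Path dir)
      throws IOException {
    this.fs = fs;
    this.conf = conf;
    this.dir = dir;
    this.writer = openNew();
  }

  private SequenceFile.Writer openNew() throws IOException {
    // One part file per interval; readers treat the directory as the input path.
    Path part = new Path(dir, "part-" + System.currentTimeMillis());
    return SequenceFile.createWriter(fs, conf, part,
        LongWritable.class, Text.class);
  }

  public synchronized void append(LongWritable key, Text value) throws IOException {
    writer.append(key, value);
  }

  // Call this from a timer: closing the current writer is what makes its
  // data visible to HDFS readers; then start a fresh part file.
  public synchronized void roll() throws IOException {
    writer.close();
    writer = openNew();
  }

  public synchronized void close() throws IOException {
    writer.close();
  }
}

Drive roll() from a java.util.Timer (or whatever scheduler you already have)
every 15 minutes; each close makes that part's records readable, and your
mapred jobs can just take the directory as their input path.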

On Mon, Feb 2, 2009 at 9:54 AM, Brian Long <brian@dotspots.com> wrote:

> Let me rephrase this problem... as stated below, when I start writing to a
> SequenceFile from an HDFS client, nothing is visible in HDFS until I've
> written 64M of data. This presents three problems: fsck reports the file
> system as corrupt until the first block is finally written out, the presence
> of the file (without any data) seems to blow up my mapred jobs that try to
> make use of it under my input path, and finally, I want to basically flush
> every 15 minutes or so, so I can mapred the latest data.
>
> I don't see any programmatic way to force the file to flush in 17.2.
> Additionally, "dfs.checkpoint.period" does not seem to be obeyed. Does that
> not do what I think it does? What controls the 64M limit, anyway? Is it
> "dfs.checkpoint.size" or "dfs.block.size"? Is the buffering happening on the
> client, or on the data nodes? Or in the namenode?
>
> It seems really bad that a SequenceFile, upon creation, is in an unusable
> state from the perspective of a mapred job, and also leaves fsck in a
> corrupt state. Surely I must be doing something wrong... but what? How can I
> ensure that a SequenceFile is immediately usable (but empty) on creation,
> and how can I make things flush on some regular time interval?
>
> Thanks,
> Brian
> On Thu, Jan 29, 2009 at 4:17 PM, Brian Long <brian@dotspots.com> wrote:
> > I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
> > and write to using append(key, value). Because the writer volume is low,
> > it's not uncommon for it to take over a day for my appends to finally be
> > flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day).
> > Because I am running map/reduce tasks on this data multiple times a day, I
> > want to "flush" the sequence file so the mapred jobs can pick it up when
> > they run.
> >
> > What's the right way to do this? I'm assuming it's a fairly common use
> > case. Also -- are writes to the sequence files atomic? (e.g. if I am
> > actively appending to a sequence file, is it always safe to read from that
> > same file in a mapred job?)
> >
> > To be clear, I want the flushing to be time based (controlled explicitly by
> > the app), not size based. Will this create waste in HDFS somehow?
> >
> > Thanks,
> > Brian
> >
> >
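To the config questions quoted above, as far as I know: dfs.checkpoint.period
and dfs.checkpoint.size only control how often the secondary namenode
checkpoints the namespace image, so they have nothing to do with when your
data becomes readable. The 64M boundary comes from dfs.block.size, and the
buffering is on the client side -- the DFS client stages roughly a block's
worth of data before shipping it to the datanodes, and readers only see it
once a block is complete or the file is closed. A rough, untested sketch of
shrinking that window by lowering the block size for files you create (the
path argument and the 8M figure are arbitrary examples, and remember the
warning above: smaller blocks mean more blocks per datanode):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallBlockWriterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // dfs.block.size is what sets the 64M boundary; the checkpoint settings
    // only affect the secondary namenode, not data visibility.
    System.out.println("configured block size = "
        + conf.getLong("dfs.block.size", 64 * 1024 * 1024));

    // A smaller block size should make partial data show up sooner, at the
    // cost of more blocks per file -- see the block-count warning above.
    conf.setLong("dfs.block.size", 8 * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), LongWritable.class, Text.class);
    writer.append(new LongWritable(1L), new Text("example record"));
    writer.close();  // data is only guaranteed visible to readers after close()
  }
}

Shrinking the block size only narrows the window, though; in 0.17 the
close-and-roll approach above is still the only way to make data readable on
your own schedule.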
