hadoop-common-dev mailing list archives

From "Bryan A. P. Pendleton" ...@geekdom.net>
Subject Re: SequenceFile "pointers"
Date Sun, 04 Feb 2007 07:55:58 GMT
Seems like an interesting idea. It would be good to re-use the MapFile code,
though since SequenceFiles aren't *required* to be sorted, some refactoring
would likely be needed.

A default implementation should still try to avoid network I/O - making
job splits that stick to as few blocks as possible, and preferably
locally-available ones.
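
To make the block-aligned idea concrete, here is a minimal sketch, not taken from the Hadoop code. It assumes the record start offsets (the "pointers" under discussion) are available as a sorted array and that the block size is known; the class and method names are hypothetical. It only picks split boundaries; deciding which hosts hold each block is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: choose split starts so each split begins in a new block. */
class BlockAlignedSplits {

    /**
     * For each block that contains at least one record start, returns the
     * first record start offset inside that block. Offsets must be sorted
     * ascending. A split then runs from one returned offset to the next.
     */
    static List<Long> splitStarts(long[] sortedRecordOffsets, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Long> starts = new ArrayList<>();
        long nextBoundary = 0; // smallest offset allowed for the next split start
        for (long offset : sortedRecordOffsets) {
            if (offset >= nextBoundary) {
                starts.add(offset);
                // Skip ahead to the start of the block after this one.
                nextBoundary = (offset / blockSize + 1) * blockSize;
            }
        }
        return starts;
    }
}
```

With record offsets 0, 40, 90, 130, 210, 260 and a block size of 100, this returns [0, 130, 210]. A record that straddles a block boundary still causes one read into the next block, so this reduces cross-block reads rather than eliminating them.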

Speaking of which - if your outputs are sorted, just write MapFiles, and
skip some of the work you're planning here.

And, finally, it's likely that, rather than an exact offset for each
key/value pair in the SequenceFile, some sort of "every N" index (one
offset per N pairs) would suffice. This keeps the index from getting
especially huge when there are lots of small keys or values, and the splits
are probably going to end up divvying large numbers of key/values to each
job anyway.
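
A rough sketch of what such an "every N" index could look like follows. It is an illustration only, not the MapFile implementation; the class name, method names, and the in-memory list are all assumptions made for the example. The writer reports each record's start offset, and only every N-th one is kept.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sparse index: remembers the start offset of every N-th record. */
class SparseOffsetIndex {
    private final int interval;                 // N: keep one offset per N records
    private final List<Long> offsets = new ArrayList<>();
    private long recordsSeen = 0;

    SparseOffsetIndex(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    /** Call once per record, in order, with the byte offset where it starts. */
    void recordWritten(long startOffset) {
        if (recordsSeen % interval == 0) {
            offsets.add(startOffset);
        }
        recordsSeen++;
    }

    /**
     * Returns the offset of the nearest indexed record at or before the given
     * 0-based record number. A reader seeks there and scans forward.
     */
    long seekPointFor(long recordNumber) {
        if (recordNumber < 0 || recordNumber >= recordsSeen) {
            throw new IndexOutOfBoundsException("record " + recordNumber);
        }
        return offsets.get((int) (recordNumber / interval));
    }

    /** Number of offsets actually stored. */
    int size() {
        return offsets.size();
    }
}
```

For ten records starting at offsets 0, 100, ..., 900 and N = 3, the index stores four offsets (0, 300, 600, 900), and looking up record 7 gives 600, after which the reader skips one record to reach it. The index shrinks by roughly a factor of N at the cost of scanning at most N - 1 extra records per lookup.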

On 2/3/07, Ion Badita <ion.badita@mcr.ro> wrote:
> Apparently the attachment is lost, so I put the image explaining the
> pointers at this URL
> http://www.e-mistic.org/SeqencePointers.png
> John

Bryan A. P. Pendleton
Ph: (877) geek-1-bp
