hadoop-common-dev mailing list archives

From "Paul Sutter" <sut...@gmail.com>
Subject Re: Redundant (?) lengths in SequenceFile
Date Mon, 26 Jun 2006 23:34:52 GMT
I agree, there's no easy way around this one without separate interfaces
(one where the caller keeps the counts, and one where the writable keeps the
counts), and that would be silly.

However, it still seems to me that the key length in the sequence file is
redundant.  Since each key must either write its own length, know its own
length, or be able to figure it out (even via the high-speed interface),
there's no reason to store that key length in the file.
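
To make the redundancy concrete, here is a rough sketch in Python (not
Hadoop code). It assumes a record laid out as a 4-byte record length, an
optional 4-byte key length, then the key and value bytes, and it assumes a
key that serializes itself with its own 2-byte length prefix. The function
names and the exact layout are illustrative assumptions, not the actual
SequenceFile format.

```python
import struct


def encode_key(text: str) -> bytes:
    """Hypothetical self-describing key: 2-byte big-endian length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def write_record(key: bytes, value: bytes, with_key_length: bool) -> bytes:
    """Serialize one record; the record length covers key + value bytes only."""
    body = key + value
    if with_key_length:
        # Assumed layout with the extra field: record length, key length, key, value.
        return struct.pack(">ii", len(body), len(key)) + body
    # Layout without the extra field: record length, key, value.
    return struct.pack(">i", len(body)) + body


def read_record(buf: bytes, with_key_length: bool):
    """Split one serialized record back into (key bytes, value bytes)."""
    (record_len,) = struct.unpack_from(">i", buf, 0)
    offset = 4
    if with_key_length:
        (key_len,) = struct.unpack_from(">i", buf, offset)
        offset += 4
    else:
        # The key already carries its own length, so the reader can recover it.
        (payload_len,) = struct.unpack_from(">H", buf, offset)
        key_len = 2 + payload_len
    key = buf[offset:offset + key_len]
    value = buf[offset + key_len:offset + record_len]
    return key, value


key = encode_key("apple")
value = b"v1"
with_len = write_record(key, value, with_key_length=True)
without_len = write_record(key, value, with_key_length=False)
```

Both layouts decode to the same key and value; the only difference in this
sketch is the 4 bytes per record spent on the second copy of the key length.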

Why do I care about 4 bytes per record? Because we're integrating an
external sort, and right now it has to handle records that carry two key
lengths. I assume that others (such as Yahoo) will want to incorporate an
external sort as well. And if we're going to be reading the sequence file
from another language, we might as well be sure about the format to use.



On 6/26/06, Doug Cutting <cutting@apache.org> wrote:
> Eric Baldeschwieler wrote:
> > Can we turn this around and assume that writables will be given a stream
> > and a length when they read?  That would also let us remove redundant
> > info...
> Unless I misunderstand, that would make it harder to nest writables,
> since all containers would need to store the length.  Currently only
> top-level containers (SequenceFile and the RPC protocol) need to write
> lengths.  Even these are optional, used only to optimize things.
> Doug
