hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Redundant (?) lengths in SequenceFile
Date Tue, 27 Jun 2006 03:06:53 GMT
Yikes, but this assumes that your external code can grok whatever
strange thing is stored in the writable.  This is a non-trivial
assumption when you go multi-lingual.

We're going to take a crack at a block compressed format soon.  This  
will vastly reduce the storage impact of this kind of issue anyway.


On Jun 26, 2006, at 4:34 PM, Paul Sutter wrote:

> I agree, there's no easy way around this one without separate
> interfaces (one where the caller keeps the counts, and one where the
> writable keeps the counts), and that would be silly.
>
> However -> It still seems to me that the key length in the sequence
> file is redundant.  Since each key must write its own length, know
> its own length, or be able to figure it out - even via the high speed
> interface - there's no reason to have that key length in the file.
>
> Why do I care about 4 bytes per record? Because we're integrating an
> external sort, and right now it has to look at a record with two key
> lengths. And I assume that others (such as Yahoo) will want to
> incorporate an external sort. And if we're going to be reading the
> sequence file in another language, we might as well be sure about the
> format to use.
>
> Thanks!
>
> Paul
>
> On 6/26/06, Doug Cutting <cutting@apache.org> wrote:
>>
>> Eric Baldeschwieler wrote:
>> > Can we turn this around and assume that writables will be given a
>> > stream and a length when they read?  That would also let us remove
>> > redundant info...
>>
>> Unless I misunderstand, that would make it harder to nest writables,
>> since all containers would need to store the length.  Currently only
>> top-level containers (SequenceFile and the RPC protocol) need to write
>> lengths.  Even these are optional, used only to optimize things.
>>
>> Doug
>>
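
For readers following the thread, the "two key lengths" Paul mentions can be
illustrated with a rough sketch. It is not Hadoop code. It assumes a record
layout of [record length][key length][key bytes][value bytes] and a
hypothetical key type that prefixes its own bytes with a 4-byte length. The
class and method names (RecordLayoutSketch, serializeKey, writeRecord,
readKeyLengths) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: not the Hadoop SequenceFile implementation.
class RecordLayoutSketch {

    // Hypothetical self-describing key: 4-byte length prefix, then UTF-8 bytes.
    static byte[] serializeKey(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length)
                .putInt(utf8.length)
                .put(utf8)
                .array();
    }

    // Assumed container layout:
    // [record length][key length][key bytes][value bytes]
    static byte[] writeRecord(byte[] keyBytes, byte[] valueBytes) {
        return ByteBuffer.allocate(8 + keyBytes.length + valueBytes.length)
                .putInt(keyBytes.length + valueBytes.length) // record length
                .putInt(keyBytes.length)                     // container's key length
                .put(keyBytes)                               // key (with its own prefix)
                .put(valueBytes)
                .array();
    }

    // Returns { key length stored by the container, length stored by the key itself }.
    static int[] readKeyLengths(byte[] record) {
        ByteBuffer in = ByteBuffer.wrap(record);
        in.getInt();                            // skip record length
        int containerKeyLength = in.getInt();   // written by the container
        int selfDescribedLength = in.getInt();  // first field of the key itself
        return new int[] { containerKeyLength, selfDescribedLength };
    }
}
```

Under these assumptions, a reader such as an external sort sees one length
written by the container and a second written by the key. For the key "abc"
the container records 7 (the 4-byte prefix plus 3 bytes of text) while the
key's own prefix records 3. As Eric notes, relying on the inner length only
works if the reader understands how that particular key type serializes
itself.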

