hadoop-common-dev mailing list archives

From Ion Badita <ion.bad...@mcr.ro>
Subject SequenceFile "pointers"
Date Sun, 04 Feb 2007 07:39:34 GMT

I looked at SequenceFile, especially the sync part, and I have an 
idea: replace the sync hashes with a "pointers file" of 64-bit entries, 
each pointing to the beginning of a record in the sequence file.
I made a drawing with the explanation. The pointers could be kept in a 
separate file or appended to the end of the sequence file when the file 
is closed. One gain from using pointers would be in map/reduce input 
splitting: if we need to split a sequence file in 3, we could use this 
"formula": ((pointersFileSize / 8) / 3) * 8 = byte offset, within the 
pointers file, of the entry that gives us the 64-bit offset of the 
actual record in the sequence file where the next split begins.
With a pointers file we can also very easily find the number of records 
in a sequence file: because the pointers are fixed-size 64-bit (8-byte) 
entries, we can compute it as pointersFileSize / 8 = record count. I 
think the count is useful; I personally need it.
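
To make the arithmetic concrete, here is a minimal sketch in Java. It is 
only an illustration of the idea, not existing Hadoop code: the class 
PointerIndexSketch and its methods are hypothetical names, and it assumes 
the pointers file holds nothing but big-endian 64-bit offsets (as 
DataOutputStream.writeLong would produce), one per record. The split 
method generalizes the formula above to split k of n; for k = 1 and 
n = 3 it gives the same value.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper for illustration; not part of Hadoop's SequenceFile API.
class PointerIndexSketch {
    // Assumption: every entry in the pointers file is one 64-bit (8-byte) offset.
    static final int ENTRY_BYTES = 8;

    // Number of records = pointersFileSize / 8.
    static long recordCount(long pointersFileSize) {
        return pointersFileSize / ENTRY_BYTES;
    }

    // Byte offset, inside the pointers file, of the entry for the first
    // record of split `splitIndex` (0-based) out of `numSplits`.
    static long splitEntryOffset(long pointersFileSize, int numSplits, int splitIndex) {
        long firstRecord = recordCount(pointersFileSize) * splitIndex / numSplits;
        return firstRecord * ENTRY_BYTES;
    }

    // Reads the 64-bit record offset stored at `entryOffset` in the pointers file.
    // The returned value is where to seek in the sequence file itself.
    static long readRecordOffset(RandomAccessFile pointers, long entryOffset)
            throws IOException {
        pointers.seek(entryOffset);
        return pointers.readLong(); // big-endian, matching DataOutput.writeLong
    }

    public static void main(String[] args) {
        long pointersFileSize = 30L * ENTRY_BYTES; // pretend the file indexes 30 records
        System.out.println("records: " + recordCount(pointersFileSize));
        for (int k = 0; k < 3; k++) {
            System.out.println("split " + k + " starts at pointer entry offset "
                    + splitEntryOffset(pointersFileSize, 3, k));
        }
    }
}
```

A reader for split k would call splitEntryOffset, pass the result to 
readRecordOffset, and then seek to the returned position in the sequence 
file, so no scanning for sync markers is needed.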

What do you think?

