hadoop-common-dev mailing list archives

From "Srikanth Kakani (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Mon, 28 Apr 2008 19:17:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592878#action_12592878 ]

Srikanth Kakani commented on HADOOP-3315:

> Doug Cutting - 28/Apr/08 09:27 AM
> Do you expect to make this a drop-in replacement for SequenceFile and MapFile, or rather
> something that we expect code to migrate to? I'm guessing the latter.

I think it will be the latter as well.

> So we should have a magic number there too, but, if it doesn't harm things, I'd prefer
> leaving an (frequently unread) magic number in the header too.

When will the header magic be read? If it is always read, wouldn't that result in two seeks anyway?
If not, why do we have to complicate the format?

> Jim Kellerman - 26/Apr/08 10:19 AM
> Dropping the record length would seriously slow down random reads unless the index is
> 'complete', i.e., every key/offset is represented. If the index is sparse like MapFile's,
> you would only get an approximate location of the desired record and then have to do a lot
> of work to seek forward to the desired one.

Each block in this file would be loadable into memory, so it doesn't really matter (much) whether
we store the key length or not, as the total operation is bounded by the seek and the read. Even
going through the index with variable-sized keys is linear. Maybe bounding the index to one read
block makes sense as well.
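
A rough sketch of the lookup path discussed above, not the format proposed in HADOOP-3315: a sparse index picks one block, the block is read into memory, and the key is found by a linear scan. The record layout (4-byte key and value lengths), the sorted-key assumption, and the names `encode_block`, `scan_block`, and `lookup` are all hypothetical, chosen only for illustration.

```python
import bisect
import struct

def encode_block(records):
    """Serialize sorted (key, value) byte pairs as length-prefixed records."""
    out = bytearray()
    for key, value in records:
        out += struct.pack(">II", len(key), len(value)) + key + value
    return bytes(out)

def scan_block(block, wanted):
    """Linearly scan one in-memory block; return the value or None."""
    pos = 0
    while pos < len(block):
        klen, vlen = struct.unpack_from(">II", block, pos)
        pos += 8
        key = block[pos:pos + klen]
        pos += klen
        if key == wanted:
            return block[pos:pos + vlen]
        if key > wanted:
            # Keys are assumed sorted, so the key cannot appear later.
            return None
        pos += vlen
    return None

def lookup(first_keys, blocks, wanted):
    """Sparse index: first_keys[i] is the first key stored in blocks[i]."""
    i = bisect.bisect_right(first_keys, wanted) - 1
    if i < 0:
        return None
    # In a real file this step would be one seek plus one block read.
    return scan_block(blocks[i], wanted)
```

The scan cost is bounded by the block size, which is the sense in which the operation is dominated by the seek and the read rather than by the in-memory search.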

The only case where this changes is if we have some metadata saying the keys are fixed size,
in which case all seek-to-key operations are O(1).
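
As an illustration of the fixed-size case, here is a minimal sketch assuming 8-byte keys and 8-byte offsets (both widths, and the names `build_index`, `entry_at`, and `find`, are hypothetical). With fixed-width entries, positioning at the i-th key is plain arithmetic, so it takes constant time; finding a given key can then use binary search over the packed bytes without decoding earlier entries.

```python
import struct

# Hypothetical fixed-width index entry: 8-byte key, 8-byte file offset.
ENTRY = struct.Struct(">8sQ")

def build_index(entries):
    """Pack sorted (key, offset) pairs into one contiguous byte string."""
    return b"".join(ENTRY.pack(key, offset) for key, offset in entries)

def entry_at(index, i):
    """Constant-time access: entry i starts at byte i * ENTRY.size."""
    return ENTRY.unpack_from(index, i * ENTRY.size)

def find(index, wanted):
    """Binary search over fixed-width entries; return the offset or None."""
    count = len(index) // ENTRY.size
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        key, _ = entry_at(index, mid)
        if key < wanted:
            lo = mid + 1
        else:
            hi = mid
    if lo < count:
        key, offset = entry_at(index, lo)
        if key == wanted:
            return offset
    return None
```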

One more thought: an index based purely on record IDs (fixed-size encoded) might keep
the index skippable/seekable.
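
One possible reading of the record-ID idea, sketched under assumptions that are not in the issue: record IDs are consecutive integers starting at 0, the index holds one fixed-size 8-byte offset per `STRIDE` records, and each record carries a 4-byte length prefix so a reader can skip forward. `STRIDE`, `build`, and `read_record` are hypothetical names.

```python
import struct

OFFSET = struct.Struct(">Q")   # hypothetical fixed-size index entry
LENGTH = struct.Struct(">I")   # hypothetical per-record length prefix
STRIDE = 4                     # hypothetical: one index entry per 4 records

def build(records):
    """Return (data, index) for a list of byte-string records."""
    data, index = bytearray(), bytearray()
    for rid, payload in enumerate(records):
        if rid % STRIDE == 0:
            index += OFFSET.pack(len(data))
        data += LENGTH.pack(len(payload)) + payload
    return bytes(data), bytes(index)

def read_record(data, index, rid):
    """Jump to the index slot for rid, then skip at most STRIDE - 1 records."""
    slot = rid // STRIDE
    if rid < 0 or slot * OFFSET.size >= len(index):
        return None
    (pos,) = OFFSET.unpack_from(index, slot * OFFSET.size)
    for _ in range(rid % STRIDE):
        if pos >= len(data):
            return None
        (n,) = LENGTH.unpack_from(data, pos)
        pos += LENGTH.size + n
    if pos >= len(data):
        return None
    (n,) = LENGTH.unpack_from(data, pos)
    start = pos + LENGTH.size
    return data[start:start + n]
```

Because every index entry has the same width, the slot for a record ID is found by division alone, and the forward skip is bounded by the stride.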

> New binary file format
> ----------------------
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. It would be good to have a file format that only needs 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
