hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Mon, 28 Apr 2008 16:27:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592839#action_12592839 ]

Doug Cutting commented on HADOOP-3315:
--------------------------------------

> Owen: I can't see any cases where we need the key length

I think we use it when sorting.  We read raw keys into memory and call raw comparators on
them, passing the key length to the raw comparator.
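
To illustrate why the key length matters here, the following is a minimal sketch of a raw comparison over serialized keys held in one in-memory buffer. It assumes an unsigned, byte-wise lexicographic ordering; the class name RawKeyCompareSketch and the method compareRaw are hypothetical and only mirror the (buffer, offset, length) shape of a raw comparator, not Hadoop's actual sorting code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compares two serialized keys without deserializing them.
final class RawKeyCompareSketch {

    // b1/b2 are the buffers, s1/s2 the key start offsets, l1/l2 the key lengths.
    // Without the lengths there is no way to tell where each key ends.
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff; // treat bytes as unsigned
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // One key is a prefix of the other: the shorter key sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        // Two keys packed back to back: "apple" at offset 0, "banana" at offset 5.
        byte[] buf = "applebanana".getBytes(StandardCharsets.US_ASCII);
        int result = compareRaw(buf, 0, 5, buf, 5, 6);
        System.out.println("apple sorts before banana: " + (result < 0));
    }
}
```

Because the keys sit back to back in the buffer, the comparison only works if the caller supplies each key's length, which is the information in question above.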

The sorting code pre-dates MapReduce, though, from back when we manually partitioned, shuffled,
sorted, merged, etc. things in Nutch.  Perhaps we no longer need to sort and merge files directly,
since folks can use MapReduce for that.  Does any application code still use SequenceFile#Sorter?

> Owen: my desire is to make applications not read the header and only read the tail

So we should have a magic number there too, but, if it doesn't harm things, I'd prefer leaving
a (frequently unread) magic number in the header too.
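
A minimal sketch of the header-plus-tail idea, assuming a made-up 4-byte magic value and an in-memory byte array standing in for the file. The names MagicSketch, MAGIC, hasHeaderMagic and hasTailMagic are hypothetical; nothing here reflects the layout eventually chosen for HADOOP-3315.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch: the same magic number is written at both ends of the file.
final class MagicSketch {

    // Assumed magic value, for illustration only.
    static final byte[] MAGIC = {'H', 'B', 'F', '1'};

    // Lays out: MAGIC | body | MAGIC
    static byte[] write(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);
        out.write(body, 0, body.length);
        out.write(MAGIC, 0, MAGIC.length);
        return out.toByteArray();
    }

    // A reader that only looks at the tail checks the last MAGIC.length bytes.
    static boolean hasTailMagic(byte[] file) {
        int n = MAGIC.length;
        return file.length >= n
            && Arrays.equals(Arrays.copyOfRange(file, file.length - n, file.length), MAGIC);
    }

    // The header copy is optional for readers, but lets tools identify the file
    // from its first bytes.
    static boolean hasHeaderMagic(byte[] file) {
        int n = MAGIC.length;
        return file.length >= n
            && Arrays.equals(Arrays.copyOfRange(file, 0, n), MAGIC);
    }

    public static void main(String[] args) {
        byte[] file = write(new byte[] {1, 2, 3});
        System.out.println("tail ok: " + hasTailMagic(file));
        System.out.println("header ok: " + hasHeaderMagic(file));
    }
}
```

A reader that seeks straight to the tail never touches the header copy, so under these assumptions the extra bytes cost only a few bytes of space.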

Do you expect to make this a drop-in replacement for SequenceFile and MapFile, or rather something
that we expect code to migrate to?  I'm guessing the latter.

> Alejandro: Our use case is specifically the example he mentions, the record count.

I think we can include record-count as a base feature of this file format.  But permitting
a Map<String,String> of other metadata might also be good.
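
One way such a metadata block could be serialized, as a rough sketch: a fixed record count followed by a count-prefixed list of string pairs. The class MetaSketch, its method names, the field order and the sample key "codec" are all assumptions made for illustration, not a proposed on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: record count as a base field, plus free-form string metadata.
final class MetaSketch {

    // Layout (assumed): long recordCount | int entryCount | (UTF key, UTF value)*
    static byte[] encode(long recordCount, Map<String, String> meta) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(recordCount);
            out.writeInt(meta.size());
            for (Map.Entry<String, String> e : meta.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // The record count sits at a fixed position, so it can be read on its own.
    static long readRecordCount(byte[] block) {
        try {
            return new DataInputStream(new ByteArrayInputStream(block)).readLong();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    static Map<String, String> readMeta(byte[] block) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
            in.readLong(); // skip the record count
            int n = in.readInt();
            Map<String, String> meta = new LinkedHashMap<>();
            for (int i = 0; i < n; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                meta.put(key, value);
            }
            return meta;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("codec", "gzip"); // sample entry, purely illustrative
        byte[] block = encode(42L, meta);
        System.out.println(readRecordCount(block) + " " + readMeta(block));
    }
}
```

Keeping the record count as a fixed leading field means a reader that only wants the count does not have to parse the rest of the map.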



> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
or decompress. It would be good to have a file format that only needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

