hadoop-common-issues mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6708) New file format for very large records
Date Fri, 16 Apr 2010 00:53:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857627#action_12857627 ]

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. I don't think that works in this scenario. Suppose I have a record that is 8 GB long;
I read the first kilobyte or two out of the record, then intend to discard the rest and start
with the next record.

Your analysis is almost right. However, there is one optimization that could be done to support
this: TFile does block compression, and an 8GB record is likely to exceed the TFile block
size even after compression (unless it is something like all zeros), so it would be the last
record in its block. We can then speed up skipping the last record in a block by positioning
the cursor at the beginning of the next block, without any chunk decoding. On the other hand, if
your 8GB record actually compresses well enough to fit within one TFile block (256KB by default),
then skipping it is really just a sequential read of 256KB from HDFS.
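
To make the proposed optimization concrete, here is a rough sketch in Python. It is a toy
model, not TFile's actual API: the Block class, its fields, and skip_rest_of_record are
all hypothetical names made up for illustration, and it assumes a record never spans
two blocks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for one compressed block in the file."""
    offset: int              # compressed start offset of the block in the file
    record_lens: List[int]   # uncompressed length of each record in the block


def skip_rest_of_record(blocks: List[Block], block_idx: int,
                        record_idx: int, consumed: int) -> Tuple[int, int, int]:
    """Skip the unread remainder of the current record.

    Returns (next_block_idx, next_record_idx, bytes_decoded), where
    bytes_decoded is how much record data had to be decoded just to skip.
    """
    block = blocks[block_idx]
    if record_idx == len(block.record_lens) - 1:
        # Last record in the block: reposition the cursor at the start of
        # the next block (its offset comes from the block index), so no
        # chunk decoding is needed at all.
        return block_idx + 1, 0, 0
    # Otherwise the remainder has to be decoded to find the next record.
    remaining = block.record_lens[record_idx] - consumed
    return block_idx, record_idx + 1, remaining


# A small record followed by an 8GB record that closes the first block.
blocks = [
    Block(offset=0, record_lens=[100, 8 * 2**30]),
    Block(offset=6 * 2**30, record_lens=[100]),
]

# Read 2KB of the 8GB record, then skip the rest: nothing is decoded.
print(skip_rest_of_record(blocks, 0, 1, 2048))
# Skipping the small record still decodes its remaining bytes.
print(skip_rest_of_record(blocks, 0, 0, 10))
```

In this model the cost of skipping the large record does not depend on its size, because
the cursor move relies only on the block index.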

From your description, it seems that you do not plan to use compression, which sounds
a bit surprising to me...



> New file format for very large records
> --------------------------------------
>
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
>
>
> A file format that handles multi-gigabyte records efficiently, with lazy disk access

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
