hadoop-common-dev mailing list archives

From "Jothi Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3514) Reduce seeks during shuffle, by inline crcs
Date Wed, 20 Aug 2008 07:10:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623918#action_12623918 ]

Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------

bq. There is a class DataChecksum in org.apache.hadoop.util. We probably should use it here.

DataChecksum is intended for the typical use case of having one checksum per chunk (bytesPerSum).
In IFile, the intent is to have one checksum per file. The variable 'bytesPerSum' in DataChecksum,
as the name indicates, is the number of bytes over which each checksum is calculated. However, it
is up to the user of the DataChecksum class to honor this value: inside DataChecksum.java,
bytesPerSum is used only in the constructor and while generating the header. Since IFile
does not use the DataChecksum header, we could still use DataChecksum from inside IFile
by passing an arbitrary value for bytesPerSum in the constructor. Note that we do not know
the length of the file a priori, so we are constrained to pass a dummy value. One modification
is needed in DataChecksum.java though: we need to remove the following assert in the update
function. There is already a comment saying that it can be removed. Is it OK to remove
this assert?

{code}
    // Can be removed.
    assert inSum <= bytesPerChecksum : "DataChecksum.update() : inSum " + 
                inSum + " > " + " bytesPerChecksum " + bytesPerChecksum ; 
{code}
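
To illustrate why a single per-file checksum does not depend on any chunk size, here is a
minimal sketch. It uses java.util.zip.CRC32 as a stand-in for DataChecksum (an assumption made
so the example runs on its own); the class and method names are hypothetical and not part of
the patch. The data is fed to the checksum in pieces, as a writer would do when the file
length is not known in advance, and the final value is the same however the pieces are cut.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical sketch: one running checksum over a stream of unknown length.
public class WholeFileChecksumSketch {

  // Feeds data to the checksum in pieces of at most chunkSize bytes.
  // chunkSize only controls how the input is sliced, not what is summed.
  static long checksumInChunks(byte[] data, int chunkSize) {
    Checksum sum = new CRC32(); // stand-in for DataChecksum (assumption)
    for (int off = 0; off < data.length; off += chunkSize) {
      int len = Math.min(chunkSize, data.length - off);
      sum.update(data, off, len);
    }
    return sum.getValue();
  }

  public static void main(String[] args) {
    byte[] data = "key1value1key2value2".getBytes(StandardCharsets.US_ASCII);
    long pieces = checksumInChunks(data, 3);
    long whole = checksumInChunks(data, data.length);
    // Both calls cover the same bytes, so the two values are compared here.
    System.out.println("same checksum: " + (pieces == whole));
  }
}
```

In this sketch nothing bounds the number of bytes passed to update, which is the behaviour
the assert above would otherwise forbid once more than bytesPerChecksum bytes have been summed.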

bq. It may be better to have ChecksumOutputStream extending FilterOutputStream, instead of
OutputStream

Yes, I will make this change.
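
For reference, a rough sketch of what a checksumming stream built on FilterOutputStream could
look like. This is not the code from the patch: the class name, the use of java.util.zip.CRC32,
the finish() method, and the 4-byte big-endian trailer are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical sketch of a checksumming stream layered on FilterOutputStream.
public class ChecksumOutputStreamSketch extends FilterOutputStream {
  private final Checksum sum = new CRC32(); // stand-in checksum (assumption)

  public ChecksumOutputStreamSketch(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    sum.update(b);
    out.write(b);
  }

  // Overridden so bulk writes go to the wrapped stream in one call;
  // the inherited version would write one byte at a time.
  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    sum.update(b, off, len);
    out.write(b, off, len);
  }

  public long getChecksum() {
    return sum.getValue();
  }

  // Appends the checksum inline after the data. The 4-byte big-endian
  // layout is an assumption; the trailer bytes are not themselves summed.
  public void finish() throws IOException {
    int crc = (int) sum.getValue();
    out.write(crc >>> 24);
    out.write(crc >>> 16);
    out.write(crc >>> 8);
    out.write(crc);
    out.flush();
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    ChecksumOutputStreamSketch cos = new ChecksumOutputStreamSketch(sink);
    cos.write(new byte[] {1, 2, 3});
    cos.finish();
    System.out.println("bytes written including trailer: " + sink.size());
  }
}
```

Extending FilterOutputStream means the wrapped stream, flush() and close() are inherited, so
only the write methods and the checksum accessors need to be supplied.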

> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
>                 Key: HADOOP-3514
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3514
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3514-v1.patch, hadoop-3514-v2.patch, hadoop-3514-v3.patch,
hadoop-3514-v4.patch, hadoop-3514-v5.patch, hadoop-3514-v6.patch, hadoop-3514-v7.patch, hadoop-3514-v8.patch,
hadoop-3514.patch
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc into the iFile
rather than having a separate file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

