hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2826) FileSplit.getFile(), LineRecordReader. readLine() need to be removed
Date Mon, 14 Apr 2008 22:55:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588835#action_12588835 ]

Joydeep Sen Sarma commented on HADOOP-2826:

i was pulling these changes into our production environment. quick comment:

we copy over an entire line (from readLine) and then we break it into two strings by splitting
on tab. So there is an extra scan of the input data and an extra copy based on splitting by

instead, if we generalized LineReader to read until it hits a delimiter, we could
do it with one less scan and copy. Something like:

byte [] tabDelimiter = new byte [1]; tabDelimiter[0] = '\t';
byte [] newlineDelimiter = new byte[2]; newlineDelimiter[0] = '\n'; newlineDelimiter[1] =

while() {

getting rid of one copy and scan would be pretty big!
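
For illustration only, here is a minimal sketch of the read-until-delimiter idea described above. It is not Hadoop's LineReader and not the code from the patch: the class name DelimiterReaderSketch and the methods readUntil and readRecord are hypothetical, it assumes single-byte delimiters (tab and newline), and it assumes the caller passes a buffered stream so that byte-at-a-time reads stay cheap.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: split key and value while reading, in a single pass. */
class DelimiterReaderSketch {
    static final byte TAB = '\t';
    static final byte NEWLINE = '\n';

    /**
     * Copies bytes from in to out until one of the stop bytes or end of stream.
     * Returns the stop byte that was hit (it is consumed but not copied),
     * or -1 if the stream ended first.
     */
    static int readUntil(InputStream in, ByteArrayOutputStream out, byte... stops)
            throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            for (byte s : stops) {
                if ((byte) b == s) {
                    return b;
                }
            }
            out.write(b);
        }
        return -1;
    }

    /**
     * Reads one "key TAB value NEWLINE" record. A line without a tab yields
     * an empty value. Returns null once the stream is exhausted.
     */
    static String[] readRecord(InputStream in) throws IOException {
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        // The key ends at the first tab or newline, whichever comes first.
        int hit = readUntil(in, key, TAB, NEWLINE);
        if (hit == -1 && key.size() == 0) {
            return null; // nothing left to read
        }
        if (hit == TAB) {
            // The rest of the line is the value; no second scan for the tab.
            readUntil(in, value, NEWLINE);
        }
        return new String[] {
            new String(key.toByteArray(), StandardCharsets.UTF_8),
            new String(value.toByteArray(), StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "k1\tv1\nk2\n".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        String[] rec;
        while ((rec = readRecord(in)) != null) {
            System.out.println("key=" + rec[0] + " value=" + rec[1]);
        }
    }
}
```

Each input byte is examined once and copied once, directly into either the key or the value buffer, which is the saving the comment is after. A real implementation would presumably reuse buffers across records rather than allocate new ones per call.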

> FileSplit.getFile(), LineRecordReader. readLine() need to be removed
> --------------------------------------------------------------------
>                 Key: HADOOP-2826
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2826
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.17.0
>         Attachments: patch-2826.txt, patch-2826.txt, patch-2826.txt, patch-2826.txt
> The methods FileSplit.getFile() and LineRecordReader.readLine() need to be removed as they
> are deprecated.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
