hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-3255) reduce scans/copies while reading data in hadoop streaming
Date Tue, 15 Apr 2008 00:01:06 GMT
reduce scans/copies while reading data in hadoop streaming
----------------------------------------------------------

                 Key: HADOOP-3255
                 URL: https://issues.apache.org/jira/browse/HADOOP-3255
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.16.2
            Reporter: Joydeep Sen Sarma


Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826

We currently copy over an entire line (from readLine) and then break it into two strings, key
and value, by splitting on tab. So there is an extra scan of the input data (to find the tab)
and an extra copy (to build the two strings).

Instead, if we generalized LineReader to read until it hits a configurable delimiter, we could
do this with one less scan and one less copy. Something like:

byte[] tabDelimiter = { '\t' };
byte[] newlineDelimiter = { '\n', '\r' };

while (...) {
  lineReader.setDelimiter(tabDelimiter);
  lineReader.readLine(key);
  lineReader.setDelimiter(newlineDelimiter);
  lineReader.readLine(value);
}

(Take my proposed interfaces with a pinch of salt; they are just meant to convey the idea.)
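
To make the idea concrete, here is a minimal, self-contained sketch of delimiter-driven reading. It is not Hadoop's LineReader and not a proposed patch: the class DelimiterReaderSketch and the methods readUntil and readRecords are hypothetical names invented for illustration. Each input byte is examined once and written straight into either the key buffer or the value buffer. One assumption that goes beyond the proposal above: the key read also stops at a newline, so a line with no tab becomes a key with an empty value rather than running into the next line.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only; not Hadoop's LineReader API.
class DelimiterReaderSketch {
    static final int EOF = -1;

    // Copies bytes from 'in' to 'out' until one of the delimiter bytes or EOF
    // is seen. Returns the delimiter byte that ended the field, or EOF.
    // Reads one byte at a time for clarity; a real reader would buffer.
    static int readUntil(InputStream in, byte[] delimiters, ByteArrayOutputStream out)
            throws IOException {
        int b;
        while ((b = in.read()) != EOF) {
            for (byte d : delimiters) {
                if (b == (d & 0xff)) {
                    return b;
                }
            }
            out.write(b);
        }
        return EOF;
    }

    // Splits tab-separated key/value lines in a single pass over the input.
    static List<String[]> readRecords(InputStream in) throws IOException {
        byte[] keyDelimiters = { '\t', '\n' };  // assumption: a line without a tab is all key
        byte[] valueDelimiters = { '\n' };      // later tabs stay inside the value
        List<String[]> records = new ArrayList<>();
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        while (true) {
            key.reset();
            value.reset();
            int end = readUntil(in, keyDelimiters, key);
            if (end == EOF && key.size() == 0) {
                break;  // nothing left to read
            }
            if (end == '\t') {
                end = readUntil(in, valueDelimiters, value);
            }
            // Converting to String is only for the demo output below.
            records.add(new String[] {
                new String(key.toByteArray(), StandardCharsets.UTF_8),
                new String(value.toByteArray(), StandardCharsets.UTF_8) });
            if (end == EOF) {
                break;  // last record had no trailing newline
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "k1\tv1\nk2\tv2 with\ttab\nkeyonly\n".getBytes(StandardCharsets.UTF_8);
        for (String[] r : readRecords(new ByteArrayInputStream(data))) {
            System.out.println("[" + r[0] + "] [" + r[1] + "]");
        }
    }
}
```

The sketch passes the delimiter set per call instead of using a setDelimiter method, which avoids mutable reader state; either shape conveys the same idea. It does not handle "\r\n" line endings, which a real implementation would need to decide on.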

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

