hadoop-mapreduce-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-573) reduce scans/copies while reading data in hadoop streaming
Date Thu, 17 Jul 2014 21:58:06 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated MAPREDUCE-573:
---------------------------------------

    Labels: newbie  (was: )

> reduce scans/copies while reading data in hadoop streaming
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-573
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-573
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/streaming
>            Reporter: Joydeep Sen Sarma
>              Labels: newbie
>
> Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826
> We copy over an entire line (from readLine) and then break it into two strings by splitting on tab. So there is an extra scan of the input data and an extra copy caused by the split on tab.
> If we instead generalized LineReader to read until it hits a delimiter, we could do it with one less scan and copy. Something like:
> byte[] tabDelimiter = new byte[1]; tabDelimiter[0] = '\t';
> byte[] newlineDelimiter = new byte[2]; newlineDelimiter[0] = '\n'; newlineDelimiter[1] = '\r';
> while (/* more input */) {
>   lineReader.setDelimiter(tabDelimiter); lineReader.readLine(key);
>   lineReader.setDelimiter(newlineDelimiter); lineReader.readLine(value);
> }
> (Take my proposed interfaces with a pinch of salt; they are just meant to convey the idea.)
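
The idea in the description can be sketched in standalone Java roughly as follows. This is only an illustration under stated assumptions, not Hadoop's LineReader and not the interface proposed above: the class DelimiterSplitSketch and its methods readUntil and readPairs are hypothetical names, the record delimiter is assumed to be a single '\n', and a line with no tab is assumed to be all key with an empty value. Each input byte is examined once and copied once, straight into either the key buffer or the value buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of delimiter-driven reading; not Hadoop's LineReader. */
class DelimiterSplitSketch {

    /**
     * Copies bytes from in to out until one of the stop bytes or end of
     * stream is reached. The stop byte is consumed but not copied.
     * Returns the stop byte that ended the read, or -1 at end of stream.
     * A real implementation would read from an internal buffer rather than
     * calling read() once per byte.
     */
    static int readUntil(InputStream in, ByteArrayOutputStream out, byte[] stops)
            throws IOException {
        out.reset();
        int b;
        while ((b = in.read()) != -1) {
            for (byte s : stops) {
                if (b == (s & 0xFF)) {
                    return b;
                }
            }
            out.write(b);
        }
        return -1;
    }

    /** Reads key<TAB>value<NEWLINE> records without a separate split pass. */
    static List<String[]> readPairs(InputStream in) throws IOException {
        byte[] keyStops = {'\t', '\n'};   // a key ends at the first tab, or at end of line
        byte[] valueStops = {'\n'};       // a value runs to end of line, tabs included
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        List<String[]> pairs = new ArrayList<>();
        while (true) {
            int stop = readUntil(in, key, keyStops);
            if (stop == -1 && key.size() == 0) {
                break;                    // nothing left to read
            }
            value.reset();
            if (stop == '\t') {
                readUntil(in, value, valueStops);
            }
            // Strings are built here only to make the result easy to inspect.
            pairs.add(new String[] {
                new String(key.toByteArray(), StandardCharsets.UTF_8),
                new String(value.toByteArray(), StandardCharsets.UTF_8)
            });
            if (stop == -1) {
                break;                    // last line had no trailing newline
            }
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "k1\tv1\nk2\tv2\n".getBytes(StandardCharsets.UTF_8);
        for (String[] p : readPairs(new ByteArrayInputStream(data))) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Returning the stop byte from readUntil lets the caller tell a tab from an end of line, so a line without a tab does not cause the key read to run into the next record.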



--
This message was sent by Atlassian JIRA
(v6.2#6252)
