hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2285) TextInputFormat is slow compared to reading files.
Date Thu, 27 Dec 2007 00:18:43 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-2285:

    Attachment: fast-line.patch

Ok, here is a patch that:
  1. Stores the data directly in the Text object instead of encoding it as a String, which
avoids the encode/decode cycle.
  2. Merges the buffer and the readLine code, so that readLine can access the buffer directly.
  3. Adds a new method to Text to append bytes.
  4. Adds a new method to Text to clear it back to an empty string.
  5. Adds test cases for the new functionality.
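
To illustrate the idea behind items 1-4, here is a minimal sketch. It is not the code in fast-line.patch. ByteText is a hypothetical stand-in for Text with append and clear methods, BufferLineReader is a hypothetical reader that owns its buffer and scans it for '\n' itself, and the 4-byte buffer size is chosen only to exercise refills. It assumes '\n' line endings and does no '\r' handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FastLineSketch {

  /** Hypothetical stand-in for Text: a growable byte array with append and clear. */
  static final class ByteText {
    private byte[] bytes = new byte[16];
    private int length = 0;

    /** Appends len bytes of src starting at start; no String is created. */
    void append(byte[] src, int start, int len) {
      if (length + len > bytes.length) {
        bytes = Arrays.copyOf(bytes, Math.max(bytes.length * 2, length + len));
      }
      System.arraycopy(src, start, bytes, length, len);
      length += len;
    }

    /** Resets to the empty string but keeps the backing array for reuse. */
    void clear() {
      length = 0;
    }

    int getLength() {
      return length;
    }

    /** Decoding happens only if a caller actually asks for a String. */
    @Override
    public String toString() {
      return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
  }

  /** Hypothetical reader: the buffer and the line scan live in one class. */
  static final class BufferLineReader {
    private final InputStream in;
    private final byte[] buffer;
    private int pos = 0;   // next unread byte in buffer
    private int limit = 0; // number of valid bytes in buffer

    BufferLineReader(InputStream in, int bufferSize) {
      this.in = in;
      this.buffer = new byte[bufferSize];
    }

    /** Reads one line into line; returns bytes consumed, or 0 at end of stream. */
    int readLine(ByteText line) throws IOException {
      line.clear();
      int consumed = 0;
      while (true) {
        if (pos >= limit) {
          // Refill the buffer with one bulk read instead of one call per byte.
          limit = in.read(buffer, 0, buffer.length);
          pos = 0;
          if (limit <= 0) {
            limit = 0;
            return consumed; // end of stream, possibly with a final unterminated line
          }
        }
        // Scan the buffer directly for the end of the line.
        int start = pos;
        while (pos < limit && buffer[pos] != '\n') {
          pos++;
        }
        line.append(buffer, start, pos - start);
        consumed += pos - start;
        if (pos < limit) {
          pos++;      // skip the newline itself
          consumed++;
          return consumed;
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "first line\nsecond\n\nlast".getBytes(StandardCharsets.UTF_8);
    BufferLineReader reader = new BufferLineReader(new ByteArrayInputStream(data), 4);
    ByteText line = new ByteText();
    while (reader.readLine(line) > 0) {
      System.out.println(line.getLength() + " bytes: " + line);
    }
  }
}
```

The same ByteText instance is reused for every record, so the per-line cost is a scan plus an array copy, with no String allocation unless toString is called.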

Using the benchmark on HADOOP-2406, I see a 3x speedup with TextInputFormat (32 seconds
down to 11). This should be a big win for any jobs that scan a lot of text data.

> TextInputFormat is slow compared to reading files.
> --------------------------------------------------
>                 Key: HADOOP-2285
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2285
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.16.0
>         Attachments: fast-line.patch
> The LineRecordReader reads from the source byte by byte, which seems to be half as fast
> as it would be if the readLine method were defined directly on the memory buffer instead
> of on an InputStream.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
