hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4710) SCR should honor dfs.client.read.shortcircuit.buffer.size even when checksums are off
Date Thu, 20 Jun 2013 18:44:20 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-4710:

    Attachment: HDFS-4710.001.patch

This patch fixes it so that SCR honors {{dfs.client.read.shortcircuit.buffer.size}} even when
checksums are off.

I noticed that the checksum and no-checksum paths shared very little code, so I factored
them out into separate classes.  This gets rid of the if (checksum)... branches that we had
everywhere.

Some users might want the existing "no-copy" behavior where we always read directly into the
provided buffer.  They can continue to get that behavior by setting {{dfs.client.read.shortcircuit.buffer.size}}
to 0 and {{dfs.client.read.shortcircuit.skip.checksum}} to true.

> SCR should honor dfs.client.read.shortcircuit.buffer.size even when checksums are off
> -------------------------------------------------------------------------------------
>                 Key: HDFS-4710
>                 URL: https://issues.apache.org/jira/browse/HDFS-4710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.0.4-alpha
>         Environment: Centos (EC2) + short-circuit reads on
>            Reporter: Gopal V
>            Priority: Minor
>              Labels: perfomance
>         Attachments: HDFS-4710.001.patch
> When short-circuit reads are on, the HDFS client slows down when checksums are turned off.
> With checksums on, the query takes 45.341 seconds, and with them turned off it takes 56.345 seconds. This is slower than the speeds observed when short-circuiting is turned off.
> The issue seems to be that FSDataInputStream.readByte() calls are directly transferred to the disk fd when the checksums are turned off.
> Even though all the columns are integers, the data being read will be read via DataInputStream, which does
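> {code}
> public final int readInt() throws IOException {
>         // Four separate single-byte reads against the underlying stream
>         int ch1 = in.read();
>         int ch2 = in.read();
>         int ch3 = in.read();
>         int ch4 = in.read();
>         if ((ch1 | ch2 | ch3 | ch4) < 0)
>             throw new EOFException();
>         return ((ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0));
> }
> {code}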
> To confirm, an strace of the YARN container shows
> {code}
> 26690 read(154, "B", 1)                 = 1
> 26690 read(154, "\250", 1)              = 1
> 26690 read(154, ".", 1)                 = 1
> 26690 read(154, "\24", 1)               = 1
> {code}
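
The effect described above can be reproduced without HDFS. The following is a minimal sketch, not code from the patch: it counts how many read calls reach an in-memory source when DataInputStream.readInt() runs directly on it versus through a BufferedInputStream. The class ReadIntCallCount and its CountingInputStream helper are hypothetical names introduced for illustration. The exact unbuffered count depends on how the running JDK implements readInt(), so no specific numbers are claimed here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadIntCallCount {

    /** Hypothetical helper: counts read calls that reach the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /**
     * Reads `ints` integers via DataInputStream and returns how many read
     * calls reached the source. A bufferSize of 0 means no buffering layer.
     */
    static long underlyingCalls(int ints, int bufferSize) throws IOException {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[ints * 4]));
        InputStream in = bufferSize > 0
                ? new BufferedInputStream(source, bufferSize)
                : source;
        DataInputStream data = new DataInputStream(in);
        for (int i = 0; i < ints; i++) {
            data.readInt();
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + underlyingCalls(1024, 0));
        System.out.println("buffered calls:   " + underlyingCalls(1024, 4096));
    }
}
```

When the source is a file descriptor rather than a byte array, each call that reaches it becomes a read(2) syscall, which is what the strace output shows.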
> To emulate this without the entirety of Hive code, I have written a simpler test app
> https://github.com/t3rmin4t0r/shortcircuit-reader
> The jar will read a file in -bs <n> sized buffers. Running it with 1 byte blocks gives similar results to the Hive test run.
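
For readers who want to see the shape of such a test, here is a rough sketch of a block-size-driven reader. It is not the linked test app: it reads a plain local file through FileInputStream rather than going through the HDFS client, and the class name BlockSizeReader and its argument handling are assumptions made for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BlockSizeReader {

    /** Reads the whole file in blockSize-byte chunks; returns {bytesRead, readCalls}. */
    static long[] readAll(Path file, int blockSize) throws IOException {
        byte[] block = new byte[blockSize];
        long bytes = 0;
        long calls = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            int n;
            // read() returns -1 at end of file, which ends the loop
            while ((n = in.read(block)) > 0) {
                bytes += n;
                calls++;
            }
        }
        return new long[] {bytes, calls};
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java BlockSizeReader <file> [blockSize]
        Path file = Paths.get(args[0]);
        int blockSize = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        long start = System.nanoTime();
        long[] result = readAll(file, blockSize);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result[0] + " bytes in " + result[1]
                + " read calls, " + millis + " ms");
    }
}
```

With a block size of 1, every byte costs one call into the stream, which mirrors the single-byte read pattern reported in the issue.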
