hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2608) Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
Date Thu, 17 Jan 2008 06:06:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559812#action_12559812 ]

Runping Qi commented on HADOOP-2608:

I profiled the program that reads the sequence files.
It turned out that a lot of cpu was spent on deserializing the values.
The values are of a JuteRecord class that has many fields of ustring type.
Deserializing an object of that class involves calling org.apache.hadoop.record.Utils.fromBinaryString,
which is very expensive (compared with deserializing the buffer type).
After I replaced the ustring type with the buffer type in the jute ddl, the scan throughput improved
by 3x!
Not surprisingly, org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect
then became the most expensive operation (28% of cpu spent on that call).
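
For anyone who wants to get a feel for this difference outside of Hadoop, here is a crude stand-alone sketch. It only contrasts decoding UTF-8 bytes into a String with copying the same bytes as-is. The class and method names (FieldDecodeCost, readAsString, readAsBytes) are made up for illustration; it does not call the Hadoop record code, has no JIT warm-up, and its numbers should not be read as a reproduction of the 3x figure above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-alone comparison; none of these names come from Hadoop.
final class FieldDecodeCost {

    // What a string-typed field needs: decode the UTF-8 payload into a String.
    static String readAsString(byte[] src, int off, int len) {
        return new String(src, off, len, StandardCharsets.UTF_8);
    }

    // What a raw byte-buffer field needs: just copy the payload.
    static byte[] readAsBytes(byte[] src, int off, int len) {
        return Arrays.copyOfRange(src, off, off + len);
    }

    public static void main(String[] args) {
        // Build a payload that mixes ASCII with one two-byte UTF-8 character.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("field-value-").append(i).append('\u00e9');
        }
        byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8);

        int rounds = 20000;
        long sink = 0; // keeps the JIT from discarding the work

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += readAsString(payload, 0, payload.length).length();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += readAsBytes(payload, 0, payload.length).length;
        }
        long t2 = System.nanoTime();

        System.out.printf("decode to String: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("copy as bytes:    %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println("(sink=" + sink + ")");
    }
}
```

A proper harness such as JMH would give more trustworthy numbers; this is only meant to show the shape of the comparison.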

So, one thing we learnt here is that the cost of deserializing ustring is 3x that of deserializing buffer.
That seems to be too high a cost to pay for using ustring for large amounts of data.

An obvious question: is there some low-hanging fruit for improving org.apache.hadoop.record.Utils.fromBinaryString?
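
One generic idea, offered only as a sketch: if most ustring values are plain ASCII, a decoder can widen bytes to chars directly and fall back to the full UTF-8 decoder only when it sees a byte with the high bit set. The code below uses a hypothetical AsciiFastPath.decode helper and JDK classes only. It is not based on what fromBinaryString actually does internally, and whether it helps at all would have to be measured, since the JVM's own decoder may already be well optimized for this case.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of org.apache.hadoop.record.
final class AsciiFastPath {

    // Decodes len bytes of UTF-8 starting at off.
    // Pure-ASCII input is widened byte by byte; anything else is handed
    // to the JDK's UTF-8 decoder for the whole range.
    static String decode(byte[] buf, int off, int len) {
        char[] chars = new char[len];
        for (int i = 0; i < len; i++) {
            byte b = buf[off + i];
            if (b < 0) {
                // High bit set: this is part of a multi-byte sequence.
                return new String(buf, off, len, StandardCharsets.UTF_8);
            }
            chars[i] = (char) b;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] ascii = "plain ascii".getBytes(StandardCharsets.UTF_8);
        byte[] mixed = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(ascii, 0, ascii.length));
        System.out.println(decode(mixed, 0, mixed.length));
    }
}
```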


> Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
> -----------------------------------------------------------------------------------------------
>                 Key: HADOOP-2608
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2608
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>            Reporter: Runping Qi
> I did some tests on the throughput of scanning block-compressed sequence files.
> The sustained throughput was bounded at 5MB/sec per process, with the cpu of each process maxed at 100%.
> It seems to me that the cpu consumption is too high and the throughput is too low for just scanning files.
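
The MB/sec figure in the description can be computed with something as simple as the sketch below. It drains any InputStream and divides bytes read by elapsed time. The names (ScanThroughput, drain, mbPerSec) are hypothetical, and the in-memory stream in main is only a stand-in; the actual test read block-compressed sequence files through Hadoop, which this sketch does not attempt to reproduce.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical measurement helper using only JDK classes.
final class ScanThroughput {

    // Reads the stream to the end and returns the number of bytes seen.
    static long drain(InputStream in) {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Converts a byte count and elapsed nanoseconds into MB/sec (1 MB = 2^20 bytes).
    static double mbPerSec(long bytes, long nanos) {
        return (bytes / (1024.0 * 1024.0)) / (nanos / 1e9);
    }

    public static void main(String[] args) {
        // Stand-in input; replace with the stream actually being scanned.
        InputStream in = new ByteArrayInputStream(new byte[32 * 1024 * 1024]);
        long start = System.nanoTime();
        long bytes = drain(in);
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d bytes, %.1f MB/sec%n", bytes, mbPerSec(bytes, elapsed));
    }
}
```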

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
