hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2608) Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
Date Thu, 17 Jan 2008 06:06:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559812#action_12559812
] 

Runping Qi commented on HADOOP-2608:
------------------------------------


I profiled the program that reads the sequence files.
It turned out that a lot of CPU time was spent deserializing the values.
The values are of a JuteRecord class that has a great many fields of ustring type.
Deserializing an object of that class involves calling org.apache.hadoop.record.Utils.fromBinaryString,
which is very expensive (compared with deserializing the buffer type).
After I replaced the ustring type with the buffer type in the Jute DDL, the scan throughput improved
by 3x!
Not surprisingly, org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect
then became the most expensive operation (28% of CPU time spent in that call).

So, one thing we learned here is that deserializing a ustring costs about 3x as much as deserializing
a buffer.
That seems too high a cost to pay for using ustring with large amounts of data.
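
For a rough feel of where such a gap can come from, here is a minimal sketch that uses only the JDK, not the Hadoop record code. It assumes a buffer field amounts to copying the payload bytes and a ustring field amounts to decoding the same bytes as UTF-8 into a java.lang.String. The class name DecodeCostSketch, its two helper methods, the sample payload, and the iteration counts are all hypothetical. It is not a rigorous benchmark, and it does not reproduce what org.apache.hadoop.record.Utils.fromBinaryString does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical micro-benchmark: raw byte copy vs. UTF-8 decode of the same payload.
class DecodeCostSketch {

    // Stand-in for reading a "buffer" field: the bytes are copied unchanged.
    static byte[] readAsBuffer(byte[] src, int off, int len) {
        return Arrays.copyOfRange(src, off, off + len);
    }

    // Stand-in for reading a "ustring" field: the bytes are decoded as UTF-8.
    static String readAsString(byte[] src, int off, int len) {
        return new String(src, off, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample payload; real records will have different field sizes.
        byte[] payload = "a field value of moderate length, mostly ASCII text"
                .getBytes(StandardCharsets.UTF_8);
        int warmup = 1_000_000;
        int iterations = 10_000_000;
        long sink = 0; // accumulated so the JIT cannot discard the work

        // Warm up both paths before timing.
        for (int i = 0; i < warmup; i++) {
            sink += readAsBuffer(payload, 0, payload.length).length;
            sink += readAsString(payload, 0, payload.length).length();
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += readAsBuffer(payload, 0, payload.length).length;
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += readAsString(payload, 0, payload.length).length();
        }
        long t2 = System.nanoTime();

        System.out.printf("buffer copy:   %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("string decode: %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println("sink = " + sink);
    }
}
```

The ratio this prints depends heavily on the JVM version, the payload length, and how much non-ASCII text is involved, so it should be read only as an illustration of the two code paths, not as a confirmation of the 3x figure measured above.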

An obvious question: is there some low-hanging fruit in improving org.apache.hadoop.record.Utils.fromBinaryString?


 

> Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2608
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2608
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>            Reporter: Runping Qi
>
> I did some tests on the throughput of scanning block-compressed sequence files.
> The sustained throughput was bounded at 5MB/sec per process, with the cpu of each process maxed at 100%.
> It seems to me that the cpu consumption is too high and the throughput is too low for just scanning files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

