hadoop-common-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3288) Serial streaming performance should be Math.min(ideal client performance, ideal serial hdfs performance)
Date Tue, 22 Apr 2008 05:47:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591206#action_12591206 ]

stack commented on HADOOP-3288:
-------------------------------

> For an application like HBase, what is more important: overall throughput (transactions per sec) or latency of serial requests?

If you put the question that way, overall throughput is more important.

Raghu, is the argument that we can't have it both ways?

Making a guess, I'd say that the character of reading on HBase clusters will be predominantly random reads, with a much smaller number of concurrent scans of all or parts of tables (a 'scan' implies serial reading of files).  In the HBase case, we go out of our way to keep files in HDFS small, never > 256M or so (about 2 to 5 blocks).



> Serial streaming performance should be Math.min(ideal client performance, ideal serial hdfs performance)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3288
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3288
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.16.3, 0.18.0
>         Environment: Mac OS X  10.5.2, Java 6
>            Reporter: Sam Pullara
>             Fix For: 0.18.0
>
>
> I looked at all the code long and hard and this was my analysis (could be wrong, I'm not an expert on this codebase):
> Current Serial HDFS performance = Average Datanode Performance
> Average Datanode Performance = Average Disk Performance (even if you have more than one)
> We should have:
> Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
> Ideal Datanode Performance = Sum of disk performance
> When you read a single file serially from HDFS, there are a number of limitations that come into play:
> 1) Blocks on multiple datanodes will be load balanced between them - averaging the performance of the datanodes
> 2) Blocks on multiple disks in a single datanode are load balanced between them - averaging the performance of the disks
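A back-of-the-envelope sketch of the formulas above, in Java. The class name, method names, and the sample numbers (3 datanodes, 2 disks per node, 50 MB/s per disk, a 120 MB/s client) are illustrative assumptions, not measurements from any cluster:

```java
// Hypothetical arithmetic illustrating "average vs sum" from the analysis above.
public class ThroughputSketch {

    // Current behavior: serial reads rotate one block at a time across
    // disks/datanodes, so throughput averages out to roughly one disk's rate.
    static double currentSerialThroughput(double diskMBps) {
        return diskMBps;
    }

    // Ideal Datanode Performance = sum of disk performance.
    static double idealDatanodeThroughput(int disksPerNode, double diskMBps) {
        return disksPerNode * diskMBps;
    }

    // Ideal Serial HDFS Performance = sum of ideal datanode performance.
    static double idealSerialThroughput(int nodes, int disksPerNode, double diskMBps) {
        return nodes * idealDatanodeThroughput(disksPerNode, diskMBps);
    }

    // The issue title's target: min(ideal client, ideal serial HDFS).
    static double achievableSerialThroughput(double clientMBps, int nodes,
                                             int disksPerNode, double diskMBps) {
        return Math.min(clientMBps, idealSerialThroughput(nodes, disksPerNode, diskMBps));
    }

    public static void main(String[] args) {
        System.out.println(currentSerialThroughput(50.0));                  // 50.0 MB/s
        System.out.println(idealSerialThroughput(3, 2, 50.0));              // 300.0 MB/s
        System.out.println(achievableSerialThroughput(120.0, 3, 2, 50.0)); // 120.0 MB/s
    }
}
```

With these assumed numbers, today's behavior delivers about one disk's worth of bandwidth (50 MB/s), while the ideal is capped by the client at 120 MB/s.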
> I think that all this could be fixed if we actually prefetched fully read blocks on the client until the client can no longer keep up with the data or there is another bottleneck like network bandwidth.
> This seems like a reasonably common use case though not the typical MapReduce case.
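The prefetching proposed above might be sketched as a bounded client-side readahead buffer: a background thread fetches whole blocks ahead of the reader, and the bounded queue provides backpressure once the client stops keeping up. Everything here is hypothetical illustration; `BlockPrefetcher` and its stand-in `fetchBlock` are not DFSClient code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of client-side block readahead. In a real client,
// fetchBlock would read from a datanode; here it just fabricates bytes.
public class BlockPrefetcher {
    private final BlockingQueue<byte[]> buffer;
    private final ExecutorService fetcher = Executors.newSingleThreadExecutor();

    public BlockPrefetcher(int readaheadBlocks) {
        // Bounded capacity = how far ahead of the reader we prefetch.
        this.buffer = new ArrayBlockingQueue<>(readaheadBlocks);
    }

    // Begin fetching blocks [0, numBlocks). put() blocks when the buffer
    // is full, so prefetching pauses whenever the client falls behind.
    public void start(int numBlocks, int blockSize) {
        fetcher.submit(() -> {
            try {
                for (int i = 0; i < numBlocks; i++) {
                    buffer.put(fetchBlock(i, blockSize));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // The reader consumes blocks in order; waits if none is ready yet.
    public byte[] nextBlock() throws InterruptedException {
        return buffer.take();
    }

    // Stand-in for reading one block from a datanode.
    private byte[] fetchBlock(int index, int blockSize) {
        byte[] b = new byte[blockSize];
        b[0] = (byte) index; // tag the block so ordering is visible
        return b;
    }

    public void shutdown() {
        fetcher.shutdownNow();
    }
}
```

Because consecutive blocks usually live on different datanodes (and disks), fetching block N+1 while the client consumes block N lets serial throughput approach the sum rather than the average of the underlying devices.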

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

