hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
Date Tue, 22 Mar 2016 10:11:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206107#comment-15206107 ]

Steve Loughran commented on HADOOP-12949:

There's actually some metrics collection in OpenStack Swift; look under {{org.apache.hadoop.fs.swift.util.DurationStats}}.
The stats log primarily to stdout and list min, max, (moving) arithmetic mean and stddev, by HTTP verb.

# It's pretty low cost to do this; even when HTrace sampling is inactive, the stats for an
FS can be collected.
# The stats showed that Rackspace UK throttles delete requests; the more files in a directory
I was cleaning up on teardown, the longer it took, and the time grew exponentially rather than linearly.
# I didn't hook the code up to the normal Hadoop metrics; it's something I'd like as an option
now, because it does become something you need to monitor now we are shifting to longer-lived
# I'd add more on causes of operations, specifically: open(), seek(), duration of close(),
delete(). These are the things where the fact that object stores are generally O(files*data) means they
don't work as expected. Finding that mismatch of expectations matters.
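
To make the per-verb statistics above concrete, here is a minimal sketch of a collector that tracks count, min, max, mean and standard deviation per operation name. It is not the hadoop-openstack {{DurationStats}} class: the names {{VerbDurationStats}}, {{Stats}} and {{record}} are hypothetical, and it keeps a cumulative mean (Welford's algorithm) rather than a moving one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; NOT the hadoop-openstack DurationStats class. */
final class VerbDurationStats {

  /** Running statistics for one operation, updated with Welford's algorithm. */
  static final class Stats {
    private long count;
    private double mean;
    private double m2;                  // sum of squared deviations from the running mean
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    synchronized void add(long millis) {
      count++;
      double delta = millis - mean;
      mean += delta / count;            // cumulative mean, not a moving window
      m2 += delta * (millis - mean);
      min = Math.min(min, millis);
      max = Math.max(max, millis);
    }

    synchronized long count() { return count; }
    synchronized long min() { return min; }
    synchronized long max() { return max; }
    synchronized double mean() { return mean; }

    /** Sample standard deviation; 0 until there are at least two samples. */
    synchronized double stddev() {
      return count < 2 ? 0.0 : Math.sqrt(m2 / (count - 1));
    }

    @Override
    public synchronized String toString() {
      return String.format("count=%d min=%d max=%d mean=%.2f stddev=%.2f",
          count, min, max, mean, stddev());
    }
  }

  private final Map<String, Stats> byVerb = new ConcurrentHashMap<>();

  /** Record the duration of one operation, keyed by verb or operation name. */
  void record(String verb, long millis) {
    byVerb.computeIfAbsent(verb, v -> new Stats()).add(millis);
  }

  /** Returns the stats for a verb, or null if nothing was recorded for it. */
  Stats get(String verb) {
    return byVerb.get(verb);
  }
}
```

A caller would time each request and call something like {{stats.record("DELETE", elapsedMillis)}}. Because the bookkeeping is a handful of arithmetic operations per request, it can stay on regardless of whether any trace sampling is active.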

More and more object stores are coming in. While S3 is the main one, it'd be good to have
the core stuff store-neutral. The classes from hadoop-openstack can be moved if that helps;
the per-verb stuff is useful at the deep levels, while HTrace monitoring can track the cost of
specific actions.

> Add HTrace to the s3a connector
> -------------------------------
>                 Key: HADOOP-12949
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12949
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Madhawa Gunasekara
> Hi All, 
> s3, GCS, WASB, and other cloud blob stores are becoming increasingly important in Hadoop.
> But we don't have distributed tracing for these yet. It would be interesting to add distributed
> tracing here. It would enable collecting really interesting data like probability distributions
> of PUT and GET requests to s3 and their impact on MR jobs, etc.
> I would like to implement this feature, Please shed some light on this 
> Thanks,
> Madhawa

This message was sent by Atlassian JIRA
