hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10175) add per-operation stats to FileSystem.Statistics
Date Fri, 22 Apr 2016 19:26:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254503#comment-15254503
] 

Colin Patrick McCabe commented on HDFS-10175:
---------------------------------------------

bq. Can I also note that as the @Public @Stable FileSystem is widely subclassed, with its
protected statistics field accessed in those subclasses, nobody is allowed to take it or its
current methods away. Thanks.

Yeah, I agree.  I would like to see us be more cautious about adding new things to {{FileSystem#Statistics}},
though, since I think it's not a good match for most of the new stats we're proposing here.

bq. There's no per-thread tracking; it's collecting overall stats, rather than trying to
add up the cost of a single execution, which is what per-thread stuff would, presumably, do.
This is lower cost but still permits microbenchmark-style analysis of performance problems
against S3a. It doesn't directly let you get results of a job, "34MB of data, 2000 stream
aborts, 1998 backward seeks", which are the kind of things I'm curious about.

Overall stats are lower cost in terms of memory consumption and the cost to read (as opposed
to update) a metric.  They are higher cost in terms of the CPU consumed for each update of
the metric.  In particular, for applications that perform a lot of stream operations from many
different threads, updating a shared AtomicLong can become a performance bottleneck.
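
To make the trade-off concrete, here is a minimal sketch (not Hadoop code; all names are illustrative) contrasting a shared AtomicLong, where every update is a contended CAS, with per-thread counters that are cheap to update but must be summed up on the read side:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterStyles {
    // Shared counter: cheap to read, but every update contends on one cache line.
    static final AtomicLong sharedBytesRead = new AtomicLong();

    // Per-thread counter: uncontended plain writes, but reads must visit
    // every thread's copy and sum them.
    static final ThreadLocal<long[]> threadBytesRead =
        ThreadLocal.withInitial(() -> new long[1]);

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        long[] partials = new long[4];
        for (int i = 0; i < 4; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    sharedBytesRead.addAndGet(10);   // contended CAS per update
                    threadBytesRead.get()[0] += 10;  // plain write, no contention
                }
                partials[id] = threadBytesRead.get()[0];
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        long perThreadTotal = 0;
        for (long p : partials) perThreadTotal += p;
        System.out.println(sharedBytesRead.get()); // 40000
        System.out.println(perThreadTotal);        // 40000
    }
}
```

Both styles arrive at the same total; the difference is where the cost lands, on every update (shared) or on each read (per-thread aggregation).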

One of the points that I was making above is that I think it's appropriate for some metrics
to be tracked per-thread, but for others, we probably want to use AtomicLong or similar.
I would expect that anything that leads to an S3 RPC could easily be tracked by an
AtomicLong, since the overhead of the network activity would dwarf the AtomicLong
update overhead.  And we should have a common interface for getting this information that
MR and stats consumers can use.
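
As a hedged sketch of what such a common interface might look like (the names {{OpStatsView}}, {{S3aOpStats}}, and the counter keys below are all hypothetical, not a proposal for the actual Hadoop API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical read-side interface that MR or other stats consumers could
// query, regardless of how each metric is stored internally.
interface OpStatsView {
    long getCounter(String name);
}

public class S3aOpStats implements OpStatsView {
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Each S3 RPC increments its own named counter; the AtomicLong update
    // cost is negligible next to the network round trip.
    public void increment(String name) {
        counters.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
    }

    @Override
    public long getCounter(String name) {
        AtomicLong c = counters.get(name);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) {
        S3aOpStats stats = new S3aOpStats();
        stats.increment("streamAborts");
        stats.increment("streamAborts");
        stats.increment("backwardSeeks");
        System.out.println(stats.getCounter("streamAborts"));  // 2
        System.out.println(stats.getCounter("backwardSeeks")); // 1
    }
}
```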

bq. Maybe, and this would be nice, whatever is implemented here is (a) extensible to support
some duration type too, at least in parallel, 

The interface here supports storing durations as 64-bit numbers of milliseconds, which seems
good.  It is up to the implementor of the statistic to determine what the 64-bit long represents
(average duration in ms, median duration in ms, number of RPCs, and so on).  This is similar
to metrics2 and JMX, where you have basic types that can be used in a few different ways.
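
For example, one implementor might decide that the long means "average RPC duration in milliseconds"; a minimal illustration (the class and method names here are hypothetical):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical statistic whose exposed 64-bit long means "mean RPC duration
// in milliseconds"; the meaning is a convention chosen by the implementor.
public class DurationStat {
    private long totalNanos;
    private long count;

    public void record(long nanos) {
        totalNanos += nanos;
        count++;
    }

    // Exposed as a plain long, like any other counter.
    public long meanDurationMs() {
        return count == 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(totalNanos / count);
    }

    public static void main(String[] args) {
        DurationStat s = new DurationStat();
        s.record(TimeUnit.MILLISECONDS.toNanos(30));
        s.record(TimeUnit.MILLISECONDS.toNanos(50));
        System.out.println(s.meanDurationMs()); // 40
    }
}
```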

bq. and (b) could be used as a back end by both Metrics2 and Coda Hale metrics registries.
That way the slightly more expensive metric systems would have access to this more raw data.

Sure.  The difficult question is how metrics2 hooks up to metrics which are per-FS or
per-stream.  Should the output of metrics2 reflect the union of all existing FS and stream
instances?  Some applications open a very large number of streams, so it seems impractical
for metrics2 to include all of those streams in its output.

> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>
>                 Key: HDFS-10175
>                 URL: https://issues.apache.org/jira/browse/HDFS-10175
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Ram Venkatesh
>            Assignee: Mingliang Liu
>         Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch, HDFS-10175.002.patch,
> HDFS-10175.003.patch, HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java
>
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in turn exposed as job counters by MapReduce and other frameworks. There is
> logic within DfsClient to map operations to these counters that can be confusing; for instance,
> mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation, including create, append, createSymlink,
> delete, exists, mkdirs, and rename, and expose them as new properties on the Statistics
> object. The operation-specific counters can be used for analyzing the load imposed by a
> particular job on HDFS.
> For example, we can use them to identify jobs that end up creating a large number of files.
> Once this information is available in the Statistics object, app frameworks like
> MapReduce can expose them as additional counters to be aggregated and recorded as part
> of the job summary.
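
The quoted description above proposes one counter per DfsClient operation; a minimal hypothetical illustration of that idea (the enum and class names below are invented here, not taken from the attached patches):

```java
import java.util.EnumMap;

// Hypothetical per-operation counters keyed by an enum, instead of folding
// every operation into the generic readOps/writeOps buckets.
public class PerOpStatistics {
    enum Op { CREATE, APPEND, CREATE_SYMLINK, DELETE, EXISTS, MKDIRS, RENAME }

    private final EnumMap<Op, Long> counts = new EnumMap<>(Op.class);

    void record(Op op) {
        counts.merge(op, 1L, Long::sum);
    }

    long get(Op op) {
        return counts.getOrDefault(op, 0L);
    }

    public static void main(String[] args) {
        PerOpStatistics stats = new PerOpStatistics();
        stats.record(Op.MKDIRS);
        stats.record(Op.CREATE);
        stats.record(Op.CREATE);
        // mkdirs now has its own counter rather than counting as a writeOp.
        System.out.println(stats.get(Op.MKDIRS)); // 1
        System.out.println(stats.get(Op.CREATE)); // 2
    }
}
```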



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
