hadoop-hdfs-issues mailing list archives

From "Erik Krogen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-10872) Add MutableRate metrics for FSNamesystemLock operations
Date Tue, 04 Oct 2016 00:23:20 GMT

     [ https://issues.apache.org/jira/browse/HDFS-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated HDFS-10872:
-------------------------------
    Attachment: FSLockPerf.java

Some comments on performance considerations...

I am attaching the code ({{FSLockPerf.java}}) that I used for some rudimentary microbenchmarking.
It's not perfect, but hopefully it gives an idea of how much overhead this may incur. If
anyone is interested in seeing other numbers, let me know and I will do my best to generate
them. Note that this feature is disabled by default, so no overhead is incurred by those who
do not actively opt in.

I include two different tests, "overall" and "aggTime". Both focus on the worst-case scenario
in which all threads are reader threads, i.e. they are not hindered by the Namesystem lock
and contend solely on metrics. Both use 200 threads to model a highly contended system. Also,
every aggregation involves 50 operations, emulating 50 distinct operation types occurring on
each thread since the last aggregation, which seems a conservatively high upper bound since
most operations are uncommon.

"overall" tries to be more holistic but has a higher degree of variability, since real locks
are being acquired and released. This test sets the aggregation interval to various values
(including completely disabled, and an interval high enough that aggregation is never
triggered) and measures the overall time it takes each of the 200 threads to complete 500,000
cycles of read lock/unlock (including all metrics-related operations). Over 1,000 iterations
I got:
{code}
Agg Interval    Total Time MS (Avg)     Total Time MS (StdDev)
0               30518                   1777
9999999         30825                   1673
20000           30183                   1709
10000           30272                   1681
5000            30278                   1740
1000            30307                   1702
10              30350                   1692
{code}
Clearly the metrics processing fits within the noise of the locking itself: the averages
differ by at most ~650 ms while each standard deviation is ~1,700 ms, and the average of the
runs with the logic disabled ended up being higher than with the logic enabled. Still, these
results were not very satisfying, so I tried to be more specific with aggTime.
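To make the shape of the "overall" test concrete, here is a minimal sketch of timing many reader threads cycling a read lock. It is not the attached {{FSLockPerf.java}}: the class name {{ReadLockCycleSketch}}, the method {{timeReadLockCycles}}, the use of a fair {{ReentrantReadWriteLock}}, and the small thread and cycle counts in {{main}} are all assumptions for illustration, and the metrics bookkeeping is left out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not the attached FSLockPerf.java.
public class ReadLockCycleSketch {

    /** Starts `threads` readers, each doing `cycles` read lock/unlock pairs; returns elapsed ms. */
    public static long timeReadLockCycles(int threads, int cycles) {
        // Fair mode is an assumption of this sketch.
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
        final CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all readers at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int c = 0; c < cycles; c++) {
                    lock.readLock().lock();
                    try {
                        // A real test would do its metrics bookkeeping around this point.
                    } finally {
                        lock.readLock().unlock();
                    }
                }
            });
            workers[i].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return (System.nanoTime() - begin) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Far smaller than the 200 threads x 500,000 cycles described above, to keep the run short.
        System.out.println("elapsed ms: " + timeReadLockCycles(8, 100_000));
    }
}
```

Since only read locks are taken, the threads never block each other on the lock itself, which is why any shared metrics state becomes the main point of contention in this scenario.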

aggTime is the narrower of the two. I assume the local tracking of metrics is very cheap
(simply incrementing a counter within a ThreadLocal), so I focus on the time taken by the more
expensive aggregation step, which involves a {{synchronized}} method to update the
{{MutableRate}} metric. First I run a test with only a single thread updating metrics, then
run the full 200 threads under a few different conditions: with aggregation turned on and off
(to get a baseline for performance with many threads running), and with and without a
1-millisecond sleep between operations (to emulate slightly less pessimistic lock contention).
Each thread does 10,000 aggregations and I measure the time per operation; over 10 trials I got:
{code}
10000 aggregation per thread over 100 trials
Test                Average Time (ns)       Std Dev (ns)
Single Thread       3107                    606
No Agg, No Wait     551                     551
Agg, No Wait        235850                  24059
No Agg, Wait        1065525                 625
Agg, Wait           1158477                 8743
{code}
So it seems that even under highly contended conditions an aggregation adds on the order of
100-200 microseconds to the execution path (from the table: about 235 microseconds with no
wait, and 1158477 - 1065525 = 92952 ns, about 93 microseconds, with the 1 ms wait), and
without contention only ~3-4 microseconds. Given that a typical aggregation period would be
the same as the metrics collection interval, say 10-60 seconds, this seems reasonable for a
disabled-by-default feature.
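For readers who have not looked at the patch, the pattern being measured can be sketched roughly as follows: each thread accumulates per-operation counts locally, and only occasionally merges them into shared state through a {{synchronized}} method. This is a hedged illustration and not the code in HDFS-10872.000.patch: the class {{LocalThenAggregateSketch}}, its methods, and the plain map used as a stand-in for the {{MutableRate}} metric are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "cheap thread-local tracking, occasional synchronized aggregation".
public class LocalThenAggregateSketch {
    // Shared totals per operation: {count, totalNanos}. Stand-in for a MutableRate-like metric.
    private final Map<String, long[]> shared = new HashMap<>();
    // Per-thread totals; updated without any locking.
    private final ThreadLocal<Map<String, long[]>> local = ThreadLocal.withInitial(HashMap::new);
    private final ThreadLocal<Long> lastFlush = ThreadLocal.withInitial(System::nanoTime);
    private final long intervalNanos;

    public LocalThenAggregateSketch(long intervalNanos) {
        this.intervalNanos = intervalNanos;
    }

    /** Cheap path: bump thread-local counters, and aggregate only if the interval has elapsed. */
    public void record(String op, long heldNanos) {
        long[] stat = local.get().computeIfAbsent(op, k -> new long[2]);
        stat[0]++;
        stat[1] += heldNanos;
        if (System.nanoTime() - lastFlush.get() >= intervalNanos) {
            flush();
        }
    }

    /** Expensive path: push this thread's totals into the shared map, then reset them. */
    public void flush() {
        Map<String, long[]> mine = local.get();
        if (!mine.isEmpty()) {
            merge(mine);
            mine.clear();
        }
        lastFlush.set(System.nanoTime());
    }

    // The only point where threads contend with one another.
    private synchronized void merge(Map<String, long[]> delta) {
        for (Map.Entry<String, long[]> e : delta.entrySet()) {
            long[] total = shared.computeIfAbsent(e.getKey(), k -> new long[2]);
            total[0] += e.getValue()[0];
            total[1] += e.getValue()[1];
        }
    }

    public synchronized long count(String op) {
        long[] total = shared.get(op);
        return total == null ? 0L : total[0];
    }

    public synchronized long totalNanos(String op) {
        long[] total = shared.get(op);
        return total == null ? 0L : total[1];
    }
}
```

With a long interval, {{record}} almost always stays on the lock-free path, so the {{synchronized}} merge cost is paid once per interval per thread rather than once per lock operation.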


> Add MutableRate metrics for FSNamesystemLock operations
> -------------------------------------------------------
>
>                 Key: HDFS-10872
>                 URL: https://issues.apache.org/jira/browse/HDFS-10872
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>         Attachments: FSLockPerf.java, HDFS-10872.000.patch
>
>
> Add metrics for FSNamesystemLock operations to see, overall, how long each operation
> is holding the lock for. Use MutableRate metrics for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

