hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
Date Wed, 12 Nov 2014 02:03:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207528#comment-14207528 ]

Andrew Wang commented on HDFS-6982:

Hi Maysam, I did another round of review. Mostly usability stuff and small cleanups; the core
functionality still looks great, as before. As always, thanks for your patience.

TopConf / DFSConfigKeys / hdfs-default.xml
* Can use Configuration.getInts rather than parsing yourself, also would be good to do basic
validation that the intervals are >0 mins via a Precondition check.
* Do you mind writing out the key strings in DFSConfigKeys? It's how we do it in the rest
of the file, and more readable.
* Consider renaming "windowmanager.topsize" to "num.users"
* Consider renaming "window.buckets" to "window.num.buckets" and "periods.min" to "windows.minutes"
* We could set the default value to {{int[] {1,5,25}}} to make the logic a little easier.
I noticed that the default in hdfs-default and DFSConfigKeys is different too.
* Seems like 30min would be a more human-friendly number than 25min also
* hdfs-default.xml typo: "seperated" -> "separated"
* I noticed that NameNodeMetrics fills in sessionId and processId, any reason we can't do
the same?
* Would be cool to INFO log the configuration settings at startup, this can be pretty handy
* It also seems like an unnecessary step to have to specify the TopAuditLogger in the
conf if a user has already specified {{dfs.namenode.top.periods.min}}. If there are periods set,
let's just create the TopAuditLogger as well.
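
To illustrate the validation suggested in the first bullet, here is a rough, dependency-free sketch. The comma-separated string stands in for the raw config value that Configuration.getInts would parse, and the explicit IllegalArgumentException stands in for a Guava Preconditions.checkArgument call. The class and method names are made up for illustration and are not from the patch.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the patch.
final class TopWindowsSketch {
  /** Parses a comma-separated list of window lengths (minutes) and checks each is > 0. */
  static int[] parseWindowsMinutes(String raw) {
    int[] windows = Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .mapToInt(Integer::parseInt)
        .toArray();
    for (int w : windows) {
      // In the real code this would be a Preconditions.checkArgument call.
      if (w <= 0) {
        throw new IllegalArgumentException(
            "window length must be > 0 minutes, got " + w);
      }
    }
    return windows;
  }
}
```

Failing fast at startup like this means a bad interval shows up as a clear configuration error rather than as odd window behavior later.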

* Should fill out the DummyDescription to not be dummy :)
* Some unused methods: initSingleton(Configuration, String, String), getReportingPeriods,
isTopMetricRecord, extractPeriodFormMetricRecordName. Guessing these are used in the HTML
view, could we remove these for now?
* In getMetrics, would prefer the info log to be debug, and the error to be warn. Error is
normally for something nigh fatal, warn is something we can ride out.
* The username string parsing is a little brittle, would be good to put a little static helper
method in UserGroupInformation next to toString in case someone changes the format.
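
A sketch of the kind of helper meant in the last bullet. It assumes the string has the shape {{alice (auth:SIMPLE)}}, optionally followed by {{ via ...}} for proxy users; that shape is my assumption here, and the class and method names are hypothetical. The point is only that the parsing lives in one place, next to the code that produces the format.

```java
// Hypothetical helper for illustration only; the real one would live in
// UserGroupInformation next to toString.
final class UgiStringSketch {
  // Assumed delimiter between the user name and the auth method.
  private static final String AUTH_MARKER = " (auth:";

  /** Returns the leading user name, or the whole string if no marker is found. */
  static String shortUserName(String ugiString) {
    int idx = ugiString.indexOf(AUTH_MARKER);
    return idx < 0 ? ugiString : ugiString.substring(0, idx);
  }
}
```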

* We can make the asserts in the constructor into Precondition checks, we want these even
in production

* Is it possible to use monotonic time, i.e. Time.monotonicNow, rather than system time? We've
been bitten by clocks moving around before, and I think relative time should work here. I
see we make it into a Date in RollingWindow#getSum, but just for debugging.
* Unused import
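
To make the monotonic time point concrete, here is a minimal sketch using only the JDK. It assumes the window logic only ever needs differences between timestamps, which is what makes a monotonic source usable. The class and method names are hypothetical; in the patch the clock would be Time.monotonicNow.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch for illustration only; not part of the patch.
final class MonotonicWindowSketch {
  /** Milliseconds from a monotonic source. Only differences are meaningful. */
  static long monotonicNowMs() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }

  /** True if an event at eventMs is still inside a window of windowLenMs ending at nowMs. */
  static boolean inWindow(long eventMs, long nowMs, long windowLenMs) {
    // Subtraction keeps this correct even if the raw values are negative.
    return nowMs - eventMs < windowLenMs;
  }
}
```

Since the absolute value has no relation to wall-clock time, it should not be turned into a Date except after translating it back, which matters for the debugging output mentioned above.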

Experiences running it:
* The jmx page right now only exposes the shortest window, I see the {{smallestWindow}} hardcode.
Since the goal is to use jmx to populate the HTML view, we need to somehow expose all of the
configured windows.
* I tried doing some operations, then waited for the window to pass. The counts never returned
to zero. I think this is because in RWManager#snapshot, we have a continue if {{n == 0}},
so never push the 0 count.
* It's also unfortunate that there's no way to remove a metric once we add it, so over time
we'll end up with a lot of garbage {{user.operation}} strings that don't apply to the current
window. Thoughts welcome here.
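
One possible direction for the zero-count problem, as a rough sketch rather than a proposal for the actual RWManager code: remember which keys were published in the previous snapshot, and publish an explicit 0 once for any of them that have no traffic in the current window. All names here are hypothetical. It does not address removing the metric itself from the metrics system, only the stale non-zero value.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for illustration only; not part of the patch.
final class SnapshotSketch {
  // Keys that had a non-zero count in the previous snapshot.
  private final Set<String> previouslyNonZero = new HashSet<>();

  /**
   * Returns the counts to publish: every non-zero count, plus a single
   * explicit 0 for keys that were non-zero last time but are idle now.
   */
  Map<String, Long> snapshot(Map<String, Long> windowCounts) {
    Map<String, Long> out = new HashMap<>();
    for (Map.Entry<String, Long> e : windowCounts.entrySet()) {
      if (e.getValue() > 0) {
        out.put(e.getKey(), e.getValue());
      }
    }
    // Push a 0 for anything that just went idle instead of skipping it.
    for (String key : previouslyNonZero) {
      out.putIfAbsent(key, 0L);
    }
    previouslyNonZero.clear();
    for (Map.Entry<String, Long> e : out.entrySet()) {
      if (e.getValue() > 0) {
        previouslyNonZero.add(e.getKey());
      }
    }
    return out;
  }
}
```

With this, a key such as {{alice.open}} is reported with its count while active, reported once as 0 after its window passes, and then dropped from later snapshots.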

> nntop: top-like tool for name node users
> -----------------------------------------
>                 Key: HDFS-6982
>                 URL: https://issues.apache.org/jira/browse/HDFS-6982
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Maysam Yabandeh
>            Assignee: Maysam Yabandeh
>         Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, HDFS-6982.v4.patch,
HDFS-6982.v5.patch, HDFS-6982.v6.patch, nntop-design-v1.pdf
> In this jira we motivate the need for nntop, a tool that, similarly to what top does
in Linux, gives the list of top users of the HDFS name node and gives insight about which
users are sending the majority of each traffic type to the name node. This information turns out
to be the most critical when the name node is under pressure and the HDFS admin needs to know
which user is hammering the name node and with what kind of requests. Here we present the
design of nntop, which has been in production at Twitter for the past 10 months. nntop proved
to have low cpu overhead (< 2% in a cluster of 4K nodes), a low memory footprint (less than
a few MB), and to be quite efficient for the write path (only two hash lookups for updating a metric).

This message was sent by Atlassian JIRA
