hbase-dev mailing list archives

From Peter Somogyi <psomo...@apache.org>
Subject Re: [DISCUSS] Gathering metrics on HBase versions in use
Date Thu, 15 Nov 2018 08:48:38 GMT
I like the idea of having some sort of metrics from the users.

I agree with Allan that in many cases an HBase cluster runs on an internal
network, making data collection difficult or even impossible. This could
give us a skewed view if these generally bigger clusters do not appear in
the metrics, while the data is full of stats from standalone test
environments that were started once and never again.

Collecting download information could give us a better picture, but in
those statistics the latest version might be overrepresented, and we won't
know which releases are actually in use in the field.

What do you think about collecting page views of the Reference Guide tied
to specific releases? Someone searching the 1.4 Ref Guide is probably
using HBase 1.4 or in the process of setting it up.

Thanks,
Peter

On Thu, Nov 15, 2018 at 4:56 AM 张铎(Duo Zhang) <palomino219@gmail.com> wrote:

> +1 on collecting the download information.
>
> And collecting data at startup is a bit dangerous, I'd say, both
> technically and legally...
>
> Maybe a possible way is to add a link on the Master status page, or some
> ASCII art in the Master startup log, to guide people to our survey?
>
> Allan Yang <allan163@apache.org> wrote on Thu, Nov 15, 2018 at 11:23 AM:
>
> > I also think gathering metrics on the downloads from Apache/archives is
> > a doable action. Most HBase clusters run on users' intranets with no
> > public access, so sending anonymous data from them may not be possible.
> > We would also need to find a way to obtain their authorization, I
> > think...
> > Best Regards
> > Allan Yang
> >
> > Zach York <zyork.contribution@gmail.com> wrote on Thu, Nov 15, 2018 at 5:35 AM:
> >
> > > Can we have metrics around the downloads from Apache/archives? I'm
> > > not sure how that is all set up, but it might be a low-cost way to get
> > > some metrics.
> > >
> > > On Wed, Nov 14, 2018, 12:12 PM Andrew Purtell <apurtell@apache.org> wrote:
> > >
> > > > While it seems you are proposing some kind of autonomous ongoing
> > > > usage metrics collection, please note I ran an anonymous version
> > > > usage survey via SurveyMonkey for 1.x last year. It was opt-in and
> > > > there were no PII concerns by its nature. All of the issues around
> > > > data collection, storage, and processing were also handled (by
> > > > SurveyMonkey). Unfortunately I recently cancelled my account.
> > > >
> > > > For occasional surveys something like that might work. Otherwise
> > > > there are a ton of questions: How do we generate the data? How do we
> > > > get per-site opt-in permission? How do we collect the data? Store it?
> > > > Process it? Audit it? Seems more trouble than it's worth and requires
> > > > ongoing volunteer hosting and effort to maintain.
> > > >
> > > >
> > > > On Wed, Nov 14, 2018 at 11:47 AM Misty Linville <misty@apache.org> wrote:
> > > >
> > > > > When discussing the 2.0.x branch in another thread, it came up
> > > > > that we don’t have a good way to understand the version skew of
> > > > > HBase across the user base. Metrics gathering can be tricky. You
> > > > > don’t want to capture personally identifiable information (PII) and
> > > > > you need to be transparent about what you gather, for what purpose,
> > > > > how long the data will be retained, etc. The data can also be
> > > > > sensitive, for instance if a large number of installations are
> > > > > running a version with a CVE or known vulnerability against it. If
> > > > > you gather metrics, it really needs to be opt-out rather than
> > > > > opt-in so that you actually get a reasonable amount of data. You
> > > > > also need to stand up some kind of metrics-gathering service and
> > > > > run it somewhere, and some kind of reporting / visualization
> > > > > tooling. The flip side of all these difficulties is a more
> > > > > intelligent way to decide when to retire a branch or when to
> > > > > communicate more broadly / loudly asking people in a certain
> > > > > version stream to upgrade, as well as where to concentrate our
> > > > > efforts.
> > > > >
> > > > > I’m not sticking my hand up to implement such a monster. I only
> > > > > wanted to open a discussion and see what y’all think. It seems to
> > > > > me that a few must-haves are:
> > > > >
> > > > > - Transparency: Release notes, logging about the status of
> > > > > metrics-gathering (on or off) at master or RS start-up, logging
> > > > > about exactly when and what metrics are sent
> > > > > - Low frequency: Would we really need to wake up and send metrics
> > > > > more often than weekly?
> > > > > - Conservative approach: Only collect what we can find useful
> > > > > today, don’t collect the world.
> > > > > - Minimize PII: This probably means not trying to group together
> > > > > time-series results for a given server or cluster at all, but it
> > > > > could make the data look like there are a lot more clusters running
> > > > > in the world than there really are.
> > > > > - Who has access to the data? Do we make it public or limit access
> > > > > to the PMC? Making it public would bolster our discipline about
> > > > > transparency and minimizing PII.
> > > > >
> > > > > I’m sure I’m missing a ton so I leave the discussion to y’all.
> > > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Andrew
> > > >
> > > > Words like orphans lost among the crosstalk, meaning torn from
> > > > truth's decrepit hands
> > > >    - A23, Crosstalk
