accumulo-dev mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Wed, 12 Oct 2016 17:04:47 GMT
On Wed, Oct 12, 2016 at 10:40 AM, ivan bella <ivan@ivan.bella.name> wrote:
> Yes the "owners" could create a visibility counting mechanism separately, however if
> we make this RFile metadata a part of the system then we increase the "ease of use".
> Unfortunately, system designers rarely think about the metadata they need from their
> system up front. That being said, if the performance impact of this is significant then
> it needs to be made optional or we leave it as is.

We started discussing a generalized counting mechanism, as Christopher
mentioned.  The counts would be created every time a file is compacted, so
it would help with the issue of having to decide up front.  Retrieving
counts would have to open all RFiles in the range.  This information could
be cached in the tserver index cache.
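
As a rough illustration of the per-file counting idea, here is a minimal
sketch. It is not Accumulo code: plain Strings stand in for Key and Value
(the key is reduced to its visibility label), and the names
FileCounterSketch, VISIBILITY_COUNTER and countFile are made up for this
example. It only shows a counter function being applied to every entry
written during a compaction, with the totals being what would be stored in
the new file.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

// Sketch only. String stands in for Accumulo's Key (reduced to its
// visibility label) and Value, so this compiles without Accumulo.
class FileCounterSketch {

  // Hypothetical counter: one entry in, a map of named increments out.
  static final Function<Entry<String,String>,Map<String,Long>> VISIBILITY_COUNTER =
      e -> Map.of("vis:" + e.getKey(), 1L);

  // Applies the counter to every entry of a (simulated) compaction output
  // and sums the increments per counter name.
  static Map<String,Long> countFile(Iterable<Entry<String,String>> entries,
      Function<Entry<String,String>,Map<String,Long>> counter) {
    Map<String,Long> totals = new TreeMap<>();
    for (Entry<String,String> e : entries) {
      counter.apply(e).forEach((name, n) -> totals.merge(name, n, Long::sum));
    }
    return totals;
  }

  public static void main(String[] args) {
    List<Entry<String,String>> entries = List.of(
        Map.entry("PRIVATE", "name=Josh Elser"),
        Map.entry("ANONYMIZED", "name=J E"),
        Map.entry("PRIVATE", "dob=..."),
        Map.entry("PUBLIC", "country=US"));
    // Prints the per-visibility totals for this one simulated file.
    System.out.println(countFile(entries, VISIBILITY_COUNTER));
  }
}
```

Because the totals are computed while the file is being written, nothing has
to be decided before the data is ingested; changing the counter only takes
effect as files get recompacted.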

Below are some of the things discussed.

 * The class (and its config) used to generate the counts is stored in the RFile
 * When a user requests counts, they must specify the class and config
they expect the counts to have been generated with.
 * Must decide behaviour when a tablet has RFiles with counts
generated in different ways.  Could error or return partial results.
 * Must decide behaviour when there are too many counters.  Could cap
the number of counters. When an RFile has capped counters, could
either error or return partial results.
 * If partial results are returned, then API must provide way to
indicate this to user.
 * Need to decide if counters should be maintained for data in memory.
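
To make the points above a bit more concrete, here is a small sketch of how
per-file counts could be combined for a tablet. Everything in it is an
assumption for illustration: FileCounts, Result, merge and the "generator"
string (standing in for class plus config) are invented names, not an
existing Accumulo API. The sketch picks the "return partial results" option;
throwing an error instead would be the other choice discussed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: hypothetical types, not part of Accumulo.
class CounterMergeSketch {

  // Counts stored in one RFile, tagged with how they were generated.
  static final class FileCounts {
    final String generator;          // stands in for class + config
    final Map<String,Long> counts;
    final boolean capped;            // true if the counter cap was hit
    FileCounts(String generator, Map<String,Long> counts, boolean capped) {
      this.generator = generator;
      this.counts = counts;
      this.capped = capped;
    }
  }

  // Merged counts plus a flag telling the caller they may be incomplete.
  static final class Result {
    final Map<String,Long> counts;
    final boolean partial;
    Result(Map<String,Long> counts, boolean partial) {
      this.counts = counts;
      this.partial = partial;
    }
  }

  // Sums counts from files generated the way the caller expects. Files
  // generated differently are skipped; skipped or capped files mark the
  // result as partial.
  static Result merge(List<FileCounts> files, String expectedGenerator) {
    Map<String,Long> merged = new TreeMap<>();
    boolean partial = false;
    for (FileCounts f : files) {
      if (!f.generator.equals(expectedGenerator)) {
        partial = true;              // do not mix differently generated counts
        continue;
      }
      if (f.capped) {
        partial = true;              // this file dropped some counters
      }
      f.counts.forEach((name, n) -> merged.merge(name, n, Long::sum));
    }
    return new Result(merged, partial);
  }

  public static void main(String[] args) {
    Result r = merge(List.of(
        new FileCounts("vis-v1", Map.of("PRIVATE", 50L, "PUBLIC", 20L), false),
        new FileCounts("other", Map.of("deletes", 7L), false)), "vis-v1");
    System.out.println(r.counts + " partial=" + r.partial);
  }
}
```

The partial flag is the simplest way to satisfy the "API must provide a way
to indicate this to the user" point; a richer API could report which files
were skipped or capped.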

>
>> On October 12, 2016 at 7:12 AM "Marc P." <marc.parisi@gmail.com> wrote:
>>
>> What prevents the owners of the system from doing this in their own table?
>> Keeping track of that information is a use case of Accumulo. I think this
>> may be an example of external code that the user must install. Placing the
>> onus on the consumer mitigates concern that Mike "Mike" Drob and others may
>> have.
>>
>> A new role wouldn't be needed if permissions were placed on the
>> user/table/namespace that stored this information, correct?
>>
>> On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ctubbsii@apache.org> wrote:
>>
>> > Keith, Russ, myself (and possibly others) were discussing this at the
>> > hackathon after the Accumulo Summit, and I think our consensus was
>> > basically this:
>> >
>> > We need a generic pluggable mechanism for injecting arbitrary user counters
>> > into the RFiles. We can then use these counters in custom compaction
>> > strategies, or other analysis. We can aggregate these counters at the
>> > tablet, and table levels, and expose them in the API.
>> >
>> > These counters could store information about visibility frequencies, number
>> > of delete entries, etc.
>> >
>> > The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.
>> >
>> > In the discussion, there were lots of variations on the theme, though. So,
>> > the actual implementation could vary. But, having something like this could
>> > support a large number of use cases beyond just the histogram case.
>> >
>> > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <josh.elser@gmail.com> wrote:
>> >
>> > > Trivially. We could do something more intelligent like also cache it in
>> > > metadata (updating with compactions). Don't read too much into the
>> > > implementation at this point; it was just the first idea I had about how
>> > > we could do it :). I'm more concerned with the idea and its security
>> > > implications right now.
>> > >
>> > > In general, it seems like people are ok with it protected by a new
>> > > permission role. Do you have more to add, Mike? Was your comment based
>> > > on your interpretation of how Accumulo works or more a concern about
>> > > implementing such a feature?
>> > >
>> > > On Oct 11, 2016 21:29, <dlmarion@comcast.net> wrote:
>> > >
>> > > > So, to get the set of visibilities used in a table, we would have to
>> > > > open all of the rfiles?
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>> > > > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > > > To: Accumulo Dev List
>> > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>> > > > > harmful?
>> > > > >
>> > > > > Interesting idea. It begs the question: should we allow any custom
>> > > > > index at the RFile level? If RFile indexes were user-extensible, then
>> > > > > a visibility index would be something any developer could write. That
>> > > > > said, we can still include such an index as an example, and if we did
>> > > > > it could be used by the Accumulo monitor.
>> > > > >
>> > > > > The RFile-level sampling followed this path. I would support further
>> > > > > work similar to it, though I admit I don't know how difficult a job
>> > > > > it entails. Bonus points if the index information could be accessed
>> > > > > from iterators the same way that sampled data can.
>> > > > >
>> > > > > I can't speak to the appropriateness of visibility histograms on the
>> > > > > monitor *by default*, but it would be a strictly useful feature if it
>> > > > > could be enabled via a conf option.
>> > > > >
>> > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <josh.elser@gmail.com> wrote:
>> > > > >
>> > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>> > > > > > he mentioned was the lack of insight into the distribution of data
>> > > > > > marked with certain visibilities in a table. He presented an example
>> > > > > > similar to this:
>> > > > > >
>> > > > > > Imagine a hypothetical system backed by Accumulo which stores medical
>> > > > > > information. There are three labels in the system: PRIVATE,
>> > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>> > > > > > be considered to identify the individual. ANONYMIZED data is some
>> > > > > > altered version of the attribute that retains some portion of the
>> > > > > > original value, but is missing enough context to not identify the
>> > > > > > individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>> > > > > > data is for attributes which cannot identify the individual.
>> > > > > >
>> > > > > > Doctors would be able to read the PRIVATE data, while researchers
>> > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
>> > > > > > question: how much of each kind of data is in the system? Without
>> > > > > > knowing how much data is in the system, how can some application
>> > > > > > developer (who does not have the ability to read all of the PRIVATE
>> > > > > > data) know that their application is returning a reasonably
>> > > > > > correct amount of data? (There are many examples of questions which
>> > > > > > could be answered on this data alone.)
>> > > > > >
>> > > > > > Concretely, this histogram would look like (50 records with PRIVATE,
>> > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>> > > > > >
>> > > > > > PRIVATE: 50
>> > > > > > ANONYMIZED: 50
>> > > > > > PUBLIC: 20
>> > > > > >
>> > > > > > Technically, I think this would actually be relatively simple to
>> > > > > > implement. Inside of each RFile, we could maintain some histogram
>> > > > > > of the visibilities observed in that file. This would allow us to
>> > > > > > very easily report how much data in each table has each visibility
>> > > > > > label.
>> > > > > >
>> > > > > > However, would this feature be harmful to one of the core tenets
>> > > > > > of Accumulo? Or, is acknowledging the existence of data in Accumulo
>> > > > > > with a certain visibility acceptable? Would a new permission to use
>> > > > > > such an API to access this information be sufficient to protect the
>> > > > > > data?
>> > > > > >
>> > > > > > - Josh
