accumulo-dev mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Wed, 12 Oct 2016 19:38:55 GMT
We did discuss making this info available through the public API (and
adding thrift calls to gather it).   We discussed the possibility of
adding a new permission.

On Wed, Oct 12, 2016 at 2:35 PM, ivan bella <ivan@ivan.bella.name> wrote:
> I do not see how this invalidates any security of the system unless you are summarizing
> these counters and making them available through a thrift or other call; don't do that unless
> other security is put in place.  To get a summary I would think you would have to use a separate
> utility to scrape the rfiles.  This metadata should only be accessible to a system administrator.
> The BIG presumption here is that it is significantly faster to grab this metadata out
> than it is to scan all of the keys in the rfile.
>
>
>> On October 12, 2016 at 1:41 PM Josh Elser <josh.elser@gmail.com> wrote:
>>
>> Thanks, Marc. Follow-on question(s) for you:
>>
>> Do you think _any_ such approach should never be pursued by Accumulo
>> (reading into your other replies about doing it outside of Accumulo)?
>> Are the permissions that we have in place not sufficient to protect such
>> "metadata"?
>>
>> Or, would such a feature be "OK" to you if it required some degree of
>> additional manual steps by the administrator? (if so, what steps do you
>> think make this acceptable)
>>
>> In a similar vein, how do you see this broadening the scope of the
>> Accumulo security model in an invalid manner? e.g. Administrators should
>> never be able to see such information. Someone with sufficient access to
>> a system would already be able to bypass Accumulo's security mechanisms.
>> There are a number of vectors already where a sufficiently-credentialed
>> individual could figure out this information (and more).
>>
>> Ultimately, I see Accumulo's main security tenet as "users should never
>> be allowed to see more data than they are authorized to see". Maybe it's
>> my interpretation of that or the scope of how you think the proposed
>> feature would function, but I'd be very interested in hearing more about
>> what you think.
>>
>> Marc P. wrote:
>>
>> > My point for discussing implementation outside of accumulo is because I
>> > think it does invalidate a core tenet.
>> >
>> > On Wed, Oct 12, 2016, 12:26 PM Josh Elser<josh.elser@gmail.com> wrote:
>> >
>> > > Again, can we please bring this discussion back from discussions of
>> > > implementations to security?
>> > >
>> > > Does the fact that you three were discussing implementations imply that
>> > > you do not think this invalidates one of the core tenets (security
>> > > first) of Accumulo?
>> > >
>> > > Christopher wrote:
>> > >
>> > > > Keith, Russ, myself (and possibly others) were discussing this at the
>> > > > hackathon after the Accumulo Summit, and I think our consensus was
>> > > > basically this:
>> > > >
>> > > > We need a generic pluggable mechanism for injecting arbitrary user counters
>> > > > into the RFiles. We can then use these counters in custom compaction
>> > > > strategies, or other analysis. We can aggregate these counters at the
>> > > > tablet, and table levels, and expose them in the API.
>> > > >
>> > > > These counters could store information about visibility frequencies, number
>> > > > of delete entries, etc.
>> > > >
>> > > > The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.
>> > > > In the discussion, there were lots of variations on the theme, though. So,
>> > > > the actual implementation could vary. But, having something like this could
>> > > > support a large number of use cases beyond just the histogram case.
>> > > >
>> > > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <josh.elser@gmail.com> wrote:
>> > > >
>> > > > > Trivially. We could do something more intelligent like also cache it in
>> > > > > metadata (updating with compactions). Don't read too much into the
>> > > > > implementation at this point; it was just the first idea I had about how we
>> > > > > could do it :). I'm more concerned with the idea and its security
>> > > > > implications right now.
>> > > > >
>> > > > > In general, it seems like people are ok with it protected by a new
>> > > > > permission role. Do you have more to add, Mike? Was your comment based on
>> > > > > your interpretation of how Accumulo works or more a concern about
>> > > > > implementing such a feature?
>> > > > >
>> > > > > On Oct 11, 2016 21:29, <dlmarion@comcast.net> wrote:
>> > > > >
>> > > > > > So, to get the set of visibilities used in a table, we would have to
>> > > > > > open all of the rfiles?
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>> > > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > > > > > To: Accumulo Dev List
>> > > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>> > > > > > >
>> > > > > > > Interesting idea. It begs the question: should we allow any custom index at
>> > > > > > > the RFile level? If RFile indexes were user-extensible, then a visibility index
>> > > > > > > would be something any developer could write. That said, we can still
>> > > > > > > include such an index as an example, and if we did it could be used by the
>> > > > > > > Accumulo monitor.
>> > > > > > >
>> > > > > > > The RFile-level sampling followed this path. I would support further work
>> > > > > > > similar to it, though I admit I don't know how difficult a job it entails.
>> > > > > > > Bonus points if the index information could be accessed from iterators the
>> > > > > > > same way that sampled data can.
>> > > > > > >
>> > > > > > > I can't speak to the appropriateness of visibility histograms on the monitor
>> > > > > > > *by default*, but it would be a strictly useful feature if it could be enabled via
>> > > > > > > a conf option.
>> > > > > > >
>> > > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <josh.elser@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
>> > > > > > > > mentioned was the lack of insight into the distribution of data marked
>> > > > > > > > with certain visibilities in a table. He presented an example similar
>> > > > > > > > to this:
>> > > > > > > >
>> > > > > > > > Imagine a hypothetical system backed by Accumulo which stores medical
>> > > > > > > > information. There are three labels in the system: PRIVATE,
>> > > > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
>> > > > > > > > considered to identify the individual. ANONYMIZED data is some altered
>> > > > > > > > version of the attribute that retains some portion of the original
>> > > > > > > > value, but is missing enough context to not identify the individual
>> > > > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>> > > > > > > > attributes which cannot identify the individual.
>> > > > > > > >
>> > > > > > > > Doctors would be able to read the PRIVATE data, while researchers
>> > > > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
>> > > > > > > > question: how much of each kind of data is in the system? Without
>> > > > > > > > knowing how much data is in the system, how can some application
>> > > > > > > > developer (who does not have the ability to read all of the PRIVATE
>> > > > > > > > data) know that their application is returning a reasonably correct
>> > > > > > > > amount of data? (There are many examples of questions which could be
>> > > > > > > > answered on this data alone.)
>> > > > > > > >
>> > > > > > > > Concretely, this histogram would look like (50 records with PRIVATE,
>> > > > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>> > > > > > > >
>> > > > > > > > PRIVATE: 50
>> > > > > > > > ANONYMIZED: 50
>> > > > > > > > PUBLIC: 20
>> > > > > > > >
>> > > > > > > > Technically, I think this would actually be relatively simple to
>> > > > > > > > implement. Inside of each RFile, we could maintain some histogram of
>> > > > > > > > the visibilities observed in that file. This would allow us to very
>> > > > > > > > easily report how much data in each table has each visibility label.
>> > > > > > > >
>> > > > > > > > However, would this feature be harmful to one of the core tenets of
>> > > > > > > > Accumulo? Or, is acknowledging the existence of data in Accumulo with
>> > > > > > > > a certain visibility acceptable? Would a new permission to use such an
>> > > > > > > > API to access this information be sufficient to protect the data?
>> > > > > > > >
>> > > > > > > > - Josh
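
For readers following the thread, the two ideas quoted above (a histogram of visibilities kept per RFile, and a counter function shaped like Function<Entry<Key,Value>,Map<String,Long>> whose results are aggregated up to the tablet and table level) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Accumulo code and not what anyone in the thread implemented: plain strings stand in for Key and Value, in-memory lists stand in for RFiles, and every class, field, and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class VisibilityHistogramSketch {

    // Hypothetical counter function. Each entry is (visibility label, value);
    // a real version would take Accumulo's Entry<Key,Value> instead.
    static final Function<Map.Entry<String, String>, Map<String, Long>> VISIBILITY_COUNTER =
            entry -> Map.of(entry.getKey(), 1L);

    // Build the counters for one "file" by summing what the function emits per entry.
    static Map<String, Long> summarizeFile(
            List<Map.Entry<String, String>> entries,
            Function<Map.Entry<String, String>, Map<String, Long>> counter) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, String> e : entries) {
            counter.apply(e).forEach((name, n) -> counts.merge(name, n, Long::sum));
        }
        return counts;
    }

    // Aggregate per-file counters, e.g. up to the tablet or table level.
    static Map<String, Long> mergeSummaries(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> summary : perFile) {
            summary.forEach((name, n) -> total.merge(name, n, Long::sum));
        }
        return total;
    }

    // Test helper: n entries that all carry the same visibility label.
    static List<Map.Entry<String, String>> entries(String label, int n) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(Map.entry(label, "value-" + i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two stand-in files that together hold the 120 records from the example.
        List<Map.Entry<String, String>> fileA = new ArrayList<>(entries("PRIVATE", 30));
        fileA.addAll(entries("PUBLIC", 20));
        List<Map.Entry<String, String>> fileB = new ArrayList<>(entries("PRIVATE", 20));
        fileB.addAll(entries("ANONYMIZED", 50));

        Map<String, Long> table = mergeSummaries(List.of(
                summarizeFile(fileA, VISIBILITY_COUNTER),
                summarizeFile(fileB, VISIBILITY_COUNTER)));

        System.out.println(table); // {ANONYMIZED=50, PRIVATE=50, PUBLIC=20}
    }
}
```

Because the per-file maps merge by simple addition, a table-level answer would only need the stored summaries rather than a scan of every key. The merged map also names every visibility label in use, which is exactly the information the thread asks whether a new permission should guard.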
