accumulo-dev mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Wed, 12 Oct 2016 20:09:47 GMT
I think SystemPermission.SYSTEM permission should probably be required for
any public API retrieving this data. It is, after all, code run on servers,
generating data directly from the RFiles. This would also imply that
caution is needed if we were to cache the data in, say, the metadata table.

On Wed, Oct 12, 2016 at 3:58 PM Josh Elser <josh.elser@gmail.com> wrote:

> I was envisioning public API protected by a system permission (implying
> some Thrift RPC as well) if that is an important distinction for those
> with concerns. I am hoping to get more info from Mike/Marc about why
> they feel this is insufficient WRT Accumulo's security model.
>
> Keith Turner wrote:
> > We did discuss making this info available through the public API (and
> > adding thrift calls to gather it).   We discussed the possibility of
> > adding a new permission.
> >
> > On Wed, Oct 12, 2016 at 2:35 PM, ivan bella <ivan@ivan.bella.name> wrote:
> >> I do not see how this invalidates any security of the system unless you
> >> are summarizing these counters and making them available through a Thrift
> >> or other call; don't do that unless other security is put in place. To get
> >> a summary I would think you would have to use a separate utility to scrape
> >> the RFiles. This metadata should only be accessible to a system
> >> administrator. The BIG presumption here is that it is significantly faster
> >> to grab this metadata than it is to scan all of the keys in the
> >> RFile.
> >>
> >>
> >>> On October 12, 2016 at 1:41 PM Josh Elser <josh.elser@gmail.com> wrote:
> >>>
> >>> Thanks, Marc. Follow-on question(s) for you:
> >>>
> >>> Do you think _any_ such approach should never be pursued by Accumulo
> >>> (reading into your other replies about doing it outside of Accumulo)?
> >>> Are the permissions that we have in place not sufficient to protect such
> >>> "metadata"?
> >>>
> >>> Or, would such a feature be "OK" to you if it required some degree of
> >>> additional manual steps by the administrator? (if so, what steps do you
> >>> think make this acceptable)
> >>>
> >>> In a similar vein, how do you see this broadening the scope of the
> >>> Accumulo security model in an invalid manner? e.g. Administrators should
> >>> never be able to see such information. Someone with sufficient access to
> >>> a system would already be able to bypass Accumulo's security mechanisms.
> >>> There are a number of vectors already where a sufficiently-credentialed
> >>> individual could figure out this information (and more).
> >>>
> >>> Ultimately, I see Accumulo's main security tenet as "users should never
> >>> be allowed to see more data than they are authorized to see". Maybe it's
> >>> my interpretation of that or the scope of how you think the proposed
> >>> feature would function, but I'd be very interested in hearing more about
> >>> what you think.
> >>>
> >>> Marc P. wrote:
> >>>
> >>>> My point for discussing implementation outside of Accumulo is because I
> >>>> think it does invalidate a core tenet
> >>>>
> >>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser <josh.elser@gmail.com> wrote:
> >>>>
> >>>>> Again, can we please bring this discussion back from discussions of
> >>>>> implementations to security?
> >>>>>
> >>>>> Does the fact that you three were discussing implementations imply that
> >>>>> you do not think this invalidates one of the core tenets (security
> >>>>> first) of Accumulo?
> >>>>>
> >>>>> Christopher wrote:
> >>>>>
> >>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
> >>>>>> hackathon after the Accumulo Summit, and I think our consensus was
> >>>>>> basically this:
> >>>>>>
> >>>>>> We need a generic pluggable mechanism for injecting arbitrary user
> >>>>>> counters into the RFiles. We can then use these counters in custom
> >>>>>> compaction strategies, or other analysis. We can aggregate these
> >>>>>> counters at the tablet and table levels, and expose them in the API.
> >>>>>>
> >>>>>> These counters could store information about visibility frequencies,
> >>>>>> number of delete entries, etc.
> >>>>>>
> >>>>>> The interface might just be a
> >>>>>> Function<Entry<Key,Value>,Map<String,Long>>.
> >>>>>> In the discussion, there were lots of variations on the theme, though.
> >>>>>> So, the actual implementation could vary. But, having something like
> >>>>>> this could support a large number of use cases beyond just the
> >>>>>> histogram case.
> >>>>>>
> >>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <josh.elser@gmail.com> wrote:
> >>>>>>
> >>>>>>> Trivially. We could do something more intelligent like also cache it in
> >>>>>>> metadata (updating with compactions). Don't read too much into the
> >>>>>>> implementation at this point; it was just the first idea I had about
> >>>>>>> how we could do it :). I'm more concerned with the idea and its
> >>>>>>> security implications right now.
> >>>>>>>
> >>>>>>> In general, it seems like people are ok with it protected by a new
> >>>>>>> permission role. Do you have more to add, Mike? Was your comment based
> >>>>>>> on your interpretation of how Accumulo works or more a concern about
> >>>>>>> implementing such a feature?
> >>>>>>>
> >>>>>>> On Oct 11, 2016 21:29, <dlmarion@comcast.net> wrote:
> >>>>>>>
> >>>>>>>> So, to get the set of visibilities used in a table, we would have to
> >>>>>>>> open all of the RFiles?
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> >>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
> >>>>>>>>> To: Accumulo Dev List
> >>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
> >>>>>>>>>
> >>>>>>>>> Interesting idea. It begs the question: should we allow any custom
> >>>>>>>>> index at the RFile level? If RFile indexes were user-extensible, then
> >>>>>>>>> a visibility index would be something any developer could write. That
> >>>>>>>>> said, we can still include such an index as an example, and if we did
> >>>>>>>>> it could be used by the Accumulo monitor.
> >>>>>>>>>
> >>>>>>>>> The RFile-level sampling followed this path. I would support further
> >>>>>>>>> work similar to it, though I admit I don't know how difficult a job it
> >>>>>>>>> entails. Bonus points if the index information could be accessed from
> >>>>>>>>> iterators the same way that sampled data can.
> >>>>>>>>>
> >>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
> >>>>>>>>> monitor *by default*, but it would be a strictly useful feature if it
> >>>>>>>>> could be enabled via a conf option.
> >>>>>>>>>
> >>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <josh.elser@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
> >>>>>>>>>> he mentioned was the lack of insight into the distribution of data
> >>>>>>>>>> marked with certain visibilities in a table. He presented an example
> >>>>>>>>>> similar to this:
> >>>>>>>>>>
> >>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
> >>>>>>>>>> information. There are three labels in the system: PRIVATE,
> >>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
> >>>>>>>>>> be considered to identify the individual. ANONYMIZED data is some
> >>>>>>>>>> altered version of the attribute that retains some portion of the
> >>>>>>>>>> original value, but is missing enough context to not identify the
> >>>>>>>>>> individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
> >>>>>>>>>> data is for attributes which cannot identify the individual.
> >>>>>>>>>>
> >>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
> >>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
> >>>>>>>>>> question: how much of each kind of data is in the system? Without
> >>>>>>>>>> knowing how much data is in the system, how can some application
> >>>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
> >>>>>>>>>> data) know that their application is returning a reasonably correct
> >>>>>>>>>> amount of data? (there are many examples of questions which could be
> >>>>>>>>>> answered on this data alone)
> >>>>>>>>>>
> >>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
> >>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >>>>>>>>>>
> >>>>>>>>>> PRIVATE: 50
> >>>>>>>>>> ANONYMIZED: 50
> >>>>>>>>>> PUBLIC: 20
> >>>>>>>>>>
> >>>>>>>>>> Technically, I think this would actually be relatively simple to
> >>>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
> >>>>>>>>>> the visibilities observed in that file. This would allow us to very
> >>>>>>>>>> easily report how much data in each table has each visibility label.
> >>>>>>>>>>
> >>>>>>>>>> However, would this feature be harmful to one of the core tenets of
> >>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
> >>>>>>>>>> a certain visibility acceptable? Would a new permission to use such
> >>>>>>>>>> an API to access this information be sufficient to protect the data?
> >>>>>>>>>>
> >>>>>>>>>> - Josh
>
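
For readers following the thread, here is a minimal Java sketch of the counter idea discussed above: a per-entry function in the shape Christopher mentions, Function<Entry<Key,Value>,Map<String,Long>>, folded over each file and then merged across files. It is only an illustration under stated assumptions, not Accumulo's implementation. The Key and Value classes are hypothetical stand-ins for the real Accumulo types, and the names VisibilityHistogramSketch, summarizeFile, mergeSummaries, and fakeFile are invented for this example. The permission checks the thread argues for are not modeled.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

public class VisibilityHistogramSketch {

    // Hypothetical stand-ins for Accumulo's Key and Value, reduced to the
    // fields this sketch needs. They are NOT the real Accumulo classes.
    public static final class Key {
        final String row;
        final String visibility;
        Key(String row, String visibility) {
            this.row = row;
            this.visibility = visibility;
        }
    }

    public static final class Value {
        final byte[] bytes;
        Value(byte[] bytes) {
            this.bytes = bytes;
        }
    }

    // A counter in the shape floated in the thread: each entry contributes
    // named counts. This one adds 1 under the entry's visibility label.
    public static final Function<Entry<Key, Value>, Map<String, Long>> VISIBILITY_COUNTER =
        entry -> Map.of(entry.getKey().visibility, 1L);

    // Folds a counter over the entries of one file, producing the summary
    // that would be stored alongside that file.
    public static Map<String, Long> summarizeFile(
            List<Entry<Key, Value>> entries,
            Function<Entry<Key, Value>, Map<String, Long>> counter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry<Key, Value> entry : entries) {
            counter.apply(entry).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    // Merges per-file summaries into one tablet-level or table-level summary.
    public static Map<String, Long> mergeSummaries(List<Map<String, Long>> summaries) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> summary : summaries) {
            summary.forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    // Builds n fake entries that all carry the same visibility label.
    public static List<Entry<Key, Value>> fakeFile(String visibility, int n) {
        List<Entry<Key, Value>> entries = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            entries.add(new SimpleImmutableEntry<>(
                new Key("row" + i, visibility), new Value(new byte[0])));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Two "files": one all PRIVATE, one mixing ANONYMIZED and PUBLIC.
        List<Entry<Key, Value>> fileA = fakeFile("PRIVATE", 50);
        List<Entry<Key, Value>> fileB = fakeFile("ANONYMIZED", 50);
        fileB.addAll(fakeFile("PUBLIC", 20));

        Map<String, Long> table = mergeSummaries(List.of(
            summarizeFile(fileA, VISIBILITY_COUNTER),
            summarizeFile(fileB, VISIBILITY_COUNTER)));

        System.out.println(table); // {ANONYMIZED=50, PRIVATE=50, PUBLIC=20}
    }
}
```

Because the summaries are plain maps of sums, merging them is associative. That is what would let per-file counters roll up to tablet and table level without rescanning keys, which is the presumption Ivan raises about reading metadata being faster than scanning.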
