accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Wed, 12 Oct 2016 19:57:22 GMT
I was envisioning a public API protected by a system permission (implying
some Thrift RPC as well), if that is an important distinction for those
with concerns. I am hoping to get more info from Mike/Marc about why
they feel this is insufficient WRT Accumulo's security model.

Keith Turner wrote:
> We did discuss making this info available through the public API (and
> adding Thrift calls to gather it). We discussed the possibility of
> adding a new permission.
>
> On Wed, Oct 12, 2016 at 2:35 PM, ivan bella<ivan@ivan.bella.name>  wrote:
>> I do not see how this invalidates any security of the system unless you are
>> summarizing these counters and making them available through a thrift or
>> other call; don't do that unless other security is put in place. To get a
>> summary, I would think you would have to use a separate utility to scrape
>> the rfiles. This metadata should only be accessible to a system
>> administrator. The BIG presumption here is that it is significantly faster
>> to grab this metadata than it is to scan all of the keys in the rfile.
>>
>>
>>> On October 12, 2016 at 1:41 PM Josh Elser<josh.elser@gmail.com>  wrote:
>>>
>>> Thanks, Marc. Follow-on question(s) for you:
>>>
>>> Do you think _any_ such approach should never be pursued by Accumulo
>>> (reading into your other replies about doing it outside of Accumulo)?
>>> Are the permissions that we have in place not sufficient to protect such
>>> "metadata"?
>>>
>>> Or, would such a feature be "OK" to you if it required some degree of
>>> additional manual steps by the administrator? (if so, what steps do you
>>> think make this acceptable)
>>>
>>> In a similar vein, how do you see this broadening the scope of the
>>> Accumulo security model in an invalid manner? e.g. Administrators should
>>> never be able to see such information. Someone with sufficient access to
>>> a system would already be able to bypass Accumulo's security mechanisms.
>>> There are a number of vectors already where a sufficiently-credentialed
>>> individual could figure out this information (and more).
>>>
>>> Ultimately, I see Accumulo's main security tenet as "users should never
>>> be allowed to see more data than they are authorized to see". Maybe it's
>>> my interpretation of that or the scope of how you think the proposed
>>> feature would function, but I'd be very interested in hearing more about
>>> what you think.
>>>
>>> Marc P. wrote:
>>>
>>>> My point for discussing implementation outside of accumulo is because I
>>>> think it does invalidate a core tenet.
>>>>
>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<josh.elser@gmail.com>  wrote:
>>>>
>>>>> Again, can we please bring this discussion back from discussions of
>>>>> implementations to security?
>>>>>
>>>>> Does the fact that you three were discussing implementations imply that
>>>>> you do not think this invalidates one of the core tenets (security
>>>>> first) of Accumulo?
>>>>>
>>>>> Christopher wrote:
>>>>>
>>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>>>> basically this:
>>>>>>
>>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>> counters into the RFiles. We can then use these counters in custom
>>>>>> compaction strategies, or other analysis. We can aggregate these
>>>>>> counters at the tablet and table levels, and expose them in the API.
>>>>>>
>>>>>> These counters could store information about visibility frequencies,
>>>>>> number of delete entries, etc.
>>>>>>
>>>>>> The interface might just be a
>>>>>> Function<Entry<Key,Value>,Map<String,Long>>. In the discussion, there
>>>>>> were lots of variations on the theme, though. So, the actual
>>>>>> implementation could vary. But, having something like this could
>>>>>> support a large number of use cases beyond just the histogram case.
>>>>>>
>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<josh.elser@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Trivially. We could do something more intelligent like also cache it
>>>>>>> in metadata (updating with compactions). Don't read too much into the
>>>>>>> implementation at this point; it was just the first idea I had about
>>>>>>> how we could do it :). I'm more concerned with the idea and its
>>>>>>> security implications right now.
>>>>>>>
>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>> permission role. Do you have more to add, Mike? Was your comment based
>>>>>>> on your interpretation of how Accumulo works or more a concern about
>>>>>>> implementing such a feature?
>>>>>>>
>>>>>>> On Oct 11, 2016 21:29,<dlmarion@comcast.net>  wrote:
>>>>>>>
>>>>>>>> So, to get the set of visibilities used in a table, we would have to
>>>>>>>> open all of the rfiles?
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>> To: Accumulo Dev List
>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>>>>>>>>>
>>>>>>>>> Interesting idea. It raises the question: should we allow any custom
>>>>>>>>> index at the RFile level? If RFile indexes were user-extensible, then
>>>>>>>>> a visibility index would be something any developer could write. That
>>>>>>>>> said, we can still include such an index as an example, and if we did
>>>>>>>>> it could be used by the Accumulo monitor.
>>>>>>>>>
>>>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>>>>> work similar to it, though I admit I don't know how difficult a job
>>>>>>>>> it entails. Bonus points if the index information could be accessed
>>>>>>>>> from iterators the same way that sampled data can.
>>>>>>>>>
>>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>>>>> monitor *by default*, but it would be a strictly useful feature if it
>>>>>>>>> could be enabled via a conf option.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<josh.elser@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>>>>>>>> he mentioned was the lack of insight into the distribution of data
>>>>>>>>>> marked with certain visibilities in a table. He presented an example
>>>>>>>>>> similar to this:
>>>>>>>>>>
>>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>>>>>>>> be considered to identify the individual. ANONYMIZED data is some
>>>>>>>>>> altered version of the attribute that retains some portion of the
>>>>>>>>>> original value, but is missing enough context to not identify the
>>>>>>>>>> individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>>>>>>>>>> data is for attributes which cannot identify the individual.
>>>>>>>>>>
>>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>>>> amount of data? (There are many examples of questions which could be
>>>>>>>>>> answered with this data alone.)
>>>>>>>>>>
>>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>>
>>>>>>>>>> PRIVATE: 50
>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>> PUBLIC: 20
>>>>>>>>>>
>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>>>
>>>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>>>> a certain visibility acceptable? Would a new permission to use such
>>>>>>>>>> an API to access this information be sufficient to protect the data?
>>>>>>>>>>
>>>>>>>>>> - Josh
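
Editor's sketch (not part of the thread): the two ideas discussed above, a per-RFile visibility histogram and a pluggable counter of the shape Function<Entry<Key,Value>,Map<String,Long>>, might fit together roughly as follows. Everything here is an assumption made for illustration: the class and method names are hypothetical, and a plain Entry<String,String> whose key is the visibility label stands in for Accumulo's Key/Value types. It is not Accumulo's API and says nothing about how access to the result would be protected.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Accumulo.
final class VisibilityHistogramSketch {

  private VisibilityHistogramSketch() {}

  // Assumed counter plugin: emits a count of 1 under the entry's visibility label.
  // The entry key is a stand-in for the visibility portion of an Accumulo Key.
  static final Function<Entry<String, String>, Map<String, Long>> VISIBILITY_COUNTER =
      entry -> Collections.singletonMap(entry.getKey(), 1L);

  // Applies a counter function to every entry of one file and sums the results.
  static Map<String, Long> summarizeFile(
      List<Entry<String, String>> entries,
      Function<Entry<String, String>, Map<String, Long>> counter) {
    Map<String, Long> totals = new TreeMap<>();
    for (Entry<String, String> entry : entries) {
      counter.apply(entry).forEach((name, count) -> totals.merge(name, count, Long::sum));
    }
    return totals;
  }

  // Merges per-file summaries into one table-level summary.
  static Map<String, Long> mergeSummaries(List<Map<String, Long>> perFile) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> summary : perFile) {
      summary.forEach((name, count) -> totals.merge(name, count, Long::sum));
    }
    return totals;
  }

  // Builds stand-in file contents: n entries that all carry the given label.
  static List<Entry<String, String>> entriesOf(String visibility, int n) {
    List<Entry<String, String>> entries = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      entries.add(new SimpleImmutableEntry<>(visibility, "value-" + i));
    }
    return entries;
  }
}
```

With one stand-in file holding 30 PRIVATE and 50 ANONYMIZED entries and another holding 20 PRIVATE and 20 PUBLIC entries, summarizing each file and merging the two summaries gives the histogram from the original message (PRIVATE: 50, ANONYMIZED: 50, PUBLIC: 20). Because the summaries are merged by addition, a table-level answer would not require rescanning the entries, which is the presumption raised earlier in the thread. Whether exposing that answer is acceptable is the open security question.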
