accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Mon, 17 Oct 2016 04:03:44 GMT
A nice round number to track this work: 
https://issues.apache.org/jira/browse/ACCUMULO-4500

Josh Elser wrote:
> Thanks for the reply, Mike.
>
> Mike Drob wrote:
>> Hiding this behind the SystemPermission.SYSTEM permission might be
>> sufficient.
>
> Superb. Personally, I wouldn't want to piggy-back on SYSTEM.SYSTEM
> (because that permission implies a lot of other things too), but that's
> an implementation detail we can hash out later.
>
>> In a situation where Accumulo data is on an encrypted volume, or the
>> rfiles themselves are encrypted, then a root user wouldn't be able to
>> read the rfiles to generate the histograms. This matches my initial
>> mental model of an admin user that doesn't necessarily need access to
>> data and data users that don't have access to admin commands. There is
>> no all powerful root user that can do everything and read everything.
>
> I agree with you that we should not assume an admin has the ability to
> read all data in all cases. In some cases they might, but encrypted
> files are one good example where that cannot happen. I do draw
> a distinction between being able to read all data and generating a count
> of the unique visibility labels. I think that, in most cases, such a
> sketch on the visibilities in the system does not leak any sensitive
> data; however, hiding that access behind a system permission is a good
> compromise for those whose use-cases I haven't considered :)
>
>> Have we ever discussed an "emergency access, give me all the permissions"
>> model? I feel like I've heard John Vines mention this before, I think.
>> This would be a reasonable extension of that.
>
> I don't recall hearing of that one before, and I don't think I agree
> that this proposal is an extension of it. The number of records in the
> system and the visibility of them are purely "metadata" which do not
> expose identifying information about the actual data.
>
>> Mike
>>
>> On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser<josh.elser@gmail.com> wrote:
>>
>>> Ping Marc/Mike D.
>>>
>>>
>>> Josh Elser wrote:
>>>
>>>> Thanks, Marc. Follow-on question(s) for you:
>>>>
>>>> Do you think _any_ such approach should never be pursued by Accumulo
>>>> (reading into your other replies about doing it outside of Accumulo)?
>>>> Are the permissions that we have in place not sufficient to protect
>>>> such "metadata"?
>>>>
>>>> Or, would such a feature be "OK" to you if it required some degree of
>>>> additional manual steps by the administrator? (if so, what steps do you
>>>> think make this acceptable)
>>>>
>>>> In a similar vein, how do you see this broadening the scope of the
>>>> Accumulo security model in an invalid manner? e.g. Administrators
>>>> should never be able to see such information. Someone with sufficient
>>>> access to a system would already be able to bypass Accumulo's security
>>>> mechanisms. There are a number of vectors already where a
>>>> sufficiently-credentialed individual could figure out this information
>>>> (and more).
>>>>
>>>> Ultimately, I see Accumulo's main security tenet as "users should never
>>>> be allowed to see more data than they are authorized to see". Maybe
>>>> it's my interpretation of that or the scope of how you think the
>>>> proposed feature would function, but I'd be very interested in hearing
>>>> more about what you think.
>>>>
>>>> Marc P. wrote:
>>>>
>>>>> My point for discussing implementation outside of Accumulo is because I
>>>>> think it does invalidate a core tenet
>>>>>
>>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<josh.elser@gmail.com> wrote:
>>>>>
>>>>>> Again, can we please bring this discussion back from discussions of
>>>>>> implementations to security?
>>>>>>
>>>>>> Does the fact that you three were discussing implementations imply that
>>>>>> you do not think this invalidates one of the core tenets (security
>>>>>> first) of Accumulo?
>>>>>>
>>>>>> Christopher wrote:
>>>>>>
>>>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>>>>> basically this:
>>>>>>>
>>>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>>> counters into the RFiles. We can then use these counters in custom
>>>>>>> compaction strategies, or other analysis. We can aggregate these
>>>>>>> counters at the tablet and table levels, and expose them in the API.
>>>>>>>
>>>>>>> These counters could store information about visibility frequencies,
>>>>>>> number of delete entries, etc.
>>>>>>>
>>>>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.
>>>>>>>
>>>>>>> In the discussion, there were lots of variations on the theme, though.
>>>>>>> So, the actual implementation could vary. But, having something like
>>>>>>> this could support a large number of use cases beyond just the
>>>>>>> histogram case.
>>>>>>>
>>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<josh.elser@gmail.com> wrote:
>>>>>>>
>>>>>>>> Trivially. We could do something more intelligent like also cache it in
>>>>>>>> metadata (updating with compactions). Don't read too much into the
>>>>>>>> implementation at this point; it was just the first idea I had about
>>>>>>>> how we could do it :). I'm more concerned with the idea and its
>>>>>>>> security implications right now.
>>>>>>>>
>>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>>> permission role. Do you have more to add, Mike? Was your comment based
>>>>>>>> on your interpretation of how Accumulo works or more a concern about
>>>>>>>> implementing such a feature?
>>>>>>>>
>>>>>>>> On Oct 11, 2016 21:29,<dlmarion@comcast.net> wrote:
>>>>>>>>
>>>>>>>>> So, to get the set of visibilities used in a table, we would have to
>>>>>>>>> open all of the rfiles?
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>>> To: Accumulo Dev List
>>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>>>>>>>>>
>>>>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>>>>>> index at the RFile level? If RFile indexes were user-extensible, then a
>>>>>>>>>> visibility index would be something any developer could write. That
>>>>>>>>>> said, we can still include such an index as an example, and if we did
>>>>>>>>>> it could be used by the Accumulo monitor.
>>>>>>>>>>
>>>>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>>>>>> work similar to it, though I admit I don't know how difficult a job it
>>>>>>>>>> entails.
>>>>>>>>>>
>>>>>>>>>> Bonus points if the index information could be accessed from iterators
>>>>>>>>>> the same way that sampled data can.
>>>>>>>>>>
>>>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>>>>>> monitor *by default*, but it would be a strictly useful feature if it
>>>>>>>>>> could be enabled via a conf option.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<josh.elser@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
>>>>>>>>>>> mentioned was the lack of insight into the distribution of data marked
>>>>>>>>>>> with certain visibilities in a table. He presented an example similar
>>>>>>>>>>> to this:
>>>>>>>>>>>
>>>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
>>>>>>>>>>> considered to identify the individual. ANONYMIZED data is some altered
>>>>>>>>>>> version of the attribute that retains some portion of the original
>>>>>>>>>>> value, but is missing enough context to not identify the individual
>>>>>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>>>>>>>>>>> attributes which cannot identify the individual.
>>>>>>>>>>>
>>>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>>>>> amount of data? (There are many examples of questions which could be
>>>>>>>>>>> answered on this data alone.)
>>>>>>>>>>>
>>>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> PRIVATE: 50
>>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>>> PUBLIC: 20
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>>>>
>>>>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>>>>> a certain visibility acceptable? Would a new permission to use such an
>>>>>>>>>>> API to access this information be sufficient to protect the data?
>>>>>>>>>>>
>>>>>>>>>>> - Josh
>>>>>>>>>>>
>>>>>>>>>>>
>>
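
For readers following the thread, here is a minimal sketch of the idea discussed above: a counter function in the spirit of the Function<Entry<Key,Value>,Map<String,Long>> suggestion, applied per file and then summed up to the table level. It is only an illustration under stated assumptions, not Accumulo code: `LabeledEntry`, `VISIBILITY_COUNTER`, `histogramForFile`, and `mergeHistograms` are hypothetical names, and plain in-memory lists stand in for RFiles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class VisibilityHistogramSketch {

    /** Hypothetical stand-in for an Accumulo entry; only the visibility label is used. */
    static final class LabeledEntry {
        final String row;
        final String visibility;

        LabeledEntry(String row, String visibility) {
            this.row = row;
            this.visibility = visibility;
        }
    }

    /** Counter function: each entry contributes a count of 1 under its visibility label. */
    static final Function<LabeledEntry, Map<String, Long>> VISIBILITY_COUNTER =
            e -> Collections.singletonMap(e.visibility, 1L);

    /** Builds the histogram for one "file" by summing the counters of its entries. */
    static Map<String, Long> histogramForFile(List<LabeledEntry> entries) {
        Map<String, Long> histogram = new TreeMap<>();
        for (LabeledEntry entry : entries) {
            VISIBILITY_COUNTER.apply(entry)
                    .forEach((label, count) -> histogram.merge(label, count, Long::sum));
        }
        return histogram;
    }

    /** Sums per-file histograms into one histogram (e.g. for a tablet or a table). */
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> histogram : perFile) {
            histogram.forEach((label, count) -> total.merge(label, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two made-up "files" holding entries with the labels from the example.
        List<LabeledEntry> fileA = Arrays.asList(
                new LabeledEntry("r1", "PRIVATE"), new LabeledEntry("r2", "PUBLIC"));
        List<LabeledEntry> fileB = Arrays.asList(
                new LabeledEntry("r3", "PRIVATE"), new LabeledEntry("r4", "ANONYMIZED"));

        Map<String, Long> table = mergeHistograms(
                Arrays.asList(histogramForFile(fileA), histogramForFile(fileB)));
        table.forEach((label, count) -> System.out.println(label + ": " + count));
    }
}
```

Because the per-file maps are merged by simple addition, the table-level view never needs the entries themselves, only the counts. The sketch deliberately leaves out the part the thread is actually about: who is permitted to read those counts.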
