accumulo-dev mailing list archives

From Josh Elser <>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Fri, 14 Oct 2016 18:54:53 GMT
Thanks for the reply, Mike.

Mike Drob wrote:
> Hiding this behind the SystemPermission.SYSTEM permission might be
> sufficient.

Superb. Personally, I wouldn't want to piggy-back on
SystemPermission.SYSTEM (because that permission implies a lot of other
things too), but that's an implementation detail we can hash out later.

> In a situation where Accumulo data is on an encrypted volume, or the rfiles
> themselves are encrypted, then a root user wouldn't be able to read the
> rfiles to generate the histograms. This matches my initial mental model of
> an admin user that doesn't necessarily need access to data and data
> users that don't have access to admin commands. There is no all powerful
> root user that can do everything and read everything.

I agree with you that we should not assume an admin has the ability to
read all data in all cases. In some cases they might, but encrypted
files are one good example where that is guaranteed not to be possible.
I do draw a distinction between being able to read all data and
generating a count of the unique visibility labels. I think that, in
most cases, such a sketch of the visibilities in the system does not
leak any sensitive data; however, hiding that access behind a system
permission is a good compromise for those whose use-cases I haven't
considered :)
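
To make that distinction concrete, here is a rough sketch of the kind of
counting I mean. It is not Accumulo code: `String` stands in for both the
visibility label and the value so that it runs on a plain JDK (9+), and the
class and method names (`VisibilityHistogramSketch`, `histogram`, `merge`)
are made up for illustration. The counter has the
`Function<Entry<...>,Map<String,Long>>` shape suggested further down the
thread, and it only ever looks at the label, never at the value.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

// Sketch only. Assumption: the entry key is just the visibility label and the
// entry value is the (ignored) cell value; real Accumulo Key/Value are not used.
class VisibilityHistogramSketch {

  // Hypothetical counter function: one entry in, named counter increments out.
  static final Function<Entry<String, String>, Map<String, Long>> VISIBILITY_COUNTER =
      entry -> Map.of(entry.getKey(), 1L);

  // Sum the counter increments over every entry of one (pretend) file.
  static Map<String, Long> histogram(List<Entry<String, String>> entries) {
    Map<String, Long> counts = new TreeMap<>();
    for (Entry<String, String> e : entries) {
      VISIBILITY_COUNTER.apply(e).forEach((name, n) -> counts.merge(name, n, Long::sum));
    }
    return counts;
  }

  // Combine per-file histograms into a table-level histogram.
  static Map<String, Long> merge(List<Map<String, Long>> perFile) {
    Map<String, Long> total = new TreeMap<>();
    for (Map<String, Long> h : perFile) {
      h.forEach((name, n) -> total.merge(name, n, Long::sum));
    }
    return total;
  }

  public static void main(String[] args) {
    List<Entry<String, String>> fileA =
        List.of(Map.entry("PRIVATE", "v1"), Map.entry("PUBLIC", "v2"));
    List<Entry<String, String>> fileB = List.of(Map.entry("PRIVATE", "v3"));
    // prints {PRIVATE=2, PUBLIC=1}
    System.out.println(merge(List.of(histogram(fileA), histogram(fileB))));
  }
}
```

Summing per-file maps like this is what would let a table-level count be
assembled without re-reading the entries themselves; how (or whether) RFiles
would persist such counters is left open here.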

> Have we ever discussed an "emergency access, give me all the permissions"
> model? I feel like I've heard John Vines mention this before, I think. This
> would be a reasonable extension of that.

I don't recall hearing of that one before, and I don't think I agree 
that this proposal is an extension of it. The number of records in the
system and their visibilities are purely "metadata" which does not
expose identifying information about the actual data.

> Mike
> On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser<>  wrote:
>> Ping Marc/Mike D.
>> Josh Elser wrote:
>>> Thanks, Marc. Follow-on question(s) for you:
>>> Do you think _any_ such approach should never be pursued by Accumulo
>>> (reading into your other replies about doing it outside of Accumulo)?
>>> Are the permissions that we have in place not sufficient to protect such
>>> "metadata"?
>>> Or, would such a feature be "OK" to you if it required some degree of
>>> additional manual steps by the administrator? (if so, what steps do you
>>> think make this acceptable)
>>> In a similar vein, how do you see this broadening the scope of the
>>> Accumulo security model in an invalid manner? e.g. Administrators should
>>> never be able to see such information. Someone with sufficient access to
>>> a system would already be able to bypass Accumulo's security mechanisms.
>>> There are a number of vectors already where a sufficiently-credentialed
>>> individual could figure out this information (and more).
>>> Ultimately, I see Accumulo's main security tenet as "users should never
>>> be allowed to see more data than they are authorized to see". Maybe it's
>>> my interpretation of that or the scope of how you think the proposed
>>> feature would function, but I'd be very interested in hearing more about
>>> what you think.
>>> Marc P. wrote:
>>>> My point for discussing implementation outside of Accumulo is because I
>>>> think it does invalidate a core tenet.
>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<>  wrote:
>>>>> Again, can we please bring this discussion back from discussions of
>>>>> implementations to security?
>>>>> Does the fact that you three were discussing implementations imply that
>>>>> you do not think this invalidates one of the core tenets (security
>>>>> first) of Accumulo?
>>>>> Christopher wrote:
>>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>>>> basically this:
>>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>> counters into the RFiles. We can then use these counters in custom
>>>>>> compaction strategies, or other analysis. We can aggregate these
>>>>>> counters at tablet, and table levels, and expose them in the API.
>>>>>> These counters could store information about visibility frequencies,
>>>>>> number of delete entries, etc.
>>>>>> The interface might just be a
>>>>>> Function<Entry<Key,Value>,Map<String,Long>>.
>>>>>> In the discussion, there were lots of variations on the theme, though.
>>>>>> So, the actual implementation could vary. But, having something like
>>>>>> this could support a large number of use cases beyond just the
>>>>>> histogram case.
>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<> wrote:
>>>>>>> Trivially. We could do something more intelligent like also cache
>>>>>>> it in metadata (updating with compactions). Don't read too much into
>>>>>>> implementation at this point; it was just the first idea I had how
>>>>>>> we could do it :). I'm more concerned with the idea and its security
>>>>>>> implications right now.
>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>> permission role. Do you have more to add, Mike? Was your comment on
>>>>>>> your interpretation of how Accumulo works or more a concern about
>>>>>>> implementing such a feature?
>>>>>>> On Oct 11, 2016 21:29,<>  wrote:
>>>>>>>> So, to get the set of visibilities used in a table, we would have
>>>>>>>> to open all of the rfiles?
>>>>>>>> -----Original Message-----
>>>>>>>>> From: Dylan Hutchison []
>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>> To: Accumulo Dev List
>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>>>>> harmful?
>>>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>>>>> index at the RFile level? If RFile indexes were user-extensible,
>>>>>>>>> then a visibility index would be something any developer could
>>>>>>>>> write. That said, we can still include such an index as an example,
>>>>>>>>> and if we did it could be used by the Accumulo monitor.
>>>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>>>>> work similar to it, though I admit I don't know how difficult a job
>>>>>>>>> it entails.
>>>>>>>>> Bonus points if the index information could be accessed by iterators
>>>>>>>>> the same way that sampled data can.
>>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>>>>> monitor *by default*, but it would be a strictly useful feature if
>>>>>>>>> it could be enabled via a conf option.
>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<>  wrote:
>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>>>>>>>> he mentioned was the lack of insight into the distribution of data
>>>>>>>>>> marked with certain visibilities in a table. He presented an
>>>>>>>>>> example similar to this:
>>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores
>>>>>>>>>> medical information. There are three labels in the system: PRIVATE,
>>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>>>>>>>> be considered to identify the individual. ANONYMIZED data is an
>>>>>>>>>> altered version of the attribute that retains some portion of the
>>>>>>>>>> value, but is missing enough context to not identify the individual
>>>>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>>>>>>>>>> attributes which cannot identify the individual.
>>>>>>>>>> Doctors would be able to read the PRIVATE data, while [...] could
>>>>>>>>>> only read the ANONYMIZED and PUBLIC data. This leads to a question:
>>>>>>>>>> how much of each kind of data is in the system? Without knowing how
>>>>>>>>>> much data is in the system, how can some [...] developer (who does
>>>>>>>>>> not have the ability to read all of the PRIVATE data) know that
>>>>>>>>>> their application is returning an [...] correct amount of data?
>>>>>>>>>> (there are many examples of questions that could be answered on
>>>>>>>>>> this data alone)
>>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>> ```
>>>>>>>>>> PRIVATE: 50
>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>> PUBLIC: 20
>>>>>>>>>> ```
>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>> implement. Inside of each RFile, we could maintain a histogram of
>>>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>>>> easily report how much data in each table has each label.
>>>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo
>>>>>>>>>> with a certain visibility acceptable? Would a new permission to use
>>>>>>>>>> such an API to access this information be sufficient to protect the
>>>>>>>>>> [...]
>>>>>>>>>> - Josh
