accumulo-dev mailing list archives

From Josh Elser <>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Mon, 17 Oct 2016 04:03:44 GMT
A nice round number to track this work:

Josh Elser wrote:
> Thanks for the reply, Mike.
> Mike Drob wrote:
>> Hiding this behind the SystemPermission.SYSTEM permission might be
>> sufficient.
> Superb. Personally, I wouldn't want to piggy-back on SystemPermission.SYSTEM
> (because that permission implies a lot of other things too), but that's
> an implementation detail we can hash out later.
>> In a situation where Accumulo data is on an encrypted volume, or the
>> rfiles themselves are encrypted, then a root user wouldn't be able to
>> read the rfiles to generate the histograms. This matches my initial
>> mental model of an admin user that doesn't necessarily need access to
>> data and data users that don't have access to admin commands. There is
>> no all-powerful root user that can do everything and read everything.
> I agree with you that we should not assume an admin has the ability to
> read all data in all cases. In some cases it might, but the encrypted
> files are one good example that guarantees that cannot happen. I do draw
> a distinction between being able to read all data and generating a count
> of the unique visibility labels. I think that, in most cases, such a
> sketch on the visibilities in the system does not leak any sensitive
> data; however, hiding that access behind a system permission is a good
> compromise for those whose use-cases I haven't considered :)
>> Have we ever discussed an "emergency access, give me all the permissions"
>> model? I feel like I've heard John Vines mention this before, I think.
>> This would be a reasonable extension of that.
> I don't recall hearing of that one before, and I don't think I agree
> that this proposal is an extension of it. The number of records in the
> system and the visibility of them are purely "metadata" which do not
> expose identifying information about the actual data.
>> Mike
>> On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser<> wrote:
>>> Ping Marc/Mike D.
>>> Josh Elser wrote:
>>>> Thanks, Marc. Follow-on question(s) for you:
>>>> Do you think _any_ such approach should never be pursued by Accumulo
>>>> (reading into your other replies about doing it outside of Accumulo)?
>>>> Are the permissions that we have in place not sufficient to protect
>>>> such "metadata"?
>>>> Or, would such a feature be "OK" to you if it required some degree of
>>>> additional manual steps by the administrator? (If so, what steps do you
>>>> think make this acceptable?)
>>>> In a similar vein, how do you see this broadening the scope of the
>>>> Accumulo security model in an invalid manner? e.g. Administrators
>>>> should never be able to see such information. Someone with sufficient
>>>> access to a system would already be able to bypass Accumulo's security
>>>> mechanisms. There are a number of vectors already where a
>>>> sufficiently-credentialed individual could figure out this information
>>>> (and more).
>>>> Ultimately, I see Accumulo's main security tenet as "users should never
>>>> be allowed to see more data than they are authorized to see". Maybe
>>>> it's my interpretation of that or the scope of how you think the
>>>> proposed feature would function, but I'd be very interested in hearing
>>>> more about what you think.
>>>> Marc P. wrote:
>>>>> My point for discussing implementation outside of Accumulo is
>>>>> because I think it does invalidate a core tenet.
>>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<> wrote:
>>>>>> Again, can we please bring this discussion back from discussions of
>>>>>> implementations to security?
>>>>>> Does the fact that you three were discussing implementations imply
>>>>>> that you do not think this invalidates one of the core tenets
>>>>>> (security first) of Accumulo?
>>>>>> Christopher wrote:
>>>>>>> Keith, Russ, myself (and possibly others) were discussing this at
>>>>>>> the hackathon after the Accumulo Summit, and I think our consensus
>>>>>>> was basically this:
>>>>>>> We need a generic pluggable mechanism for injecting arbitrary
>>>>>>> counters into the RFiles. We can then use these counters in custom
>>>>>>> compaction strategies, or other analysis. We can aggregate these
>>>>>>> counters at the tablet, and table levels, and expose them in the API.
>>>>>>> These counters could store information about visibility frequencies,
>>>>>>> number of delete entries, etc.
>>>>>>> The interface might just be a
>>>>>>> Function<Entry<Key,Value>,Map<String,Long>>.
>>>>>>> In the discussion, there were lots of variations on the theme,
>>>>>>> though. So, the actual implementation could vary. But, having
>>>>>>> something like this could support a large number of use cases beyond
>>>>>>> just the histogram.
>>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<> wrote:
>>>>>>>> Trivially. We could do something more intelligent like also cache
>>>>>>>> it in metadata (updating with compactions). Don't read too much
>>>>>>>> into the implementation at this point; it was just the first idea
>>>>>>>> I had about how we could do it :). I'm more concerned with the idea
>>>>>>>> and its security implications right now.
>>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>>> permission role. Do you have more to add, Mike? Was your
>>>>>>>> based on your interpretation of how Accumulo works or more a concern
>>>>>>>> about implementing such a feature?
>>>>>>>> On Oct 11, 2016 21:29,<> wrote:
>>>>>>>>> So, to get the set of visibilities used in a table, we would
>>>>>>>>> have to open all of the rfiles?
>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Dylan Hutchison []
>>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>>> To: Accumulo Dev List
>>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>>>>>> harmful?
>>>>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>>>>>> index at the RFile level? If RFile indexes were user-extensible,
>>>>>>>>>> then a visibility index would be something any developer could
>>>>>>>>>> write. That said, we can still include such an index as an example,
>>>>>>>>>> and if we did it could be used by the Accumulo monitor.
>>>>>>>>>> The RFile-level sampling followed this path. I would further work
>>>>>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>>>>>>> entails.
>>>>>>>>>> Bonus points if the index information could be accessed from
>>>>>>>>>> iterators the same way that sampled data can.
>>>>>>>>>> I can't speak to the appropriateness of visibility on the monitor
>>>>>>>>>> *by default*, but it would be a strictly useful feature if it could
>>>>>>>>>> be enabled via a conf option.
>>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<> wrote:
>>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One
>>>>>>>>>>> topic he mentioned was the lack of insight into the distribution
>>>>>>>>>>> of data marked with certain visibilities in a table. He presented
>>>>>>>>>>> an example similar to this:
>>>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores
>>>>>>>>>>> medical information. There are three labels in the system:
>>>>>>>>>>> PRIVATE, ANONYMIZED, and PUBLIC. PRIVATE data is that which could
>>>>>>>>>>> reasonably be considered to identify the individual. ANONYMIZED
>>>>>>>>>>> data is some altered version of the attribute that retains some
>>>>>>>>>>> portion of the original value, but is missing enough context to
>>>>>>>>>>> not identify the individual (e.g. converting the name "Josh Elser"
>>>>>>>>>>> to "J E"). PUBLIC data is for attributes which cannot identify the
>>>>>>>>>>> individual.
>>>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>>>> developer (who does not have the ability to read all of the
>>>>>>>>>>> PRIVATE data) know that their application is returning a
>>>>>>>>>>> reasonably correct amount of data? (There are many examples of
>>>>>>>>>>> questions that could be answered on this data alone.)
>>>>>>>>>>> Concretely, this histogram would look like (50 records with
>>>>>>>>>>> PRIVATE, 50 with ANONYMIZED, and 20 with PUBLIC; 120 records
>>>>>>>>>>> total):
>>>>>>>>>>> ```
>>>>>>>>>>> PRIVATE: 50
>>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>>> PUBLIC: 20
>>>>>>>>>>> ```
>>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>>> implement. Inside of each RFile, we could maintain a histogram of
>>>>>>>>>>> the visibilities observed in that file. This would allow us to
>>>>>>>>>>> very easily report how much data in each table has each visibility
>>>>>>>>>>> label.
>>>>>>>>>>> However, would this feature be harmful to one of the core tenets
>>>>>>>>>>> of Accumulo? Or, is acknowledging the existence of data in
>>>>>>>>>>> Accumulo with a certain visibility acceptable? Would a new
>>>>>>>>>>> permission to use such an API to access this information be
>>>>>>>>>>> sufficient to protect the data?
>>>>>>>>>>> - Josh
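
For readers skimming the thread, the per-file histogram and the Function<Entry<Key,Value>,Map<String,Long>> counter idea discussed above can be illustrated with a small, self-contained sketch. This is not Accumulo code and not anyone's proposed implementation: the class VisibilityHistogramSketch and the helpers cell, histogramForFile, and mergeHistograms are hypothetical names, plain strings stand in for Accumulo's Key and Value, and each "file" is just an in-memory list. It only shows the counting and merging logic, with no permission check.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

class VisibilityHistogramSketch {

    // Stand-in for the counter function floated in the thread.
    // Assumption: the entry key is the visibility label as a plain String
    // (not an Accumulo Key), and the entry value is the cell contents.
    static final Function<Entry<String, String>, Map<String, Long>> COUNT_VISIBILITY =
            entry -> Collections.singletonMap(entry.getKey(), 1L);

    // Hypothetical helper: builds one (visibility, value) cell.
    static Entry<String, String> cell(String visibility, String value) {
        return new SimpleEntry<>(visibility, value);
    }

    // Builds the histogram for one hypothetical file by summing the
    // increments returned by the counter function for every entry.
    static Map<String, Long> histogramForFile(List<Entry<String, String>> entries) {
        Map<String, Long> histogram = new TreeMap<>();
        for (Entry<String, String> entry : entries) {
            COUNT_VISIBILITY.apply(entry)
                    .forEach((name, increment) -> histogram.merge(name, increment, Long::sum));
        }
        return histogram;
    }

    // Aggregates per-file histograms into a single table-level histogram.
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> histogram : perFile) {
            histogram.forEach((label, count) -> total.merge(label, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> fileA = Arrays.asList(
                cell("PRIVATE", "Josh Elser"),
                cell("ANONYMIZED", "J E"));
        List<Entry<String, String>> fileB = Arrays.asList(
                cell("PRIVATE", "another name"),
                cell("PUBLIC", "a public attribute"));

        Map<String, Long> table = mergeHistograms(
                Arrays.asList(histogramForFile(fileA), histogramForFile(fileB)));

        // Print in the same "LABEL: count" shape used in the thread.
        table.forEach((label, count) -> System.out.println(label + ": " + count));
    }
}
```

Because the per-file maps are merged by simple addition, a table-level result never requires re-reading cell values once each file's histogram exists, which is the property the thread relies on when it suggests storing the counts inside each RFile.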
