accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Fri, 14 Oct 2016 15:06:58 GMT
Ping Marc/Mike D.

Josh Elser wrote:
> Thanks, Marc. Follow-on question(s) for you:
>
> Do you think _any_ such approach should never be pursued by Accumulo
> (reading into your other replies about doing it outside of Accumulo)?
> Are the permissions that we have in place not sufficient to protect such
> "metadata"?
>
> Or, would such a feature be "OK" to you if it required some degree of
> additional manual steps by the administrator? (if so, what steps do you
> think make this acceptable)
>
> In a similar vein, how do you see this broadening the scope of the
> Accumulo security model in an invalid manner? e.g. Administrators should
> never be able to see such information. Someone with sufficient access to
> a system would already be able to bypass Accumulo's security mechanisms.
> There are a number of vectors already where a sufficiently-credentialed
> individual could figure out this information (and more).
>
> Ultimately, I see Accumulo's main security tenet as "users should never
> be allowed to see more data than they are authorized to see". Maybe it's
> my interpretation of that or the scope of how you think the proposed
> feature would function, but I'd be very interested in hearing more about
> what you think.
>
> Marc P. wrote:
>> My point for discussing implementation outside of accumulo is because I
>> think it does invalidate a core tenet.
>>
>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<josh.elser@gmail.com> wrote:
>>
>>> Again, can we please bring this discussion back from discussions of
>>> implementations to security?
>>>
>>> Does the fact that you three were discussing implementations imply that
>>> you do not think this invalidates one of the core tenets (security
>>> first) of Accumulo?
>>>
>>> Christopher wrote:
>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>> basically this:
>>>>
>>>> We need a generic pluggable mechanism for injecting arbitrary user counters
>>>> into the RFiles. We can then use these counters in custom compaction
>>>> strategies, or other analysis. We can aggregate these counters at the
>>>> tablet, and table levels, and expose them in the API.
>>>>
>>>> These counters could store information about visibility frequencies, number
>>>> of delete entries, etc.
>>>>
>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.
>>>> In the discussion, there were lots of variations on the theme, though. So,
>>>> the actual implementation could vary. But, having something like this could
>>>> support a large number of use cases beyond just the histogram case.
>>>>
>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <josh.elser@gmail.com> wrote:
>>>>> Trivially. We could do something more intelligent like also cache it in
>>>>> metadata (updating with compactions). Don't read too much into the
>>>>> implementation at this point; it was just the first idea I had about how we
>>>>> could do it :). I'm more concerned with the idea and its security
>>>>> implications right now.
>>>>>
>>>>> In general, it seems like people are ok with it protected by a new
>>>>> permission role. Do you have more to add, Mike? Was your comment based on
>>>>> your interpretation of how Accumulo works or more a concern about
>>>>> implementing such a feature?
>>>>>
>>>>> On Oct 11, 2016 21:29,<dlmarion@comcast.net> wrote:
>>>>>
>>>>>> So, to get the set of visibilities used in a table, we would have to open
>>>>>> all of the rfiles?
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>> To: Accumulo Dev List
>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>>>>>>>
>>>>>>> Interesting idea. It begs the question: should we allow any custom index at
>>>>>>> the RFile level? If RFile indexes were user-extensible, then a visibility index
>>>>>>> would be something any developer could write. That said, we can still
>>>>>>> include such an index as an example, and if we did it could be used by the
>>>>>>> Accumulo monitor.
>>>>>>>
>>>>>>> The RFile-level sampling followed this path. I would support further work
>>>>>>> similar to it, though I admit I don't know how difficult a job it entails.
>>>>>>> Bonus points if the index information could be accessed from iterators the
>>>>>>> same way that sampled data can.
>>>>>>>
>>>>>>> I can't speak to the appropriateness of visibility histograms on the monitor
>>>>>>> *by default*, but it would be a strictly useful feature if it could be enabled via
>>>>>>> a conf option.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
>>>>>>>> mentioned was the lack of insight into the distribution of data marked
>>>>>>>> with certain visibilities in a table. He presented an example similar
>>>>>>>> to this:
>>>>>>>>
>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
>>>>>>>> considered to identify the individual. ANONYMIZED data is some altered
>>>>>>>> version of the attribute that retains some portion of the original
>>>>>>>> value, but is missing enough context to not identify the individual
>>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>>>>>>>> attributes which cannot identify the individual.
>>>>>>>>
>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>> amount of data? (There are many examples of questions which could be
>>>>>>>> answered on this data alone.)
>>>>>>>>
>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> PRIVATE: 50
>>>>>>>> ANONYMIZED: 50
>>>>>>>> PUBLIC: 20
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>
>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>> a certain visibility acceptable? Would a new permission to use such an
>>>>>>>> API to access this information be sufficient to protect the data?
>>>>>>>>
>>>>>>>> - Josh
>>>>>>>>
>>
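
For readers following the thread, here is a minimal sketch in plain Java of the two ideas discussed above: a per-file visibility histogram, built with a counter function shaped like the Function<Entry<Key,Value>,Map<String,Long>> that was suggested, and a sum of those per-file histograms for a table-level view. This is not Accumulo code. The class name, the helper methods, and the use of Entry<String,String> (row, visibility label) as a stand-in for Entry<Key,Value> are all assumptions made for illustration; nothing here reads an actual RFile or checks any permission.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Accumulo.
class VisibilityHistogramSketch {

    // Counter function: each entry contributes +1 to the counter named after
    // its visibility label. Entry<String,String> is (row, visibility label),
    // an assumed stand-in for Entry<Key,Value>.
    static final Function<Entry<String, String>, Map<String, Long>> VISIBILITY_COUNTER =
        e -> Collections.singletonMap(e.getValue(), 1L);

    // Convenience constructor for a (row, visibility) pair.
    static Entry<String, String> entry(String row, String visibility) {
        return new SimpleEntry<>(row, visibility);
    }

    // Apply the counter function to every entry of one "file" and sum the
    // increments it returns into a single histogram for that file.
    static Map<String, Long> countFile(
            List<Entry<String, String>> entries,
            Function<Entry<String, String>, Map<String, Long>> counter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry<String, String> e : entries) {
            counter.apply(e).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    // Sum per-file histograms into one table-level histogram.
    static Map<String, Long> mergeFiles(List<Map<String, Long>> perFile) {
        Map<String, Long> table = new TreeMap<>();
        for (Map<String, Long> h : perFile) {
            h.forEach((name, n) -> table.merge(name, n, Long::sum));
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Long> fileA = countFile(
            Arrays.asList(entry("r1", "PRIVATE"), entry("r2", "ANONYMIZED")),
            VISIBILITY_COUNTER);
        Map<String, Long> fileB = countFile(
            Arrays.asList(entry("r3", "PUBLIC"), entry("r4", "ANONYMIZED")),
            VISIBILITY_COUNTER);
        // Prints the merged label-to-count map for the two files.
        System.out.println(mergeFiles(Arrays.asList(fileA, fileB)));
    }
}
```

Because per-file histograms combine by simple addition, totals could be cached and updated as files come and go rather than recomputed by opening every file. Whether exposing those totals is acceptable remains the security question this thread is asking.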
