accumulo-dev mailing list archives

From Mike Drob <md...@mdrob.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Tue, 11 Oct 2016 20:32:59 GMT
I've always been under the impression that Accumulo was not supposed to
confirm the existence of data that a user did not have permission to read.

On Tue, Oct 11, 2016, 2:20 PM Josh Elser <josh.elser@gmail.com> wrote:

> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> mentioned was the lack of insight into the distribution of data marked
> with certain visibilities in a table. He presented an example similar to
> this:
>
> Imagine a hypothetical system backed by Accumulo which stores medical
> information. There are three labels in the system: PRIVATE, ANONYMIZED,
> and PUBLIC. PRIVATE data is that which could reasonably be considered to
> identify the individual. ANONYMIZED data is some altered version of the
> attribute that retains some portion of the original value, but is
> missing enough context to not identify the individual (e.g. converting
> the name "Josh Elser" to "J E"). PUBLIC data is for attributes which
> cannot identify the individual.
>
> Doctors would be able to read the PRIVATE data, while researchers could
> only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> much of each kind of data is in the system? Without knowing how much
> data is in the system, how can some application developer (who does not
> have the ability to read all of the PRIVATE data) know that their
> application is returning a reasonably correct amount of data? (There
> are many examples of questions which could be answered from this data alone.)
>
> Concretely, this histogram would look like (50 records with PRIVATE, 50
> with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>
> ```
> PRIVATE: 50
> ANONYMIZED: 50
> PUBLIC: 20
> ```
>
> Technically, I think this would actually be relatively simple to
> implement. Inside of each RFile, we could maintain some histogram of the
> visibilities observed in that file. This would allow us to very easily
> report how much data in each table has each visibility label.
>
> However, would this feature be harmful to one of the core tenets of
> Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> certain visibility acceptable? Would a new permission to use such an API
> to access this information be sufficient to protect the data?
>
> - Josh
>
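
To make the per-RFile idea in the quoted message concrete, here is a minimal sketch of counting visibilities per file and summing the counts per table. It is plain Python and does not use any Accumulo API; `file_histogram`, `table_histogram`, and the sample files are hypothetical names chosen only for illustration, and the split of records across two files is an assumption.

```python
from collections import Counter

def file_histogram(entries):
    """Count entries per visibility label for one (hypothetical) file.

    `entries` is an iterable of (key, visibility) pairs.
    """
    return Counter(visibility for _key, visibility in entries)

def table_histogram(per_file_histograms):
    """Sum per-file histograms into a single table-level histogram."""
    total = Counter()
    for histogram in per_file_histograms:
        total.update(histogram)
    return total

# Hypothetical data: two files that together hold the 120 records
# from the example in the quoted message.
file_a = [(f"a{i}", "PRIVATE") for i in range(30)] + \
         [(f"a{i}", "ANONYMIZED") for i in range(30, 80)]
file_b = [(f"b{i}", "PRIVATE") for i in range(20)] + \
         [(f"b{i}", "PUBLIC") for i in range(20, 40)]

table = table_histogram([file_histogram(file_a), file_histogram(file_b)])
for label in ("PRIVATE", "ANONYMIZED", "PUBLIC"):
    print(f"{label}: {table[label]}")
```

Note that the summed result exposes the PRIVATE count to whoever can call it, which is exactly the concern raised in this thread; the sketch says nothing about how access to it should be controlled.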
