accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject [DISCUSS] Would a visibility histogram on a table be harmful?
Date Tue, 11 Oct 2016 19:20:09 GMT
Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he 
mentioned was the lack of insight into the distribution of data marked 
with certain visibilities in a table. He presented an example similar to 
this:

Imagine a hypothetical system backed by Accumulo which stores medical 
information. There are three labels in the system: PRIVATE, ANONYMIZED, 
and PUBLIC. PRIVATE data is that which could reasonably be considered to 
identify the individual. ANONYMIZED data is some altered version of the 
attribute that retains some portion of the original value, but is 
missing enough context to not identify the individual (e.g. converting 
the name "Josh Elser" to "J E"). PUBLIC data is for attributes which 
cannot identify the individual.

Doctors would be able to read the PRIVATE data, while researchers could 
only read the ANONYMIZED and PUBLIC data. This leads to a question: how 
much of each kind of data is in the system? Without knowing how much 
data is in the system, how can some application developer (who does not 
have the ability to read all of the PRIVATE data) know that their 
application is returning a reasonably correct amount of data? (There 
are many examples of questions which could be answered from this data 
alone.)

Concretely, this histogram would look like (50 records with PRIVATE, 50 
with ANONYMIZED, and 20 with PUBLIC; 120 records total):

```
PRIVATE: 50
ANONYMIZED: 50
PUBLIC: 20
```
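
To make the developer's sanity check concrete, here is a minimal Python 
sketch of how such a histogram might be used. It assumes each record 
carries exactly one of the three labels (real Accumulo visibilities can 
be boolean expressions over labels), and the function name 
`expected_visible` is made up for illustration:

```python
def expected_visible(histogram, authorizations):
    """Count records whose single label is in the caller's authorizations."""
    return sum(count for label, count in histogram.items()
               if label in authorizations)


# The example histogram from above.
histogram = {"PRIVATE": 50, "ANONYMIZED": 50, "PUBLIC": 20}

# A researcher may read ANONYMIZED and PUBLIC, but not PRIVATE.
researcher_total = expected_visible(histogram, {"ANONYMIZED", "PUBLIC"})

# Total number of records in the table, regardless of label.
table_total = sum(histogram.values())
```

With these numbers a researcher's full scan should come back with 70 of 
the 120 records; a result far from that would suggest a problem in the 
application.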

Technically, I think this would actually be relatively simple to 
implement. Inside of each RFile, we could maintain some histogram of the 
visibilities observed in that file. This would allow us to very easily 
report how much data in each table has each visibility label.
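
As a rough sketch of that idea (in Python rather than Accumulo's Java, 
and not tied to the actual RFile format): each file keeps a counter 
keyed by the visibility string, and the table-level report is the sum 
of the per-file counters. The class and function names here are 
hypothetical.

```python
from collections import Counter


class FileVisibilityHistogram:
    """Hypothetical per-file histogram, updated as entries are written."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, visibility):
        # Key on the raw visibility string; no parsing is attempted.
        self.counts[visibility] += 1


def table_histogram(file_histograms):
    """Sum the per-file histograms to get the table-wide view."""
    total = Counter()
    for h in file_histograms:
        total.update(h.counts)
    return total


# Two files whose entries add up to the 50/50/20 example above.
f1, f2 = FileVisibilityHistogram(), FileVisibilityHistogram()
for vis, n in [("PRIVATE", 30), ("ANONYMIZED", 50)]:
    for _ in range(n):
        f1.observe(vis)
for vis, n in [("PRIVATE", 20), ("PUBLIC", 20)]:
    for _ in range(n):
        f2.observe(vis)

merged = table_histogram([f1, f2])
```

One caveat with plain summation: entries that have been deleted or 
superseded but not yet compacted away would presumably still be 
counted, so the totals would be approximate.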

However, would this feature be harmful to one of the core tenets of 
Accumulo? Or, is acknowledging the existence of data in Accumulo with a 
certain visibility acceptable? Would a new permission to use such an API 
to access this information be sufficient to protect the data?

- Josh
