accumulo-dev mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Tue, 11 Oct 2016 20:54:54 GMT
> I've always been under the impression that accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.

OK, that makes sense, I can see the need for that. But if we follow this
path of keeping the summary data structure in the RFile header (footer?)
then it's just a convenience that's available to anybody who can read the
RFile. At that point it seems like it's just a question of who else should
be allowed to read it and how to grant that access. A system permission
makes a lot of sense to me.

-Russ


On Tue, Oct 11, 2016 at 4:33 PM Mike Drob <mdrob@mdrob.com> wrote:

> I've always been under the impression that accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.
>
> On Tue, Oct 11, 2016, 2:20 PM Josh Elser <josh.elser@gmail.com> wrote:
>
> > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> > mentioned was the lack of insight into the distribution of data marked
> > with certain visibilities in a table. He presented an example similar to
> > this:
> >
> > Imagine a hypothetical system backed by Accumulo which stores medical
> > information. There are three labels in the system: PRIVATE, ANONYMIZED,
> > and PUBLIC. PRIVATE data is that which could reasonably be considered to
> > identify the individual. ANONYMIZED data is some altered version of the
> > attribute that retains some portion of the original value, but is
> > missing enough context to not identify the individual (e.g. converting
> > the name "Josh Elser" to "J E"). PUBLIC data is for attributes which
> > cannot identify the individual.
> >
> > Doctors would be able to read the PRIVATE data, while researchers could
> > only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> > much of each kind of data is in the system? Without knowing how much
> > data is in the system, how can some application developer (who does not
> > have the ability to read all of the PRIVATE data) know that their
> > application is returning a reasonably correct amount of data? (there
> > are many examples of questions which could be answered from this data alone)
> >
> > Concretely, this histogram would look like (50 records with PRIVATE, 50
> > with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >
> > ```
> > PRIVATE: 50
> > ANONYMIZED: 50
> > PUBLIC: 20
> > ```
> >
> > Technically, I think this would actually be relatively simple to
> > implement. Inside of each RFile, we could maintain some histogram of the
> > visibilities observed in that file. This would allow us to very easily
> > report how much data in each table has each visibility label.
> >
> > However, would this feature be harmful to one of the core tenets of
> > Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> > certain visibility acceptable? Would a new permission to use such an API
> > to access this information be sufficient to protect the data?
> >
> > - Josh
> >
>
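
To make the per-RFile idea from the quoted proposal concrete, here is a minimal sketch of counting visibility labels per file and summing those counts into a table-level histogram. Everything in it is an assumption for illustration: `VisibilityHistogram`, `observe`, `mergeFrom`, and `forTable` are hypothetical names, not part of the Accumulo API, and the split of records across two files is made up. It only reproduces the 50/50/20 example from the thread.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-file histogram of visibility labels (not an Accumulo class). */
final class VisibilityHistogram {
    // Sorted map so the printed output has a stable order.
    private final Map<String, Long> counts = new TreeMap<>();

    /** Record one entry written with the given visibility expression. */
    void observe(String visibility) {
        counts.merge(visibility, 1L, Long::sum);
    }

    /** Fold another file's counts into this histogram. */
    void mergeFrom(VisibilityHistogram other) {
        other.counts.forEach((vis, n) -> counts.merge(vis, n, Long::sum));
    }

    /** Number of entries seen with this visibility, or 0 if none. */
    long count(String visibility) {
        return counts.getOrDefault(visibility, 0L);
    }

    /** Total number of entries across all visibilities. */
    long total() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Table-level view: the sum of every file's histogram. */
    static VisibilityHistogram forTable(List<VisibilityHistogram> perFile) {
        VisibilityHistogram table = new VisibilityHistogram();
        perFile.forEach(table::mergeFrom);
        return table;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        counts.forEach((vis, n) -> sb.append(vis).append(": ").append(n).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two hypothetical files whose counts add up to the example in the thread.
        VisibilityHistogram fileA = new VisibilityHistogram();
        VisibilityHistogram fileB = new VisibilityHistogram();
        for (int i = 0; i < 30; i++) fileA.observe("PRIVATE");
        for (int i = 0; i < 20; i++) fileB.observe("PRIVATE");
        for (int i = 0; i < 50; i++) fileA.observe("ANONYMIZED");
        for (int i = 0; i < 20; i++) fileB.observe("PUBLIC");

        System.out.print(forTable(List.of(fileA, fileB)));
    }
}
```

Note that a sum like this counts entries as they are stored in each file, so it would likely include old versions and deleted entries until a compaction rewrites the files. Whether that is acceptable, and who may read the result, is the open question in this thread.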
