accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: [DISCUSS] Would a visibility histogram on a table be harmful?
Date Tue, 11 Oct 2016 20:10:40 GMT
Hah, funny you mention custom RFile index. I think Adam Fuchs had 
proposed a similar idea before (probably years ago now) :)

re: the monitor, I was more thinking that it would just be an API call 
to access it. I had not thought about automatically displaying it on the 
monitor (but it is an interesting idea...)

I remember making a ticket a while back to move the RFile header from a 
custom serialized object to a Thrift or Protobuf object, which would make 
handling such a drift in "schema" dirt-simple. Eventually 
there's a concern about putting too much data in there (probably 
reachable with a large number of visibilities -- implementation detail), 
but that's a related thought :)

Dylan Hutchison wrote:
> Interesting idea.  It raises the question: should we allow any custom index
> at the RFile level?  If RFile indexes were user-extensible, then a
> visibility index would be something any developer could write.  That said,
> we can still include such an index as an example, and if we did it could be
> used by the Accumulo monitor.
>
> The RFile-level sampling followed this path.  I would support further work
> similar to it, though I admit I don't know how difficult a job it entails.
> Bonus points if the index information could be accessed from iterators the
> same way that sampled data can.
>
> I can't speak to the appropriateness of visibility histograms on the
> monitor *by default*, but it would be a strictly useful feature if it could
> be enabled via a conf option.
>
>
> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
>> mentioned was the lack of insight into the distribution of data marked with
>> certain visibilities in a table. He presented an example similar to this:
>>
>> Imagine a hypothetical system backed by Accumulo which stores medical
>> information. There are three labels in the system: PRIVATE, ANONYMIZED, and
>> PUBLIC. PRIVATE data is that which could reasonably be considered to
>> identify the individual. ANONYMIZED data is some altered version of the
>> attribute that retains some portion of the original value, but is missing
>> enough context to not identify the individual (e.g. converting the name
>> "Josh Elser" to "J E"). PUBLIC data is for attributes which cannot
>> identify the individual.
>>
>> Doctors would be able to read the PRIVATE data, while researchers could
>> only read the ANONYMIZED and PUBLIC data. This leads to a question: how
>> much of each kind of data is in the system? Without knowing how much data
>> is in the system, how can some application developer (who does not have the
>> ability to read all of the PRIVATE data) know that their application is
>> returning a reasonably correct amount of data? (There are many examples of
>> questions which could be answered from this data alone.)
>>
>> Concretely, this histogram would look like (50 records with PRIVATE, 50
>> with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>
>> ```
>> PRIVATE: 50
>> ANONYMIZED: 50
>> PUBLIC: 20
>> ```
>>
>> Technically, I think this would actually be relatively simple to
>> implement. Inside of each RFile, we could maintain some histogram of the
>> visibilities observed in that file. This would allow us to very easily
>> report how much data in each table has each visibility label.
>>
>> However, would this feature be harmful to one of the core tenets of
>> Accumulo? Or, is acknowledging the existence of data in Accumulo with a
>> certain visibility acceptable? Would a new permission to use such an API to
>> access this information be sufficient to protect the data?
>>
>> - Josh
>>
>
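
The per-RFile histogram described in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `VisibilityHistogram` and its methods are hypothetical and not part of any Accumulo API, visibilities are counted by their raw expression string (so equivalent expressions written differently would be counted separately), and nothing here addresses how the counts would be stored in the RFile or who would be permitted to read them.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical histogram of visibility expressions seen in one file.
 * Not an Accumulo class; names and structure are assumptions for illustration.
 */
final class VisibilityHistogram {
    // Count of entries per raw visibility expression, kept sorted for stable output.
    private final Map<String, Long> counts = new TreeMap<>();

    /** Record one entry written with the given visibility expression. */
    void observe(String visibility) {
        counts.merge(visibility, 1L, Long::sum);
    }

    /** Fold another file's histogram into this one, e.g. to build a per-table view. */
    void mergeFrom(VisibilityHistogram other) {
        other.counts.forEach((vis, n) -> counts.merge(vis, n, Long::sum));
    }

    /** Number of entries observed with exactly this visibility expression. */
    long count(String visibility) {
        return counts.getOrDefault(visibility, 0L);
    }

    /** Total number of entries observed across all visibilities. */
    long total() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Read-only view of the counts. */
    Map<String, Long> asMap() {
        return Collections.unmodifiableMap(counts);
    }

    public static void main(String[] args) {
        // Two "files" whose histograms are combined into a table-level histogram.
        VisibilityHistogram fileA = new VisibilityHistogram();
        VisibilityHistogram fileB = new VisibilityHistogram();
        for (int i = 0; i < 50; i++) fileA.observe("PRIVATE");
        for (int i = 0; i < 50; i++) fileB.observe("ANONYMIZED");
        for (int i = 0; i < 20; i++) fileB.observe("PUBLIC");

        VisibilityHistogram table = new VisibilityHistogram();
        table.mergeFrom(fileA);
        table.mergeFrom(fileB);
        table.asMap().forEach((vis, n) -> System.out.println(vis + ": " + n));
    }
}
```

Keeping one histogram per file and merging on read mirrors the "inside of each RFile" idea; the number of distinct expressions bounds the size of each histogram, which is the header-size concern raised in the reply.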
