accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3970) Generating multiple views of a value at scan time
Date Sat, 22 Aug 2015 21:16:46 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708180#comment-14708180 ]

Josh Elser commented on ACCUMULO-3970:
--------------------------------------

Thinking about the problem a different way: might it be less intrusive to encode the rules
for alternate views of the data at ingest time, and then push down the logic that deduplicates
columns with multiple visibilities? I think this would avoid the need to inject code into the
system visibility handling, and would let the merge/filter run in "userspace iterators".

For your DOB example, the table would actually contain both records:

{noformat}
(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"
(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"
{noformat}

At scan time, if the user has the ability to view this patient's personal data, they would
scan with the PII_DOB authorization. Otherwise, they'd only have the SHD_DOB authorization. In
the former case, you'd want to present only one value for demographic:pt_dob, "1925-08-22",
not both. This lets you perform the filter on the server instead of unwinding it on the client.
It does require logic in the driving application to assign the visibilities accordingly, but
I think your original suggestion would also require this.
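
To make the deduplication step concrete, here is a minimal sketch in plain Java. It is not written against the Accumulo iterator API: the `Entry` class, the `dedupe` method, and the idea of passing an explicit preference order of visibility labels are all assumptions made for illustration. A real implementation would presumably live in a custom scan-time iterator and operate on the actual key and value types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ViewDedup {

    /** Simplified stand-in for a table entry (hypothetical, not Accumulo's Key/Value). */
    static final class Entry {
        final String row, family, qualifier, visibility, value;

        Entry(String row, String family, String qualifier, String visibility, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.visibility = visibility;
            this.value = value;
        }
    }

    /**
     * Keeps one entry per (row, family, qualifier): the entry whose visibility
     * label comes earliest in preferredOrder. Labels missing from the list rank last.
     * The input is assumed to contain only entries the scanning user may see.
     */
    static List<Entry> dedupe(List<Entry> visibleEntries, List<String> preferredOrder) {
        Map<List<String>, Entry> best = new LinkedHashMap<>();
        for (Entry e : visibleEntries) {
            List<String> column = List.of(e.row, e.family, e.qualifier);
            Entry current = best.get(column);
            if (current == null || rank(e, preferredOrder) < rank(current, preferredOrder)) {
                best.put(column, e);
            }
        }
        return new ArrayList<>(best.values());
    }

    private static int rank(Entry e, List<String> preferredOrder) {
        int i = preferredOrder.indexOf(e.visibility);
        return i < 0 ? Integer.MAX_VALUE : i;
    }
}
```

Given the two pt_dob entries above and the preference order PII_DOB, SHD_DOB, `dedupe` returns only the "1925-08-22" entry. A user who can see only the SHD_DOB entry gets "1925" back unchanged.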

> Generating multiple views of a value at scan time
> -------------------------------------------------
>
>                 Key: ACCUMULO-3970
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3970
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> It would be useful to have the ability to generate different representations of a key-value
> pair at scan time, based on the scan authorizations.
> For example, consider [HIPAA safe harbour de-identification|http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html#dates].
> One of the rules for de-identifying a patient's date of birth is that if a patient is 89 years
> old or younger, you can disclose his exact year of birth. If a patient is 90 years old or
> over, you pretend that he's 90 years old.
> You can imagine implementing this as a key/value mapping in accumulo like,
> {{(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"}}
> {{(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"}}
> Where the value corresponding to visibility SHD_DOB is produced at scan-time, depending
> on the patient's current age.
> Another example would be the ability to produce a salted hash of a unique identifier
> like a social security number or medical record number, where the salt (or the hash algorithm,
> or the work factor...) could be specified dynamically without having to re-code all the values
> in the system.
> More broadly speaking, this feature would give organizations more flexibility to change
> how they deidentify, transform or anonymize data to suit different access levels.
> Of course, to do this you'd need to have a pluggable component that can process key/value
> pairs before visibilities are evaluated. I can see why this might give a lot of people the
> heebie-jeebies but I'd like to gather as much feedback as possible. Looking forward to hearing
> your thoughts!
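
The date-of-birth rule described in the issue can be sketched as a small function. This is only an illustration of the rule as the reporter states it, not a compliance-reviewed implementation. The class and method names are hypothetical, and representing "pretend he's 90" as the birth year of someone who turns 90 today is an assumption.

```java
import java.time.LocalDate;
import java.time.Period;

final class SafeHarbourDob {

    /** Ages at or above this cap are all reported as the cap itself. */
    static final int AGE_CAP = 90;

    /**
     * Returns the year of birth to disclose for the de-identified view.
     * Patients aged 89 or younger get their real year of birth; older
     * patients get the year implied by being exactly AGE_CAP years old today.
     */
    static String disclosedYear(LocalDate dateOfBirth, LocalDate today) {
        int age = Period.between(dateOfBirth, today).getYears();
        if (age < AGE_CAP) {
            return String.valueOf(dateOfBirth.getYear());
        }
        // Assumption: the capped age is expressed as a synthetic birth year.
        return String.valueOf(today.minusYears(AGE_CAP).getYear());
    }
}
```

Because the result depends on the current date, a value stored at ingest time would eventually go stale, which is the reporter's motivation for producing the SHD_DOB value at scan time.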



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
