accumulo-notifications mailing list archives

From "Russ Weeks (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3970) Generating multiple views of a value at scan time
Date Sat, 22 Aug 2015 21:39:45 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708195#comment-14708195 ]

Russ Weeks commented on ACCUMULO-3970:
--------------------------------------

Oh, yeah, absolutely you need to have application logic to figure out what visibilities to
provide to a scan. But I'm not trying to solve the deduplication of columns. I think my example
wasn't very clear. Let's say I store one KV pair in the system,
{code}
(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"
{code}
And I do a scan with SHD_DOB authorization. As it stands, I'd see nothing. If I had two KV
pairs like,
{code}
(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"
(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"
{code}
And I do a scan with SHD_DOB authorization, I'll see the de-identified date of birth. If I
have a user with both PII_DOB and SHD_DOB auths, we're in agreement that I should have some
application logic that says, "identified is better than de-identified" and provides the correct
subset of the user's auths, or else provides the full set and sorts it out in a conventional iterator.
What I'm trying to get at is, I'd like to have:
{code}
(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"
{code}
In the table, and today, a scan with SHD_DOB authorization would return
{code}
(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"
{code}
Because the patient is 89 years old. But tomorrow, the same scan would return
{code}
(pt_id, demographic, pt_dob, SHD_DOB) -> "1925 or earlier"
{code}
Because now the patient is 90 years old and their date of birth needs to be de-identified
differently. I guess I could achieve this by storing all 3 representations:
{code}
(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"
(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"
(pt_id, demographic, pt_dob, SHD_DOB_TRUNC) -> "1925 or earlier"
{code}
But I think that approach becomes unwieldy very quickly.
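
To make the scan-time idea concrete, here is a minimal sketch in plain Python. It is not Accumulo code and not a proposed API: the function names (age_on, safe_harbor_dob, scan_view) are hypothetical, and the "PII beats SHD" preference is the application-logic assumption discussed above. Only the PII_DOB value is stored; the SHD_DOB view is derived from it and from the scan date.

```python
from datetime import date


def age_on(dob: date, today: date) -> int:
    """Whole years elapsed between dob and today."""
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return today.year - dob.year - (1 if before_birthday else 0)


def safe_harbor_dob(dob: date, today: date) -> str:
    """De-identified view of a date of birth.

    89 or younger: disclose the exact year of birth.
    90 or older: treat the patient as 90, so only an upper bound on the year.
    """
    if age_on(dob, today) >= 90:
        return f"{today.year - 90} or earlier"
    return str(dob.year)


def scan_view(stored_dob: str, auths: set, today: date):
    """Return (visibility, value) for one stored PII_DOB entry, or None.

    Assumption: an identified value is preferred over a de-identified one
    when the scan carries both authorizations.
    """
    if "PII_DOB" in auths:
        return ("PII_DOB", stored_dob)
    if "SHD_DOB" in auths:
        dob = date.fromisoformat(stored_dob)
        return ("SHD_DOB", safe_harbor_dob(dob, today))
    return None


# The day before the 90th birthday, and the birthday itself.
print(scan_view("1925-08-22", {"SHD_DOB"}, date(2015, 8, 21)))
print(scan_view("1925-08-22", {"SHD_DOB"}, date(2015, 8, 22)))
```

The point of the sketch is that the third representation (SHD_DOB_TRUNC) never has to be written: the same stored entry yields a different view once the scan date crosses the patient's 90th birthday.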

> Generating multiple views of a value at scan time
> -------------------------------------------------
>
>                 Key: ACCUMULO-3970
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3970
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> It would be useful to have the ability to generate different representations of a key-value
pair at scan time, based on the scan authorizations.
> For example, consider [HIPAA safe harbour de-identification|http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html#dates].
One of the rules for de-identifying a patient's date of birth is that if a patient is 89 years
old or younger, you can disclose his exact year of birth. If a patient is 90 years old or
over, you pretend that he's 90 years old.
> You can imagine implementing this as a key/value mapping in Accumulo like,
> {{(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"}}
> {{(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"}}
> Where the value corresponding to visibility SHD_DOB is produced at scan-time, depending
on the patient's current age.
> Another example would be the ability to produce a salted hash of a unique identifier
like a social security number or medical record number, where the salt (or the hash algorithm,
or the work factor...) could be specified dynamically without having to re-code all the values
in the system.
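
As a rough illustration of the salted-hash case, the sketch below derives a pseudonym from an identifier at read time, with the salt and work factor passed in as parameters. The function name pseudonymize and the sample identifier are hypothetical, and PBKDF2 with SHA-256 is only one possible choice of algorithm, picked here because it is in the Python standard library.

```python
import hashlib


def pseudonymize(identifier: str, salt: bytes, iterations: int = 100_000) -> str:
    """Derive a hex pseudonym from an identifier.

    The salt and the iteration count (work factor) are supplied by the
    caller, so they can change without rewriting the stored identifier.
    """
    digest = hashlib.pbkdf2_hmac(
        "sha256", identifier.encode("utf-8"), salt, iterations
    )
    return digest.hex()


# Same stored value, two different salts: two unrelated pseudonyms.
stored = "000-00-0000"  # placeholder identifier, not real data
print(pseudonymize(stored, b"salt-for-study-A"))
print(pseudonymize(stored, b"salt-for-study-B"))
```

Because the stored value stays the same and only the parameters vary, switching the salt or raising the work factor would not require re-coding existing entries.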
> More broadly speaking, this feature would give organizations more flexibility to change
how they deidentify, transform or anonymize data to suit different access levels.
> Of course, to do this you'd need to have a pluggable component that can process key/value
pairs before visibilities are evaluated. I can see why this might give a lot of people the
heebie-jeebies, but I'd like to gather as much feedback as possible. Looking forward to hearing
your thoughts!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
