accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3970) Generating multiple views of a value at scan time
Date Sat, 22 Aug 2015 06:18:45 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707910#comment-14707910 ]

ASF GitHub Bot commented on ACCUMULO-3970:
------------------------------------------

GitHub user rweeks opened a pull request:

    https://github.com/apache/accumulo/pull/43

    Accumulo 3970

    Here's a sketch of what I'm thinking for ACCUMULO-3970. It's by no means ready to be merged;
I just want to send the PR as a starting point for discussion.
    
    The idea is to define a new type of iterator, a VisibilityTransformingIterator, which
can be set on a table. The iterator is only active in scan scope and is applied before the
scan authorizations are evaluated. Concrete subclasses of the VisibilityTransformingIterator receive individual
key-value pairs that are present in the table and, for each pair, produce zero or more extra
representations of that value. How these extra representations are produced will probably
be driven by how an organization wants to anonymize or de-identify its data.
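
To make the idea above concrete, here is a minimal, dependency-free Java sketch. It is not the code in the pull request and does not use the Accumulo API: the Entry class, the Transformer interface and the scan method are hypothetical names invented for illustration, and a visibility is simplified to a single label rather than a full visibility expression. It only shows the ordering described above: extra representations are generated first, and the authorization filter runs afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class VisibilityTransformSketch {

    /** Simplified stand-in for an Accumulo key-value pair (hypothetical type). */
    static final class Entry {
        final String row, family, qualifier, visibility, value;

        Entry(String row, String family, String qualifier, String visibility, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.visibility = visibility;
            this.value = value;
        }
    }

    /** Hypothetical hook: returns zero or more extra representations of one stored entry. */
    interface Transformer {
        List<Entry> transform(Entry stored);
    }

    /**
     * Simulates a scan: each stored entry is expanded by the transformer first,
     * and only then filtered against the caller's authorizations.
     * Assumption: a visibility is a single label, not a boolean expression.
     */
    static List<Entry> scan(List<Entry> table, Transformer transformer, Set<String> auths) {
        List<Entry> visible = new ArrayList<>();
        for (Entry stored : table) {
            List<Entry> candidates = new ArrayList<>();
            candidates.add(stored);
            candidates.addAll(transformer.transform(stored));
            for (Entry candidate : candidates) {
                if (auths.contains(candidate.visibility)) {
                    visible.add(candidate);
                }
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Entry> table = Collections.singletonList(
            new Entry("pt_id", "demographic", "pt_dob", "PII_DOB", "1925-08-22"));

        // Illustrative transformer: derive a year-only view under a second label.
        Transformer yearOnly = e -> e.visibility.equals("PII_DOB")
            ? Collections.singletonList(
                new Entry(e.row, e.family, e.qualifier, "SHD_DOB", e.value.substring(0, 4)))
            : Collections.<Entry>emptyList();

        Set<String> auths = new HashSet<>(Arrays.asList("SHD_DOB"));
        for (Entry e : scan(table, yearOnly, auths)) {
            System.out.println(e.visibility + " -> " + e.value);
        }
    }
}
```

A caller holding only the derived label never sees the stored value, even though the derived entry was never written to the table.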

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/phemisystems/accumulo ACCUMULO-3970

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/43.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #43
    
----
commit 646974ebe519246ea5340f05211972a8ae951543
Author: Russ Weeks <rweeks@phemi.com>
Date:   2015-08-15T04:53:51Z

    Added test for VisibilityTransformingIterator

commit e6376c8011346c9ef8eb3d0ac762f504a6fa42e3
Author: Russ Weeks <rweeks@newbrightidea.com>
Date:   2015-08-16T03:49:44Z

    Unit tests passing

commit a72ed29175970158b216d0233aba39eff8cb4c0a
Author: Russ Weeks <rweeks@newbrightidea.com>
Date:   2015-08-21T18:48:22Z

    Merge remote-tracking branch 'origin/master' into vti

commit b4d89da57603dff4f8db216c8835948c21c6d689
Author: Russ Weeks <rweeks@newbrightidea.com>
Date:   2015-08-22T06:13:01Z

    Fixing my new Property definition. Seems like a CLASSNAME type property can't have a default
value of null

commit 06718e203fe7fd22d4b8639ff3b76382b2090333
Author: Russ Weeks <rweeks@newbrightidea.com>
Date:   2015-08-22T06:14:23Z

    Removing some commented-out code

----


> Generating multiple views of a value at scan time
> -------------------------------------------------
>
>                 Key: ACCUMULO-3970
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3970
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> It would be useful to have the ability to generate different representations of a key-value
pair at scan time, based on the scan authorizations.
> For example, consider [HIPAA safe harbour de-identification|http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html#dates].
One of the rules for de-identifying a patient's date of birth is that if a patient is 89 years
old or younger, you can disclose his exact year of birth. If a patient is 90 years old or
over, you pretend that he's 90 years old.
> You can imagine implementing this as a key/value mapping in Accumulo like:
> {{(pt_id, demographic, pt_dob, PII_DOB) -> "1925-08-22"}}
> {{(pt_id, demographic, pt_dob, SHD_DOB) -> "1925"}}
> Where the value corresponding to visibility SHD_DOB is produced at scan-time, depending
on the patient's current age.
> Another example would be the ability to produce a salted hash of a unique identifier
like a social security number or medical record number, where the salt (or the hash algorithm,
or the work factor...) could be specified dynamically without having to re-code all the values
in the system.
> More broadly speaking, this feature would give organizations more flexibility to change
how they de-identify, transform or anonymize data to suit different access levels.
> Of course, to do this you'd need to have a pluggable component that can process key/value
pairs before visibilities are evaluated. I can see why this might give a lot of people the
heebie-jeebies, but I'd like to gather as much feedback as possible. Looking forward to hearing
your thoughts!
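
The two transformations mentioned in the description above could look roughly like the following Java sketch. The class and method names are hypothetical, and the date rule is implemented only as the description paraphrases it (exact birth year up to age 89, otherwise the birth year of someone who is 90 today); it is not a vetted reading of the HIPAA guidance. The hash uses plain SHA-256 over salt plus identifier as an assumption for illustration; SHA-256 has no work factor, so a key-derivation function would be needed for that part of the idea.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;
import java.time.Period;

final class DeidentifySketch {

    /**
     * Year of birth to disclose, following the rule as paraphrased in the issue:
     * patients aged 89 or younger keep their real birth year, older patients
     * are reported as if they were exactly 90 on the scan date.
     */
    static String safeHarborYear(LocalDate dateOfBirth, LocalDate scanDate) {
        int age = Period.between(dateOfBirth, scanDate).getYears();
        if (age <= 89) {
            return String.valueOf(dateOfBirth.getYear());
        }
        return String.valueOf(scanDate.getYear() - 90);
    }

    /**
     * Hex-encoded SHA-256 of salt followed by identifier. The salt is a
     * parameter, so it can change without rewriting the stored identifiers.
     */
    static String saltedHash(String identifier, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt.getBytes(StandardCharsets.UTF_8));
            byte[] digest = md.digest(identifier.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every conforming Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LocalDate scanDate = LocalDate.of(2015, 8, 22);
        System.out.println(safeHarborYear(LocalDate.of(1925, 8, 22), scanDate));
        System.out.println(saltedHash("example-identifier", "example-salt"));
    }
}
```

Because both functions take the scan date or the salt as arguments, the derived value can change over time or per deployment while the stored value stays untouched, which is the flexibility the issue asks for.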



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
