accumulo-user mailing list archives

From Jeff Kunkle <kunkl...@gmail.com>
Subject Re: "NOT" operator in visibility string
Date Wed, 19 Mar 2014 18:04:07 GMT
> Is a sandbox the same thing as a workspace? Can the terms be used interchangeably? Just
want to make sure I'm not misinterpreting your answers.

Yes. Sorry I wasn’t consistent with the terminology. 

> Is it fair to describe each sandbox as a separate index table for the global data set?
And then when users do deletes, it is only reflected in the index fields, right?

Not quite. The sandbox is a pointer only. Changes are made directly to the global data, scoped
with an additional workspace visibility. For edits (which I haven’t been talking about),
this ends up being a new column because of the visibility change. For a delete, we’d like
to simply add a !workspace_name visibility and delete the original column.
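
As a rough illustration of the semantics being proposed here, the following Python sketch evaluates a visibility expression that allows a prefix `!`. It is not Accumulo code: the `!` operator is the hypothetical extension under discussion, the function name `visible` and the labels `intel` and `workspace_a` are made up for the example, and it assumes that viewing a workspace means presenting the workspace name as an authorization.

```python
import re


def visible(expression, authorizations):
    """Return True if `authorizations` satisfy `expression`.

    Handles labels, '&', '|', parentheses, and a hypothetical prefix '!'.
    Assumes the expression is well formed.
    """
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|!()]", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side to consume its tokens
            result = result or rhs
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            result = parse_or()
            take()  # closing ')'
            return result
        return take() in authorizations  # a plain label

    return parse_or()


# A record originally labeled "intel", deleted only within workspace_a.
scoped = "intel&!workspace_a"
outside_workspace = visible(scoped, {"intel"})
inside_workspace = visible(scoped, {"intel", "workspace_a"})
```

Under these assumptions a reader who holds `intel` but is not viewing `workspace_a` still sees the record, while the same reader inside `workspace_a` does not.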

> But you can't just delete values from the index because you need to keep track of the
changes in case the user decides to delete globally (after appropriate authorization checks,
etc...)

Correct. The user may also choose to undo/abandon the delete.

> Because the visibility is part of the key, changing it involves re-writing the data.
Which might be just an index record in your case. However, this is generally an expensive
operation.

It would be an operation to the global data with a workspace-scoped visibility. It wouldn’t
be terribly expensive with a NOT operator in the case of deletes because there’s only one
record to change.
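
The cost argument above can be sketched with a toy model in Python. The table is a plain dict keyed by (row, family, qualifier, visibility), which mirrors the point that visibility is part of the key. The helper names, the sample keys, and the way the original visibility is parenthesized are assumptions for illustration, not Accumulo's API.

```python
def scope_delete_to_workspace(table, key, workspace):
    """Hide one entry from `workspace` by rewriting only that entry.

    Because visibility is part of the key, the change is modeled as
    deleting the original entry and writing it back under a new key.
    """
    row, family, qualifier, visibility = key
    value = table.pop(key)  # delete the original column
    new_key = (row, family, qualifier, f"({visibility})&!{workspace}")
    table[new_key] = value  # re-add it with the workspace-scoped visibility
    return new_key


def undo_scoped_delete(table, new_key, original_visibility):
    """Abandon the workspace delete by restoring the original visibility."""
    row, family, qualifier, _ = new_key
    value = table.pop(new_key)
    original_key = (row, family, qualifier, original_visibility)
    table[original_key] = value
    return original_key


# Hypothetical global data: two entries, one of which gets hidden.
table = {
    ("doc1", "meta", "title", "intel"): "Report A",
    ("doc2", "meta", "title", "intel"): "Report B",
}
hidden_key = scope_delete_to_workspace(
    table, ("doc1", "meta", "title", "intel"), "workspace_a"
)
```

Only the one affected entry is touched, regardless of how many sandboxes exist or how much other data the table holds.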

I really appreciate you thinking about this problem, Mike. My team has spent a long time discussing
a solution and felt the NOT operator would work best for our situation. We’re happy to consider
other approaches too, though.


On Mar 19, 2014, at 1:46 PM, Mike Drob <madrob@cloudera.com> wrote:

> Thanks, that's really helpful. Couple more questions.
> 
> Is a sandbox the same thing as a workspace? Can the terms be used interchangeably? Just
want to make sure I'm not misinterpreting your answers.
> 
> Is it fair to describe each sandbox as a separate index table for the global data set?
And then when users do deletes, it is only reflected in the index fields, right?
> But you can't just delete values from the index because you need to keep track of the
changes in case the user decides to delete globally (after appropriate authorization checks,
etc...)
> 
> Because the visibility is part of the key, changing it involves re-writing the data.
Which might be just an index record in your case. However, this is generally an expensive
operation.
> 
> I think I need to think on this use case some more, it's definitely interesting and not
something I had considered before.
> 
> 
> On Wed, Mar 19, 2014 at 1:24 PM, Jeff Kunkle <kunklejr@gmail.com> wrote:
>> You have a large amount of data, that is generally readable by all users.
> 
> Not necessarily. All data has some visibility constraint that a user's authorizations
may or may not satisfy.
> 
>> Users create their own sandbox, from which they can later exclude portions of the
global data set.
> 
> Yes, users create their own sandboxes which are populated with global data. They may
decide to delete some of that data and the change needs to be scoped to their sandbox until
the change is published globally.
> 
>> Users can share their sandbox with others, so really we are talking about sandbox
permissions and not so much user permissions.
> 
> Yes, users can share their sandbox with others, but a sandbox is just a collection of
pointers to data. Users sharing a workspace may not necessarily see all of the same data depending
on their authorizations.
> 
>> Sandboxes are created often. Or, at least much more often than the data changes.
> 
> Yes, sandboxes are created often. The data is likely to be ingested more frequently than
sandboxes will be created.
> 
>> Do users typically remove large amounts of data from their sandbox? 1%? 10%? 99%?
> 
> I don’t have good numbers to share here.
> 
>> Assuming data is removed via rules, are the rules applied automatically to new data
under ingest?
> I would say no, although I’m not positive I understand the question. Users are not
removing data from their sandbox per se, but they may delete data that should then be hidden
from their workspace. The data is not really deleted though and is still visible to other
users in other sandboxes. Only when the deletion is published does it get deleted for everyone.
> 
> On Mar 19, 2014, at 1:03 PM, Mike Drob <madrob@cloudera.com> wrote:
> 
>> Wait, I'm really confused by what you are describing, Jeff. Sorry if these are obvious
questions, but can you help me get a better grasp of your use case?
>> 
>> You have a large amount of data, that is generally readable by all users.
>> Users create their own sandbox, from which they can later exclude portions of the
global data set.
>> Users can share their sandbox with others, so really we are talking about sandbox
permissions and not so much user permissions.
>> Sandboxes are created often. Or, at least much more often than the data changes.
>> 
>> Are those all accurate statements? If so, can you clarify the following points:
>> 
>> Do users typically remove large amounts of data from their sandbox? 1%? 10%? 99%?
>> Assuming data is removed via rules, are the rules applied automatically to new data
under ingest?
>> 
>> Thanks,
>> Mike
>> 
>> 
>> On Wed, Mar 19, 2014 at 12:54 PM, Jeff Kunkle <kunklejr@gmail.com> wrote:
>> Hi John,
>> 
>> Yes it’s accurate that the system controls the label and who is associated with
it; there are no Accumulo-internal user accounts. But I don’t think it’s feasible to remove
a sandbox label from something that should be hidden. Such a scenario would imply that all
data is “tagged” with the labels of every sandbox that is allowed to see the data, which
would be most. It would also imply that the creation of a new sandbox would necessitate changing
the visibility of everything in Accumulo to include the new sandbox label, effectively rewriting
the entire database. Sandboxes are created and deleted all the time in our application, so
it doesn’t seem like a feasible solution to me.
>> 
>> -Jeff
>> 
>> On Mar 19, 2014, at 12:16 PM, Josh Elser <josh.elser@gmail.com> wrote:
>> 
>> > It kind of sounds like you could manage this much more easily by controlling the
authorizations a user gets (notably the workspace name) and the grant/revoke above the Accumulo
level.
>> >
>> > A sandbox has a unique label and the external system controls which users are
granted that label. This way, each sandbox can be modified individually (using authorizations
that contain the data visibility and the sandbox label) or the original data set could be
modified (by omitting a sandbox label in the authorizations used).
>> >
>> > Is that accurate?
>> >
>> > On 3/19/14, 12:05 PM, Jeff Kunkle wrote:
>> >> I attempted to simplify the scenario to facilitate discussion, which on
>> >> second thought may have been a mistake. Here’s the whole scenario:
>> >>
>> >> Different users have access to different subsets of the data depending
>> >> on their authorizations and the visibility of the data. Users “work
>> >> with” the data in what we call a sandbox. Sandboxes can be shared with
>> >> other users (this is the group creation I was talking about earlier).
>> >> Deletes to the data would be “scoped” to the sandbox by changing the
>> >> visibility to add “& !workspace_name” so that people viewing the
>> >> workspace wouldn’t see the data but everyone else would.
>> >>
>> >> On Mar 19, 2014, at 11:48 AM, Sean Busbey <busbey+lists@cloudera.com
>> >> <mailto:busbey+lists@cloudera.com>> wrote:
>> >>
>> >>> On Wed, Mar 19, 2014 at 10:43 AM, Jeff Kunkle <kunklejr@gmail.com
>> >>> <mailto:kunklejr@gmail.com>> wrote:
>> >>>
>> >>>    New groups are created on the fly by our application when needed.
>> >>>    Under the scenario you describe we’d have to go through all the
>> >>>    data in Accumulo whenever a group is created so that users in the
>> >>>    group can see the existing data.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Ah! So your use case is that all data defaults to world readable and
>> >>> then users have the option of opting out of seeing subsets. Right?
>> >>>
>> >>> In your scenario user groups also get to opt-out of seeing data on the
>> >>> fly, yes? Both require rewriting the data. Does the group creation
>> >>> happen more often?
>> >>
>> 
>> 
> 
> 

