hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10899) Add functionality to re-encrypt EDEKs.
Date Thu, 09 Feb 2017 02:14:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858880#comment-15858880
] 

Andrew Wang commented on HDFS-10899:
------------------------------------

Hi Xiao, I soaked in this patch for a while, had a bunch of review comments. I didn't get
through everything, but I think this is worth posting now, and I'll give a more detailed review
of the next rev:

ReencryptionHandler:
* Rather than having get/setProvider, can we set it in the EZManager constructor by getting
it from {{dir.getFSNamesystem().getProvider()}} ? Then we don't need the retry loop in {{run}}
either.
* In FSEditLog, normally we log one edit per call to a {{log}} function. This is broken by
{{logReencryptZoneStatus}}, could you change it so ReencryptionHandler does the for loop over
the batch calling {{logSetXAttrs}} then a final call to {{logReencryptZoneStatus}} ?
* We shouldn't call logSync inside the write lock in reencryptDir, since it's a time consuming
operation. In the RPC context, we call this outside the lock, just before responding to the
caller. This means there's a little gap where readers could see uncommitted data. For reencrypt,
this is actually very safe since the old and new EDEK are functionally equivalent, and we
aren't providing any guarantees while the zone is still reencrypting.
* We need the write lock held when logging to the FSEditLog.
* I'd prefer we gather all the file EDEK information in one shot with just the read lock held,
do all the KMS communication, then take the writelock at the end to log the batch. This also
makes it easier to multi-thread the KMS communication stage later, and better performance
since we're doing more work under the lock.
* There are lock/unlocks across functions which makes this harder to understand. For instance,
unlock then relock in reencryptINode because it was locked in the caller. Consider refactoring?
It'd be great if all locking was handled with simple lock/unlock in try/finally.
* restoreReencryptFromProgress, javadoc mentions {{lastFileProcessed}} but abbreviations here
and below are {{lpf}}.
* Is it possible to rework reencryptEncryptionZoneInt so we only call reencryptDir once after
setting up the iteration state? I think this would also make restoreReencryptFromProgress
clearer, since it could have fewer arguments.
* Would also consider renaming restoreReencryptFromProgress, like {{restoreFromLastFileProcessed}}
* Is it possible to encapsulate the file iteration as an iterator? We pass {{subdirs}} around
a lot right now.

ReencryptionStatus:
* Add Precondition check to add, to make sure that it's not added if already present

EZManager:
* typo: zondId
* What does the "Id" in {{getIdRootEncryptionZone}} mean?
* getIdRootEncryptionZone checks for read in resolvePath, but aren't reencrypt and cancel
write operations? We don't want to be running this on snapshot paths. I think we should push
the resolvePath up to the caller of this function, since it also lets us reuse the IIP if
necessary (resolvePath is expensive).
* In FSNamesystem, looks like we call {{resolvePath}} in reencryptEncryptionZoneInt but not
the IIP version. The normal flow is to resolvePath to an IIP in the corresponding FSDir file,
then pass it down to avoid resolving again for performance reasons.

ReencryptionHandler#checkNNReadyForWrite and catching exceptions
* {{checkNNReadyForWrite}} checks that the NN is active. The handler should only be live when
the NN is active though, stopActiveServices will interrupt the thread. If we're worried about
this, it'd be better to periodically check for the interrupt and throw InterruptedException.
* With this, we can change {{checkNNReadyForWrite}} into a simple safemode check. Since exceptions
are expensive, I'd prefer a boolean return value and if statement instead.
* Note also that safemode checks need to be done with the lock held, so we should do it where
we're grabbing the lock anyway. For correctness, I think we only need to check this just before
we do a write (i.e. apply the new EDEKs to our in-memory state). We can still check elsewhere
to save doing extra work.
* Related to the above, in two places in {{run}} we catch InterruptedException and just log
and return. This is a code smell; although it's the top-level function, it's still better
to re-interrupt the thread before returning.
* I like this article I found on InterruptedException: https://www.ibm.com/developerworks/library/j-jtp05236/
* Catching Exception in {{run}} is a code smell. What is the intent? It looks like we already
catch the checked exceptions, so this will catch RuntimeExceptions (which are normally unrecoverable).

Metrics:
* I'd prefer that we track the per zone metrics in ReencryptionStatus, since it's what also
tracks the currentZone. This avoids having to reset the metrics on removeZone.
* Follow-on idea: it'd be nice for admins to be able to query the status of queued and running
reencrypt commands. Progress indicators like submission time, start time, # skipped, # reencrypted,
total # (if this is cheap to get) would be helpful.

Others:
* Looks like we aren't using the op cache in FSEditLog SetXAttrOp / RemoveXAttrOp. I think
this is accidental, could you do some research? Particularly since we'll be doing a lot of
SetXAttrOps, avoiding all that object allocation would be nice. This could be a separate JIRA.
* DFSTestUtil, the new comment says "Pause re-encrypt to allow the cancel command can go through",
but IIUC this isn't about authorization but instead pausing reencryption so there's an active
zone reencryption to cancel. Could you reword?
* fsimage.proto and loader code, since the lastFile belongs to a specific zone, I'd prefer
we define the on-disk format like this rather than needing to know about zone ordering:
* Rather than making FSNamesystem#getFileInfo public for tests, can we dig it out from the
RPC interface?

{code}
message EncryptionManagerSection {
  message ReencryptCommand {
    required uint64 zone     = 1;
    optional string lastFile = 2;
  }
  repeated ReencryptCommand reencryptCommands = 1;
}
{code}

With the current on-disk format, we can only reencrypt one zone at a time. Later, we might
want to reencrypt in parallel.

Follow-on:
* This could be difficult with all the lock/unlocks and stages, but I'd prefer a goal-pause-time
configuration for the {{run}} loop. This is easier for admins to reason about. We would still
use the batch size for determining when to log a batch.

> Add functionality to re-encrypt EDEKs.
> --------------------------------------
>
>                 Key: HDFS-10899
>                 URL: https://issues.apache.org/jira/browse/HDFS-10899
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: encryption, kms
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>         Attachments: editsStored, HDFS-10899.01.patch, HDFS-10899.02.patch, HDFS-10899.03.patch,
HDFS-10899.04.patch, HDFS-10899.05.patch, HDFS-10899.06.patch, HDFS-10899.07.patch, HDFS-10899.wip.2.patch,
HDFS-10899.wip.patch, Re-encrypt edek design doc.pdf
>
>
> Currently when an encryption zone (EZ) key is rotated, it only takes effect on new EDEKs.
We should provide a way to re-encrypt EDEKs after the EZ key rotation, for improved security.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org


Mime
View raw message