hadoop-common-issues mailing list archives

From "Steve Loughran (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-16522) Encrypt S3A buffered data on disk
Date Tue, 20 Aug 2019 17:07:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-16522:
    Summary: Encrypt S3A buffered data on disk  (was: Encrypt buffered data on disk)

> Encrypt S3A buffered data on disk
> ---------------------------------
>                 Key: HADOOP-16522
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16522
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Mike Yoder
>            Priority: Major
> This came out of discussions with [~stevel@apache.org], [~irashid] and [~vanzin].
> Imran:
> {quote}
> Steve pointed out to me that the s3 libraries buffer data to disk. This is pretty much
arbitrary user data.
> Spark has some settings to encrypt data that it writes to local disk (shuffle files etc.).
Spark never has control of what arbitrary libraries are doing with data, so it doesn't guarantee
that nothing ever ends up on disk -- but to the end user, they'd view those s3 libraries as
part of the same system. So if a user is turning on Spark's local-disk encryption, they
would be pretty surprised to find out that the data they're writing to S3 ends up on local disk.
> {quote}
> Me:
> {quote}
> ... Regardless, this is still an s3a bug.
> {quote}
> Steve:
> {quote}
> I disagree.
> We need to save intermediate data "somewhere"; people get a choice of disk or memory.
> Encrypting data on disk was never considered necessary, on the basis that anyone malicious
with read access under your home dir could lift the hadoop token file which YARN provides,
and so have full R/W access to all your data in the cluster filesystems until those tokens
expire. If you don't have a good story there, then the buffering of a few tens of MB of data
during upload is a detail.
> There's also the extra complication that when uploading file blocks, we pass the filename
to the AWS SDK and let it do the uploads, rather than create the output stream; the SDK code
has, in the past, been better at recovering from failures there than output stream + mark and reset.
That was a while back; things may change. But it is why I'd prefer any encrypted temp store
as a new buffer option, rather than silently changing the "disk" buffer option to encrypt.
> It would be interesting to see where else in the code this needs to be addressed; I'd recommend
looking at all uses of org.apache.hadoop.fs.LocalDirAllocator and making sure that Spark YARN
launch+execute doesn't use this indirectly.
> JIRAs under HADOOP-15620 welcome; do look at the test policy in the hadoop-aws docs;
we'd need a new subclass of AbstractSTestS3AHugeFiles for integration testing a different
buffering option, plus whatever unit tests the encryption itself needed.
> {quote}
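
The encrypted buffer option Steve sketches above could look roughly like the following. This is a minimal illustration using the JDK's javax.crypto streams, not the S3A implementation; the `EncryptedDiskBuffer` class, its method names, and the idea of a per-block in-memory key are all hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.SecureRandom;

/**
 * Hypothetical sketch of an encrypted on-disk buffer: data written to the
 * temp file is AES/CTR-encrypted with a per-file key that lives only in
 * memory, so anyone reading the local buffer directory sees only ciphertext.
 */
public class EncryptedDiskBuffer {
    private final File file;
    private final SecretKey key;
    private final IvParameterSpec iv;

    public EncryptedDiskBuffer(File dir) throws Exception {
        this.file = File.createTempFile("s3a-block-", ".enc", dir);
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        this.key = kg.generateKey();          // key never touches disk
        byte[] ivBytes = new byte[16];
        new SecureRandom().nextBytes(ivBytes);
        this.iv = new IvParameterSpec(ivBytes);
    }

    /** Wrap the temp file in an encrypting stream for buffering writes. */
    public OutputStream openForWrite() throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, iv);
        return new CipherOutputStream(new FileOutputStream(file), c);
    }

    /** Decrypting stream for replaying the buffered block at upload time. */
    public InputStream openForRead() throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, iv);
        return new CipherInputStream(new FileInputStream(file), c);
    }

    public File getFile() { return file; }
}
```

Note the design constraint Steve raises: the current upload path hands a File to the AWS SDK, which would read ciphertext from an encrypted buffer. An encrypted option would have to switch that path to stream-based uploads, which is precisely why he suggests it as a new buffer option rather than a silent change to "disk".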
> Me:
> {quote}
> I get it. But ... there are a couple of subtleties here. One is that the tokens expire,
while the data is still data. (This might or might not matter, depending on the threat...)
Another is that customer policies in this area do not always align well with common sense.
There are blanket policies like "data shall never be written to disk unencrypted" which we
have come up against, and which we'd like to be able to honestly answer in the affirmative. We
have encrypted MR shuffle as one historical example, and encrypted Impala memory spills as
> {quote}

This message was sent by Atlassian Jira

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
