hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase
Date Thu, 12 Mar 2015 19:52:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359257#comment-14359257 ]

Steve Loughran commented on HADOOP-11708:

bq. FWIW, I just picked the first unreleased versions on the jira. 

OK, setting 2.8 as the target.

bq. It's chasing one undocumented and likely broken implementation with another one.

"Broken" is an opinion I'm not sure I agree with:

# The behaviour is certainly not documented or explicitly specified in the [FS compatibility
# It is a stronger concurrency/consistency model than the one presented by {{OutputStream}}, so {{DFSOutputStream}} can still be used wherever an {{OutputStream}} is needed.
# It's clear that this behaviour is expected by at least one application.

In [FileSystem|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html],
for {{listStatus()}} and {{mkdirs()}}, we do explicitly call out the atomicity/concurrency expectations
*as defined by HDFS*. Some of those are not the result of deliberate decisions (the fact
that {{mkdirs()}} is atomic is due to the NN grabbing a lock for optimised directory path creation), but
they are behaviours that we have to accept as de facto standards, as defined by the applications running above HDFS.
All we can do is document them for the benefit of other filesystems seeking Hadoop HDFS compatibility,
and try not to change them in HDFS in ways that break applications. Having that documentation
call out concurrency semantics on output streams is the way to do this. Given that
HDFS encryption is intended to be transparent, it is going to have to offer a concurrency
& consistency model consistent with that of the unencrypted stream.
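
To make "consistent concurrency model" concrete, here is a minimal sketch of one way a stream that is not thread safe could be made to present serialized write/flush calls to its callers. The class name {{SerializedOutputStream}} is hypothetical and this is not a proposed patch for {{CryptoOutputStream}}; it only illustrates the kind of guarantee that would need to be written down.

```java
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical wrapper (illustration only): every call takes the same monitor,
 * so a single writer and any number of flushing threads never interleave
 * inside the wrapped stream.
 */
class SerializedOutputStream extends OutputStream {
    private final OutputStream inner; // assumed NOT to be thread safe

    SerializedOutputStream(OutputStream inner) {
        this.inner = inner;
    }

    @Override
    public synchronized void write(int b) throws IOException {
        inner.write(b);
    }

    @Override
    public synchronized void write(byte[] buf, int off, int len) throws IOException {
        inner.write(buf, off, len);
    }

    @Override
    public synchronized void flush() throws IOException {
        inner.flush();
    }

    @Override
    public synchronized void close() throws IOException {
        inner.close();
    }
}
```

Locking at the outermost layer is the simplest option, but it also serializes calls that the inner stream might have been able to overlap, so it is a trade-off rather than an obvious answer.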

> CryptoOutputStream synchronization differences from DFSOutputStream break HBase
> -------------------------------------------------------------------------------
>                 Key: HADOOP-11708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11708
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.6.0
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Critical
> For the write-ahead-log, HBase writes to DFS from a single thread and sends sync/flush/hflush
> from a configurable number of other threads (default 5).
> FSDataOutputStream does not document anything about being thread safe, and it is not
> thread safe for concurrent writes.
> However, DFSOutputStream is thread safe for concurrent writes + syncs. When it is the
> stream FSDataOutputStream wraps, the combination is threadsafe for 1 writer and multiple syncs
> (the exact behavior HBase relies on).
> When HDFS Transparent Encryption is turned on, CryptoOutputStream is inserted between
> FSDataOutputStream and DFSOutputStream. It is proactively labeled as not thread safe, and
> this composition is not thread safe for any operations.
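
The access pattern in the description (one writer, several concurrent sync threads) can be sketched with plain {{java.io}} streams. Everything here is an assumption made for illustration: {{WalAccessPattern}} and {{LockedSink}} are hypothetical names, {{flush()}} stands in for sync/hflush, and none of this is HBase or HDFS code. Only the default of 5 sync threads comes from the description.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class WalAccessPattern {

    /** Hypothetical stand-in for a stream whose write and flush are synchronized. */
    static final class LockedSink extends OutputStream {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private int flushed; // bytes covered by the most recent flush

        @Override
        public synchronized void write(int b) {
            buffer.write(b);
        }

        @Override
        public synchronized void flush() {
            flushed = buffer.size();
        }

        synchronized int flushedBytes() {
            return flushed;
        }
    }

    /** The calling thread is the single writer; syncThreads threads flush concurrently. */
    static void run(OutputStream out, int syncThreads, int records) throws Exception {
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicReference<Throwable> failure = new AtomicReference<>();
        List<Thread> syncers = new ArrayList<>();
        for (int i = 0; i < syncThreads; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (!done.get()) {
                        out.flush(); // stands in for sync/hflush
                    }
                } catch (Throwable e) {
                    failure.compareAndSet(null, e); // remember the first failure
                }
            });
            syncers.add(t);
            t.start();
        }
        try {
            for (int i = 0; i < records; i++) {
                out.write(i & 0xff); // single-threaded writes
            }
        } finally {
            done.set(true);
            for (Thread t : syncers) {
                t.join();
            }
        }
        out.flush(); // final sync once the writer has finished
        if (failure.get() != null) {
            throw new IllegalStateException("a sync thread failed", failure.get());
        }
    }
}
```

Calling {{WalAccessPattern.run(new WalAccessPattern.LockedSink(), 5, 10000)}} exercises the pattern against a stream that tolerates it. Passing a stream that is not thread safe instead is where corrupted state or exceptions would be expected to show up, although a race like this is not guaranteed to reproduce on any given run.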

This message was sent by Atlassian JIRA