hadoop-common-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-6311) Add support for unix domain sockets to JNI libs
Date Tue, 16 Oct 2012 21:03:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477345#comment-13477345 ]

Todd Lipcon commented on HADOOP-6311:
-------------------------------------

Hi Colin,

Thanks for writing up the design doc. I think it probably should actually go on HDFS-347,
which is the overall feature JIRA, rather than this one, which is about one of the implementation
subtasks. But, anyway, here are some comments:

{quote}
* Portable.  Hadoop supports multiple operating systems and environments,
including Linux, Solaris, and Windows.
{quote}

IMO, there is not a requirement that performance enhancements work on all systems. It is up
to the maintainers of each port to come up with the most efficient way to do things. Though
there is an active effort to get Hadoop working on Windows, it is not yet a requirement.

So long as we maintain the current TCP-based read (which we have to, anyway, for remote access),
we'll have portability. If the Windows port doesn't initially offer this feature, that seems
acceptable to me, as they could later add whatever mechanism makes the most sense for them.

{quote}
* High performance.  If performance is compromised, there is no point to any of
this work, because clients could simply use the existing, non-short-circuit
write pathways to access data.
{quote}
Should clarify that the performance of the mechanism by which FDs are _passed_ is less important,
since the client will cache the open FDs and just re-use them for subsequent random reads
against the same file (the primary use case for this improvement). So long as the overhead
of passing the FDs isn't huge, we should be OK.
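
As a rough illustration of that caching idea (all of the class and method names below are made up, not existing HDFS APIs), the client-side cache could look something like this:
{code}
import java.io.FileDescriptor;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: FDs received from the DN are cached per block, so the
 * (comparatively slow) FD-passing handshake happens once per block open rather
 * than once per read.
 */
class CachedBlockFds {
  /** Stand-in for whatever actually performs the unix-socket FD request. */
  interface FdRequester {
    FileDescriptor requestFd(long blockId) throws IOException;
  }

  private final Map<Long, FileDescriptor> fdsByBlockId =
      new HashMap<Long, FileDescriptor>();

  synchronized FileDescriptor getOrRequest(long blockId, FdRequester requester)
      throws IOException {
    FileDescriptor fd = fdsByBlockId.get(blockId);
    if (fd == null) {
      // Only the first open of a block pays the FD-passing cost; subsequent
      // random reads against the same block reuse the cached descriptor.
      fd = requester.requestFd(blockId);
      fdsByBlockId.put(blockId, fd);
    }
    return fd;
  }
}
{code}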

{quote}
There are other problems.  How would the datanode clients and the server decide
on a socket path?  If it asks every time prior to connecting, that could be
slow.  If the DFSClient cached this socket path, how long should it cache it
before expiring the cache?  What happens if the administrator does not properly
set up the socket path, as discussed earlier?  What happens if the
administrator wants to put multiple DataNodes on the same node?
{quote}
Per above, slowness here is not a concern, since we only need to do the socket-passing on
file open. HDFS applications generally open a file once and then perform many many reads against
the same block before opening the next block.

As for how the socket path is communicated, why not do it via an RPC? For example, in your
solution #3, we're already using an RPC to communicate a cookie. Instead of a cookie, that RPC
could just return the DN's abstract namespace socket name. (You seem to propose this under
solution #3 below, but reject it here in solution #1.)

Another option would be to add a new field to the DatanodeId/DatanodeRegistration: when the
client gets block locations it could also include the socket paths.
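
Purely as illustration (this field does not exist today), the identifier handed out with block locations could carry the path directly:
{code}
/**
 * Illustrative only: a made-up extension of the datanode identifier so that
 * block locations already carry the local FD-passing socket path, avoiding a
 * separate lookup RPC on every file open.
 */
class DatanodeIdWithSocketPath {
  private final String host;
  private final int xferPort;
  private final String domainSocketPath;  // hypothetical new field; null if unsupported

  DatanodeIdWithSocketPath(String host, int xferPort, String domainSocketPath) {
    this.host = host;
    this.xferPort = xferPort;
    this.domainSocketPath = domainSocketPath;
  }

  /** The client falls back to the TCP read path when this returns null. */
  String getDomainSocketPath() {
    return domainSocketPath;
  }

  String getHost() { return host; }
  int getXferPort() { return xferPort; }
}
{code}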

{quote}
The response is not a path, but a 64-bit cookie.  The DFSClient then connects
to the DN via a UNIX domain socket, and presents the cookie.  In response, he
receives the file descriptor.
{quote}

I still don't see the purpose of the cookie: it adds yet another opaque token, requires the DN
code to "publish" the file descriptor under a cookie, and leaves us with extra data structures,
cached open files, cache expiration policies, etc.

----

{quote}
Choice #3.  Blocking FdServer versus non-blocking FdServer.
Non-blocking servers in C are somewhat more complex than blocking servers.
However, if I used a blocking server, there would be no obvious way to
determine how many threads it should use.  Because it depends on how busy the
server is expected to be, only the system administrator can know ahead of time.
Additionally, many schedulers do not deal well with a large number of threads,
especially on older versions of Linux and commercial UNIX variants.
Coincidentally, these happen to be the exactly kind of systems many of our
users run.
{quote}

I don't really buy this. The socket only needs to be active long enough to pass a single fd,
which should take a few milliseconds. The number of requests for fd-passing is based on the
number of block opens, _not_ the number of reads. So a small handful of threads should be
able to handle even significant workloads just fine. We also do fine with threads on the data
xceiver path, often configured into the hundreds or thousands.
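
To make that concrete, here is a minimal sketch of a blocking server with a small fixed pool (UnixServerSock/UnixSock are hypothetical stand-ins for the JNI wrappers, not existing classes):
{code}
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch only: a blocking FD server backed by a small fixed thread pool. Each
 * request is a single short exchange (read request, sendmsg() the fd, close),
 * and requests arrive once per block open rather than once per read, so a
 * handful of threads should be plenty.
 */
class BlockingFdServer implements Runnable {
  /** Hypothetical JNI wrappers around the unix domain socket calls. */
  interface UnixServerSock { UnixSock accept() throws IOException; }
  interface UnixSock { void handleOneFdRequest() throws IOException; }

  private final UnixServerSock serverSock;
  private final ExecutorService workers = Executors.newFixedThreadPool(4);

  BlockingFdServer(UnixServerSock serverSock) {
    this.serverSock = serverSock;
  }

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        final UnixSock conn = serverSock.accept();      // blocking accept
        workers.execute(new Runnable() {
          public void run() {
            try {
              conn.handleOneFdRequest();                // read request, pass fd, close
            } catch (IOException e) {
              // a failed handoff only affects this one client; log and move on
            }
          }
        });
      } catch (IOException e) {
        break;                                          // socket closed during shutdown
      }
    }
    workers.shutdown();
  }
}
{code}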

{quote}
Another problem with blocking servers is that shutting them down can be
difficult.  Since there is no time limit on blocking I/O, a message sent to the
server to terminate may take a while, or possibly forever, to be acted on.
This may seem like a trivial or unimportant problem, but it is a very real one
in unit tests.  Socket receive and send timeouts can reduce the extra time
needed to shut down, but never quite eliminate it.
{quote}
Again, I don't buy it; we do fine with blocking IO everywhere else. Why is this context different?

----

*Wire protocol*

The wire protocol should use protobufs, so it can be evolved in the future. I also am still
against the cookie approach. I think a more sensible protocol would be the following:

0) The client somehow obtains the socket path (either by RPC to the DN or by it being part
of the DatanodeId)
1) Client connects to the socket, and sends a message which is a fully formed protobuf message,
including the block ID and block token
2) JNI server receives the message and passes it back to Java, which parses it, opens the
block, and passes file descriptors (or an error) back to JNI.
3) JNI server sends the file descriptors along with a response protobuf
4) JNI client receives the protobuf data and, on success, the fds, then forwards them back to
Java, where the protobuf decode happens

This eliminates the need for a new cookie construct, and a bunch of data structures on the
server side. The JNI code becomes stateless except for its job of accept(), read(), and write().
It's also extensible due to the use of protobufs (which is helpful if the server later needs
to provide some extra information about the format of the files, etc)

The APIs would then become something like:
{code}
class FdResponse {
  FileDescriptor[] fds;
  byte[] responseData;
}

interface FdProvider {
  FdResponse handleRequest(byte[] request);
}

class FdServer { // implementation in JNI
  FdServer(FdProvider provider);
}
{code}
(the FdServer in C calls back into the FdProvider interface to handle the requests)
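
To show how little state this leaves on the server side, here is a rough sketch of a Java-side FdProvider implementation (BlockLookup is a made-up stand-in for the datanode's own token check and block-to-file resolution; the actual request/response bytes would be the protobuf messages described above):
{code}
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Sketch of a Java-side implementation of the FdProvider interface above. All
 * policy (block token verification, path resolution) stays in Java; the JNI
 * layer just shuttles bytes and file descriptors.
 */
class BlockFdProvider implements FdProvider {
  /** Hypothetical: verifies the block token in the request and resolves the block file path. */
  interface BlockLookup {
    String blockFileFor(byte[] requestProto) throws IOException;
  }

  private final BlockLookup lookup;

  BlockFdProvider(BlockLookup lookup) {
    this.lookup = lookup;
  }

  public FdResponse handleRequest(byte[] request) {
    FdResponse resp = new FdResponse();
    try {
      // NB: the stream must stay open until the JNI layer has sent the fd.
      FileInputStream in = new FileInputStream(lookup.blockFileFor(request));
      resp.fds = new FileDescriptor[] { in.getFD() };
      resp.responseData = new byte[0];             // a success protobuf would go here
    } catch (IOException e) {
      resp.fds = null;
      resp.responseData = e.toString().getBytes(); // an error protobuf would go here
    }
    return resp;
  }
}
{code}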

----

*Security:*

One question: if you use the autobinding in the abstract namespace, does that prevent a later
attacker from explicitly picking the same name? I.e., is the "autobind" namespace fully distinct
from the explicitly-bound one? Otherwise I'm afraid of the following attack:

- malicious client acts like a reader, finds out the socket path of the local DN
- client starts a while-loop, trying to bind that same socket
- DN eventually restarts due to normal operations. client seizes the socket
- other clients on that machine have the cached socket path and get man-in-the-middled

I think this could be somewhat mitigated by using the credentials-checking facility of domain
sockets: the client can know the proper uid of the datanode, and verify against that. Otherwise
perhaps there is some token-like scheme we could use. But we should make sure this is foolproof.

If the above is indeed a possibility, I'm not sure the abstract namespace buys us anything
over using a well-defined path (eg inside /var/run), since we'd need to do the credentials
check anyway.
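
A minimal sketch of that uid check (peerUid() here stands in for a hypothetical JNI helper that reads SO_PEERCRED / getpeereid() on the connected socket, not an existing API):
{code}
/**
 * Sketch of the credentials check discussed above: before trusting the socket,
 * the client verifies that the peer is running as the uid the datanode is
 * expected to run as.
 */
class PeerCredentialCheck {
  /** Hypothetical JNI wrapper around the connected unix domain socket. */
  interface UnixSock {
    int peerUid();   // e.g. via SO_PEERCRED on Linux, getpeereid() on BSD
  }

  static void verifyPeerIsDatanode(UnixSock sock, int expectedDatanodeUid) {
    int actualUid = sock.peerUid();
    if (actualUid != expectedDatanodeUid) {
      throw new SecurityException("unix socket peer uid " + actualUid
          + " does not match expected datanode uid " + expectedDatanodeUid);
    }
  }
}
{code}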

----

*Overall thoughts*

I agree in general with your arguments about the complexity of the unix socket approach and
its non-portability, but let me also bring to the table a couple arguments _for_ it:

As history shows, I was originally one of the proponents of the short-circuit read approach
(in HDFS-347, etc). It provided much better performance than loopback TCP, especially for
random read. For sequential read, it also uses much less CPU: in particular I see a lot of
system CPU being used by loopback TCP sockets for both read and write on the non-short-circuit
code paths. Short circuit avoids this.

But from my experience with HDFS-2246, and from thinking about some other upcoming improvements,
there are a couple of downsides as well:

1) Metrics: because the client gains full control of the file descriptors, the datanode no
longer knows anything about the amount of disk IO being used by each of its clients, or even
the total aggregate. This makes the metrics under-reported, and I don't see any way around
it.

In addition to throughput metrics, we'll also end up lacking latency statistics against the
local disk in the datanode. We now have latency percentiles for disk access, and that will
be useful to identify dying/slow disks for applications like HBase. We don't get that with
short-circuit.

2) Fault handling: if there's an actual IO error on the underlying disk, the client is the
one who sees it instead of the datanode. This means that the datanode won't necessarily mark
the disk as bad. We should figure out how to address this for the short-circuit path (e.g.
the client could send an RPC to the DN asking it to move a given block to the front of
the block scanner queue upon hitting an error; a rough sketch of that fallback follows this list).

3) Client thickness: there has recently been talk of changing the on-disk format in the datanode,
for example to introduce inline checksums in the block data. Currently, such changes would
be datanode-only, but with short circuit the client also needs an update. We need to ensure
that whatever short circuit code we write, we have suitable fallbacks: e.g. by transferring
some kind of disk format version identifier in the RPC, and having the DN reject the request
if the client won't be able to handle the storage format.

4) Future QoS enhancements: currently there is no QoS/prioritization construct within the
datanode, but as mixed workload clusters become more popular (e.g. HBase + MR) we would like
to be able to introduce QoS features. If all IO goes through the datanode, then this is far
more feasible. With short-circuit, once we hand out an fd, we've lost all ability to throttle
a client.
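
Regarding point 2 above, here is a rough sketch of the client-side fallback (reportPossiblyBadBlock() is a hypothetical RPC, not an existing protocol method):
{code}
import java.io.IOException;

/**
 * Sketch of the fault-handling idea from point 2: on a local short-circuit read
 * error, tell the DN to re-scan the block and fall back to the normal TCP path
 * so the datanode gets a chance to observe the failure itself.
 */
class ShortCircuitReadWithFallback {
  /** Hypothetical RPC asking the DN to move this block to the front of its block scanner queue. */
  interface DatanodeRpc {
    void reportPossiblyBadBlock(long blockId) throws IOException;
  }
  interface BlockReader {
    int read(byte[] buf, int off, int len) throws IOException;
  }

  static int readWithFallback(long blockId, BlockReader shortCircuit, BlockReader overTcp,
                              DatanodeRpc datanode, byte[] buf) throws IOException {
    try {
      return shortCircuit.read(buf, 0, buf.length);
    } catch (IOException e) {
      datanode.reportPossiblyBadBlock(blockId);   // let the DN check and possibly mark the disk
      return overTcp.read(buf, 0, buf.length);    // retry through the datanode
    }
  }
}
{code}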

So, I'm not _against_ the fd passing approach, but I think these downsides should be called
out in the docs, and if there's any way we can think of to mitigate some of them, that would
be good to consider.

I'd also be really interested to see data on performance of the existing datanode code running
on either unix sockets or on a system patched with the new "TCP friends" feature which eliminates
a lot of the loopback overhead. According to some benchmarks I've read about, it should cut
CPU usage by a factor of 2-3, which might make the win of short-circuit much smaller. Has
anyone done any prototypes in this area? The other advantage of this approach (non-short-circuit
unix sockets instead of TCP) is that it would improve performance of the write pipeline as
well, where I currently see a ton of overhead due to TCP in the kernel.

                
> Add support for unix domain sockets to JNI libs
> -----------------------------------------------
>
>                 Key: HADOOP-6311
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6311
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: native
>    Affects Versions: 0.20.0
>            Reporter: Todd Lipcon
>            Assignee: Colin Patrick McCabe
>         Attachments: 6311-trunk-inprogress.txt, design.txt, HADOOP-6311.014.patch, HADOOP-6311.016.patch,
HADOOP-6311.018.patch, HADOOP-6311.020b.patch, HADOOP-6311.020.patch, HADOOP-6311.021.patch,
HADOOP-6311.022.patch, HADOOP-6311-0.patch, HADOOP-6311-1.patch, hadoop-6311.txt
>
>
> For HDFS-347 we need to use unix domain sockets. This JIRA is to include a library in
common which adds a o.a.h.net.unix package based on the code from Android (apache 2 license)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
