hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Payne (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7163) WebHdfsFileSystem should retry reads according to the configured retry policy.
Date Thu, 17 Dec 2015 22:25:46 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Eric Payne updated HDFS-7163:
    Attachment: HDFS-7163.005.patch

[~daryn], thank you very much for your in-depth analysis and helpful comments!


1. We may want to defer the open until a read occurs.
I have done that in this patch, but it caused some of the unit tests to fail because they
were expecting the input stream to be open after the {{fs.open()}} call. I had to change these
- {{FSXAttrBaseTest}}
- {{TestAuditLogs}}
- {{TestWebHdfsFileSystemContract}}
- {{TestWebHdfsTokens}}

My question is, are we comfortable that nothing is depending on the current behavior?

bq. 2. {{runnerState}} ... should just be initialized as DISCONNECTED.

bq. 3. If read(...) throws an IOE due to an explicitly closed stream, will retries occur?
No. The check for the explicitly closed state happens outside of the retry logic.

bq. 4. In {{connect(URL)}}, Calling it {{cachedConnection}} would clarify its purpose.

5. In getResponse:
    5.1. Should {{initializeInputStream}} be unconditionally invoked inside the prior null
check on connection? Ie. Is there ever a case when it shouldn't be initialized when a new
connection is made?
    5.2. I think the logic should be if (conn != cachedConnection) { cachedConnection = conn;
in = initializeInputStream(cachedConnection) }
If the connection is not cached, initialization always needs to happen. However, the converse
is not true. That is, even if connection is cached, initialization still may need to happen.

For a seek, the connection is cached into {{cachedConnection}} by {{ReadRunner#read}} after
invoking the {{URLRunner}} to make the connection. The {{URLRunner}} is used rather than the
{{ReadRunner}} so that {{AbstractRunner#connect}} can be told that the URL has already been
redirected. On the ohter hand, for a regular read (non-seek case), the {{ReadRunner#connect}}
makes the connection, but {{cachedConnection}} isn't cached until {{eadRunner#getResponse}}
because we want {{validateResponse}} to be run before caching the connection.

So, in {{ReadRunner#getResponse}}, in the seek case, {{cachedConnection}} will be non-null,
but the input stream ({{in}}) will be null. In the regular read case, both will be null.

So, I took out the check for if {{cachedConnection}} is null and always cache it, but I kept
the check for if {{in}} is null. I realize that {{cachedConnection}} doesn't always need to
be cached, but the performance cost is small and it makes the code cleaner.

bq.    5.3. Should use URL#getAuthority instead of explicitly extracting and joining the host
and port.

bq. 6. In ReadRunner#initializeInputStream has a misspelled "performznt".

bq. 7. In {{closeInputStream}}, I'd use {{IOUtils.closeStream}} to ensure the close doesn't
throw which would prevent the stream state from being updated.
I replaced {{in.close()}} with {{IOUtils.close(cachedConnection)}}. Is that what you meant?

bq. 8. In general the state management isn't clear. DISCONNECTED vs SEEK appear to be the
same, with the exception that SEEK allows the connection to be reopened. When errors occur
and the stream is DISCONNECTED, are you sure it will retry/recover in all cases?
I've done quite a bit of manual testing in a full cluster with reasonably substantial files
(16GB). Can you be more specific about your concerns?

As far as each state is concerned, SEEK and DISCONNECTED are a little different than your
comment. Let me try to explain in a little more detail
Connection is closed programmatically by ReadRunner after an exception has occurred. {{ReadRunner}}
will attempt to open a new connection if it is retried while in this state.
Connection has been successfully established by {{ReadRunner}}. This occurs after the input
stream has been initialized.
{{ReadRunner}} will only be put in this state if the user code has explicitly called seek().
{{ReadRunner}} will use this state as a trigger to perform a redirected connection (as I have
discussed above in my reply to your point, 5.1). Once the connection is established and the
input stream is initialized, the {{RunnerState}} will move to OPEN. Retries will not be attempted
while in this state. If an IOException occurs while {{URLRunner}} is attempting to open a
redirected connection, {{ReadRunner}} will move to the DISCOMMECTED state and retry via the
normal read path.
{{ReadRunner}} is put in this state when user code has explicitly called close().

Also, as part of this patch, I added a {{RunnerState}} parameter to the {{closeInputStream}}
method. These two are not necessarily tied together, but it does make it clearer (at least
in my mind) which state {{ReadRunner}} will be moving to as a result of the action. If you
are uncomfortable with that, I can separate them out.


1. In testWebHdfsReadRetries
    1.1. A 5m timeout seems overly generous for something that will hopefully fail much faster
if there's a problem.
Changed to 90 seconds.

bq.    1.2. Why the 5s safemode extension? Seems like it will unnecessarily slow down the
I think that was just from code I copied from the {{MiniDfsCluster}} test. I removed it.

bq.    3.The healthy check on dfs appears redundant since cluster#waitActive() already checked.
I removed it.

bq. 2. The client shouldn't just give up and do nothing for InvalidToken. It's supposed to
try and get another token and retry the read. It's unclear if that actually happens or if
the position is retaining correctly?
In a unit test, the client does nothing for InvalidToken exception because security is turned
off. I updated the test to mock a token being replaced and then verified that retries happened.

Kind of a side note: this had to be verified outside of the retry policy verification that
I already had mocked up. {{runWithRetry}} doesn't use the retry policy when handling {{InvalidToken}}
exceptions. Instead, it tries to replace the token and, if it succeeds, retries the connection.
{{runWithRetry}} will retry as many times as that process succeeds with no upper limit. So,
I mocked replacing the token twice and then verified that {{ReadRunner#getResponse}} was called
3 times.

bq. 1.3. May consider more mockito verifies.
TestWebHDFS has been enhanced.

> WebHdfsFileSystem should retry reads according to the configured retry policy.
> ------------------------------------------------------------------------------
>                 Key: HDFS-7163
>                 URL: https://issues.apache.org/jira/browse/HDFS-7163
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: webhdfs
>    Affects Versions: 3.0.0, 2.5.1
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>         Attachments: HDFS-7163-branch-2.003.patch, HDFS-7163-branch-2.004.patch, HDFS-7163-branch-2.7.003.patch,
HDFS-7163-branch-2.7.004.patch, HDFS-7163.001.patch, HDFS-7163.002.patch, HDFS-7163.003.patch,
HDFS-7163.004.patch, HDFS-7163.005.patch, WebHDFS Read Retry.pdf
> In the current implementation of WebHdfsFileSystem, opens are retried according to the
configured retry policy, but not reads. Therefore, if a connection goes down while data is
being read, the read will fail and the read will have to be retried by the client code.
> Also, after a connection has been established, the next read (or seek/read) will fail
and the read will have to be restarted by the client code.

This message was sent by Atlassian JIRA

View raw message