hadoop-hdfs-issues mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9104) DFSInputStream goes into infinite loop
Date Fri, 18 Sep 2015 14:42:05 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron McCurry updated HDFS-9104:
--------------------------------
    Description: 
I recently came across a bug that causes an infinite loop in the DFSClient.  I have experienced
this issue in Hadoop 2.5.0, and it appears to be present in 2.6.0 as well.

The bug is hard to reproduce; it seems to occur only when the NameNode is under heavy load,
which makes me think it is a timing issue.

On the client side, a small file (hundreds of bytes or so) is written and then sync() is called.
 We use the deprecated sync() because the code is set up to cross-compile against Hadoop 1 and
Hadoop 2.  After sync() returns, the output stream is closed in another thread, asynchronously
to the writing thread, because the close call can be very time consuming.

Once the sync has happened and the output stream has been handed off to the closing thread, the
writing thread turns around and reads back the data it has just written and synced.  When this
happens, I believe the client reads the file length from the NameNode, and that length appears
to still be 0 (more on that in a moment).
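
For illustration, the client-side pattern looks roughly like the sketch below.  This is not our
actual application code: the path, payload size, and thread handling are placeholders, and the
positioned read at the end reflects my assumption about how the read reaches fetchBlockByteRange.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncThenReadRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hdfs-9104-repro");   // placeholder path

        final FSDataOutputStream out = fs.create(path);
        out.write(new byte[300]);   // a few hundred bytes, as in the report
        out.sync();                 // deprecated sync(), kept for Hadoop 1/2 cross-compilation

        // Hand the close off to another thread because close() can be very slow.
        new Thread(new Runnable() {
            public void run() {
                try {
                    out.close();
                } catch (IOException e) {
                    // ignored in this sketch
                }
            }
        }).start();

        // The writing thread immediately turns around and reads the file back.
        // Under NameNode pressure the reported length can still be 0 here, and
        // the positioned read below hangs inside DFSInputStream's retry loop.
        FSDataInputStream in = fs.open(path);
        byte[] buf = new byte[1];
        in.read(0L, buf, 0, 1);     // never returns when the bug is hit
        in.close();
    }
}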

Once the input stream is open and the first byte is being read, the DFSInputStream goes into an
infinite loop.  The cause appears to be error-handling logic that does not handle all IOExceptions.

fetchBlockByteRange  => https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L991

The loop occurs in the fetchBlockByteRange method, which catches all IOExceptions and simply
calls the actualGetFromOneDataNode method again, assuming that method handles everything correctly.

actualGetFromOneDataNode  => https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1025

Inside the while loop in actualGetFromOneDataNode, getBlockAt is called, and it throws an
IOException that the actualGetFromOneDataNode method does not handle (the real issue).

actualGetFromOneDataNode calls getBlockAt =>
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1040

getBlockAt => https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L406

The getBlockAt method checks that the position to read is within the file length, which I
believe is still zero at this point.  This is where I believe the IOException is thrown.

IOException => https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L413

Because the IOException is not handled in the actualGetFromOneDataNode method, and
fetchBlockByteRange blindly calls actualGetFromOneDataNode again and again, the infinite loop
is created.
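
To make the control flow easier to follow, here is a heavily simplified paraphrase of the loop
described above.  It is not the real DFSInputStream code: the method names mirror the report,
but the bodies (the fake getBlockAt, the zero file length) are stand-ins that only show why a
permanent failure spins forever.

import java.io.IOException;

public class RetryLoopSketch {

    // Stand-in for the length the client got from the NameNode: still 0.
    private long fileLength = 0;

    // Mirrors fetchBlockByteRange: catch every IOException and just retry.
    void fetchBlockByteRange(long start, long end) throws IOException {
        while (true) {
            try {
                actualGetFromOneDataNode(start, end);
                return;                       // success path, never reached here
            } catch (IOException e) {
                // All IOExceptions land here and the method simply retries.
                // Nothing changes the condition that caused the failure, so a
                // permanent failure (like the stale zero length) loops forever.
            }
        }
    }

    // Mirrors actualGetFromOneDataNode: calls getBlockAt inside its loop but
    // does not catch the IOException getBlockAt throws.
    void actualGetFromOneDataNode(long start, long end) throws IOException {
        while (true) {
            getBlockAt(start);                // throws; nothing here catches it
            // ... connect to a DataNode and read the byte range ...
            return;
        }
    }

    // Mirrors getBlockAt: rejects offsets at or beyond the known file length,
    // which is still zero, so any read of the first byte throws.
    void getBlockAt(long offset) throws IOException {
        if (offset < 0 || offset >= fileLength) {
            throw new IOException("offset " + offset
                + " is outside the file length " + fileLength);
        }
    }

    public static void main(String[] args) throws IOException {
        // Reading the first byte of a "zero length" file spins forever.
        new RetryLoopSketch().fetchBlockByteRange(0, 1);
    }
}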

My current workaround is to wait until the file length is properly reported by the NameNode
before opening the file.  That is likely the correct choice regardless, but I think the client
should never go into an infinite loop during an error condition.
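
For reference, the workaround amounts to something like the sketch below.  The retry count and
sleep interval are arbitrary values I picked for illustration; getFileStatus() and getLen() are
the standard FileSystem calls.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class WaitForLength {

    // Poll the NameNode until it reports a non-zero length (or we give up),
    // and only then open the file.
    static void waitForReportedLength(FileSystem fs, Path path)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < 100; attempt++) {
            if (fs.getFileStatus(path).getLen() > 0) {
                return;                      // NameNode now reports the real length
            }
            Thread.sleep(100);               // back off and ask again
        }
        throw new IOException("NameNode never reported a non-zero length for " + path);
    }
}

Once that loop returns, fs.open(path) proceeds as usual.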



> DFSInputStream goes into infinite loop
> --------------------------------------
>
>                 Key: HDFS-9104
>                 URL: https://issues.apache.org/jira/browse/HDFS-9104
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.5.0, 2.6.0
>            Reporter: Aaron McCurry
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
