hadoop-hdfs-issues mailing list archives

From "James Clampffer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13745) libhdfs++: Fix race in FileSystem destructor
Date Thu, 23 Aug 2018 19:06:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590702#comment-16590702 ]

James Clampffer commented on HDFS-13745:

Thanks for checking this out [~anatoli.shein].
{quote}Is there a possibility that some task executed by the IoService will run forever?
{quote}
Yes.  Someone can pass in a callback that does a long sleep or busy wait.  There are plenty
of comments that say you should never pass in a callback that can block for an indeterminate
amount of time; there's nothing the library can do if someone chooses to ignore those.  All
of the internal tasks that the library runs in the IoService context have timeouts to prevent
them from running forever.
{quote}Should we add some timeout in BlockingStop method if we have been waiting too long?
{quote}
No. BlockingStop is only there to prevent a thread self-join (and only blocks if that's what
would happen otherwise).  The only thing exiting the loop early can do is let the self-join
happen. On the surface it looks like you could spawn another thread and run the dtor there,
but then you're stuck with a similar issue when it comes to managing the lifetime of that
thread.
{quote}In the hdfs_ioservice_test in longRunningCallback we sleep for just 1 second, which
might not be enough since if there is some sort of system delay longer than 1 second the test
might fail. Even though with any amount of sleep there is a chance of this happening, it might
make sense to increase it to 2-3 seconds.
{quote}
Increasing the sleep to 2 or 3 seconds seems just as arbitrary as a 1-second sleep. I'll see
if I can get rid of the sleep by adding an extra condition variable.
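
A rough sketch of how a condition variable can stand in for a fixed sleep in a test like this. It is not the actual hdfs_ioservice_test code: Latch and RunCallbackWithoutSleep are hypothetical names, and a plain std::thread stands in for the IoService worker.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot latch: one side calls Signal(), the other Wait().
class Latch {
 public:
  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate covers spurious wakeups and the case where
    // Signal() ran before Wait() was entered.
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Stand-in for the test body. Returns true once the callback has
// provably started and then been allowed to finish, with no sleep.
bool RunCallbackWithoutSleep() {
  Latch started;  // signaled by the callback once it is running
  Latch release;  // signaled by the test to let the callback finish
  bool callback_finished = false;

  std::thread worker([&] {
    started.Signal();
    release.Wait();  // plays the role of the "long running" part
    callback_finished = true;
  });

  started.Wait();              // the callback is now known to be in flight
  assert(!callback_finished);  // it cannot finish before release is signaled
  release.Signal();
  worker.join();               // join orders the worker's write before the read
  return callback_finished;
}
```

The test no longer depends on how long the scheduler takes to start the callback; it only proceeds once the callback reports that it is running.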
bq. Also, can we submit another CI run for this? Looks like the previous one didn't run for
some reason.
Yeah.  I'll do that once I add the condition variable.

> libhdfs++: Fix race in FileSystem destructor
> --------------------------------------------
>                 Key: HDFS-13745
>                 URL: https://issues.apache.org/jira/browse/HDFS-13745
>             Project: Hadoop HDFS
>          Issue Type: Task
>          Components: native
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>            Priority: Major
>         Attachments: HDFS-13745.000.patch
> Whatever happens to have the last shared_ptr to the IoService will run ~IoService when
the shared_ptr goes out of scope.  IoService's destructor is responsible for joining all
worker threads in the pool.  Most callbacks now own weak_ptr<IoService> that can be
promoted to a shared_ptr in order to post new async tasks.  If a callback object is the last
thing holding the IoService shared_ptr it's going to try to join the thread pool inside of
one of the thread pool's threads.
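
The ownership pattern described above can be reproduced in a few lines. This is only an illustration under stated assumptions: TinyService, SpinUntil and LastOwnerIsWorkerCallback are hypothetical names, the "pool" is a single std::thread, and the destructor detaches on a detected self-join purely so the sketch can exit cleanly. It is not what the attached patch does.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for IoService: one worker, joined in the destructor.
class TinyService {
 public:
  explicit TinyService(std::atomic<bool>& destroyed_on_worker)
      : destroyed_on_worker_(destroyed_on_worker) {}

  void Start(std::function<void()> task) {
    worker_ = std::thread(std::move(task));
  }

  ~TinyService() {
    if (!worker_.joinable()) return;
    if (worker_.get_id() == std::this_thread::get_id()) {
      // The destructor is running on the worker itself, so join() here
      // would be a thread joining itself. Record it and detach instead
      // (sketch-only workaround).
      destroyed_on_worker_.store(true);
      worker_.detach();
    } else {
      worker_.join();
    }
  }

 private:
  std::atomic<bool>& destroyed_on_worker_;
  std::thread worker_;
};

// Busy-wait helper; fine for a sketch, not for production code.
void SpinUntil(const std::atomic<bool>& flag) {
  while (!flag.load()) std::this_thread::yield();
}

// Returns true if the destructor ended up running on the worker thread.
bool LastOwnerIsWorkerCallback() {
  std::atomic<bool> destroyed_on_worker{false};
  std::atomic<bool> promoted{false}, dropped{false}, done{false};

  auto service = std::make_shared<TinyService>(destroyed_on_worker);
  std::weak_ptr<TinyService> weak = service;

  service->Start([weak, &promoted, &dropped, &done] {
    // The callback promotes its weak_ptr, as it would to post new work.
    std::shared_ptr<TinyService> strong = weak.lock();
    promoted.store(true);
    SpinUntil(dropped);  // the caller has released its reference
    strong.reset();      // last owner: ~TinyService runs on this thread
    done.store(true);    // last access to the caller's stack
  });

  SpinUntil(promoted);
  service.reset();       // the callback now holds the only shared_ptr
  dropped.store(true);
  SpinUntil(done);
  return destroyed_on_worker.load();
}
```

Calling LastOwnerIsWorkerCallback() returns true: once the caller drops its reference, the promoted shared_ptr inside the callback is the last owner, so the destructor and its join run on the worker thread.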

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
