hadoop-hdfs-issues mailing list archives

From "James Clampffer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9890) libhdfs++: Add test suite to simulate network issues
Date Wed, 02 Mar 2016 19:51:18 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9890:
----------------------------------
    Description: 
I propose adding a test suite to simulate various network issues/failures in order to get
good test coverage on some of the retry paths that aren't easy to hit in unit tests.

At the moment the only things that exercise the retry paths are the gmock unit tests.  Those
tests are only as good as their mock implementations, which do a good job of simulating protocol
correctness but not more complex interactions.  They also can't really simulate the kinds of
lock contention and subtle memory stomps that show up while doing hundreds or thousands of
concurrent reads.
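
For context, the existing coverage looks roughly like the minimal gmock sketch below.  The
RpcConnection/MockRpcConnection names and the method are hypothetical stand-ins, not the actual
libhdfs++ mocks; it only illustrates how a mock can force a single failure to drive a retry path,
which says nothing about behavior under heavy concurrency.

{code}
// Minimal gmock sketch of protocol-level retry coverage.  Class and method
// names are hypothetical, not the real libhdfs++ mocks.
#include <gmock/gmock.h>
#include <gtest/gtest.h>
#include <string>

using ::testing::Return;

class RpcConnection {
 public:
  virtual ~RpcConnection() = default;
  virtual int SendRequest(const std::string& payload) = 0;
};

class MockRpcConnection : public RpcConnection {
 public:
  MOCK_METHOD(int, SendRequest, (const std::string& payload), (override));
};

TEST(RetrySketch, SingleFailureTriggersOneRetry) {
  MockRpcConnection conn;
  // First call fails, second succeeds -- protocol correctness only; this cannot
  // reproduce lock contention from thousands of concurrent reads.
  EXPECT_CALL(conn, SendRequest("getBlockLocations"))
      .WillOnce(Return(-1))
      .WillOnce(Return(0));

  int status = conn.SendRequest("getBlockLocations");
  if (status != 0) status = conn.SendRequest("getBlockLocations");
  EXPECT_EQ(status, 0);
}
{code}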

I'd like to make a standalone "bring your own cluster" test suite that can do things like
drop connections, slow connections down, and cause connections to hang for short periods of
time.  I think this should be a standalone test for a few reasons:
-The tools for doing this sort of thing are platform dependent.  On Linux it looks like it
could be done with iptables, but I'm not sure about macOS or Windows.  I can make a Linux
version, but I don't have enough Windows and Mac experience (or dev hardware) to be productive
there.  (A rough sketch of what the Linux approach might look like follows this list.)
-This needs to scale as large as possible on machines capable of doing it.  The CI tests
could run a dialed-back version, but the chance of hitting bugs is much lower.  There are
certain bugs that I've only been able to reproduce when running at sufficient scale.  My
laptop with 4 physical cores and 1 disk can't sustain the loads that start making lock
contention and resource ownership gaps show up; running the client on a 24-core server
against a "real" cluster tends to make issues apparent quickly.
-As mentioned above, I think some of these bugs won't show up regardless of how long they
run on low-end hardware, e.g. a typical dev workstation.  It's just not possible to get
enough parts moving at once.  I don't want people to waste time waiting for <some large number>
operations to run if it's only ever going to be running a few dozen concurrently.  I'm not
sure what sort of hardware the CI tests run on, but I don't think the rest of the Hadoop
community would appreciate a test that attempts to hog all resources for an extended period
of time.
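
As a very rough illustration of the Linux approach mentioned above, the sketch below shells out
to iptables to drop traffic and to tc/netem (a tool not named above) to add latency on the
NameNode RPC port.  The port number, device names, and helper names are assumptions for
illustration; this is the general shape of the idea, not a proposed implementation.

{code}
// Hedged sketch of Linux-only fault injection via iptables and tc (run as root).
// Port 8020 (the default NameNode RPC port) and the device names are assumptions.
#include <cstdlib>
#include <string>

// Drop inbound packets from the given port.  DROP makes the connection hang until
// the client times out; "-j REJECT --reject-with tcp-reset" would instead look
// like an abrupt disconnect.
void BlockRpcPort(int port) {
  std::string cmd = "iptables -A INPUT -p tcp --sport " + std::to_string(port) + " -j DROP";
  std::system(cmd.c_str());
}

// Remove the rule added above so the connection can recover and retries succeed.
void UnblockRpcPort(int port) {
  std::string cmd = "iptables -D INPUT -p tcp --sport " + std::to_string(port) + " -j DROP";
  std::system(cmd.c_str());
}

// Add artificial latency on a device with tc/netem to slow connections enough to
// trigger the client's timeout and retry logic.
void AddLatency(const std::string& dev, int delay_ms) {
  std::string cmd = "tc qdisc add dev " + dev + " root netem delay " +
                    std::to_string(delay_ms) + "ms";
  std::system(cmd.c_str());
}

// Restore the device's default queueing discipline.
void ClearLatency(const std::string& dev) {
  std::system(("tc qdisc del dev " + dev + " root").c_str());
}

int main() {
  BlockRpcPort(8020);    // simulate the NameNode going away
  // ... drive reads from another process here ...
  UnblockRpcPort(8020);  // let the retries succeed
  return 0;
}
{code}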

List of things to simulate (while heavily loaded), roughly in order of how badly I think they
need to be tested at the moment:
-RPC connection disconnect
-RPC connection slowed down enough to cause a timeout and trigger a retry
-DataNode (DN) connection disconnect
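
To make the "heavily loaded" part concrete, here is a hedged sketch of the sort of driver the
suite might pair with the fault injection above: many threads issuing positional reads through
the libhdfs-compatible C API against a user-supplied cluster.  The header path, NameNode
host/port, file path, and thread/iteration counts are placeholders, not part of this proposal.

{code}
// Hedged sketch of a load driver: many concurrent preads against a
// "bring your own cluster" NameNode while faults are injected externally.
// Header path, host, port, path, and counts are placeholder assumptions.
#include <hdfs/hdfs.h>   // libhdfs-compatible C API; install path may differ
#include <fcntl.h>
#include <cstdio>
#include <thread>
#include <vector>

static void ReadLoop(const char* host, int port, const char* path, int iterations) {
  hdfsFS fs = hdfsConnect(host, port);
  if (!fs) { std::fprintf(stderr, "connect failed\n"); return; }
  hdfsFile file = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
  if (!file) { hdfsDisconnect(fs); return; }

  char buf[64 * 1024];
  for (int i = 0; i < iterations; ++i) {
    // Reads hit the timeout/retry paths whenever the injector drops or delays
    // the RPC or DataNode connections underneath us.
    tSize n = hdfsPread(fs, file, /*position=*/0, buf, sizeof(buf));
    if (n < 0) std::fprintf(stderr, "pread failed on iteration %d\n", i);
  }
  hdfsCloseFile(fs, file);
  hdfsDisconnect(fs);
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 256; ++t)  // scale this up on bigger hardware
    threads.emplace_back(ReadLoop, "namenode.example.com", 8020, "/tmp/testfile", 10000);
  for (auto& th : threads) th.join();
  return 0;
}
{code}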

The initial motivation for filing this is that I've hit a bug twice (ever) where the RPC
engine can't match the call id on a response with a request it sent out.  I have a guess as to
what's causing it, but not enough info to post a meaningful JIRA (I haven't ruled out something
else in the process stomping on libhdfs++ memory).
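
For reference, the matching step that bug lives in looks conceptually like the sketch below:
each outgoing RPC registers a call id, and every response should map back to a pending request.
The names and map type here are illustrative only, not the actual libhdfs++ RpcEngine internals.

{code}
// Conceptual sketch of call-id matching; names and types are illustrative,
// not the real libhdfs++ RpcEngine data structures.
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

class PendingCalls {
 public:
  using Callback = std::function<void(const std::string& response)>;

  void RegisterRequest(int32_t call_id, Callback cb) {
    pending_[call_id] = std::move(cb);
  }

  // The bug described above manifests here: a response arrives whose call id
  // has no matching pending request, which should never happen normally.
  void HandleResponse(int32_t call_id, const std::string& response) {
    auto it = pending_.find(call_id);
    if (it == pending_.end()) {
      std::cerr << "no pending request for call id " << call_id << "\n";
      return;
    }
    it->second(response);
    pending_.erase(it);
  }

 private:
  std::map<int32_t, Callback> pending_;
};

int main() {
  PendingCalls calls;
  calls.RegisterRequest(7, [](const std::string& r) { std::cout << r << "\n"; });
  calls.HandleResponse(7, "ok");      // matched
  calls.HandleResponse(8, "orphan");  // the unexplained case described above
  return 0;
}
{code}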

  was:
I propose adding a test suite to simulate various network issues/failures in order to get
good test coverage on some of the retry paths that aren't easy to hit in unit tests.

At the moment the only things that exercise the retry paths are the gmock unit tests.  Those
tests are only as good as their mock implementations, which do a good job of simulating protocol
correctness but not more complex interactions.  They also can't really simulate the kinds of
lock contention and subtle memory stomps that show up while doing hundreds or thousands of
concurrent reads.

I'd like to make a standalone "bring your own cluster" test suite that can do things like
drop connections, slow connections down, and cause connections to hang for short periods of
time.  I think this should be a standalone test for a few reasons:
-The tools for doing this sort of thing are platform dependent.  On Linux it looks like it
could be done with iptables, but I'm not sure about macOS or Windows.
-This needs to scale as large as possible on machines capable of doing it.  The CI tests
could run a dialed-back version, but the chance of hitting bugs is much lower.  There are
certain bugs that I've only been able to reproduce when running at sufficient scale.  My
laptop with 4 physical cores and 1 disk can't sustain the loads that start making lock
contention and resource ownership gaps show up; running the client on a 24-core server
against a "real" cluster tends to make issues apparent quickly.
-As mentioned above, I think some of these bugs won't show up regardless of how long they
run on low-end hardware, e.g. a typical dev workstation.  It's just not possible to get
enough parts moving at once.  I don't want people to waste time waiting for <some large number>
operations to run if it's only ever going to be running a few dozen concurrently.  I'm not
sure what sort of hardware the CI tests run on, but I don't think the rest of the Hadoop
community would appreciate a test that attempts to hog all resources for an extended period
of time.

List of things to simulate (while heavily loaded), roughly in order of how badly I think they
need to be tested at the moment:
-RPC connection disconnect
-RPC connection slowed down enough to cause a timeout and trigger a retry
-DataNode (DN) connection disconnect

The initial motivation for filing this is that I've hit a bug twice (ever) where the RPC
engine can't match the call id on a response with a request it sent out.  I have a guess as to
what's causing it, but not enough info to post a meaningful JIRA (I haven't ruled out something
else in the process stomping on libhdfs++ memory).


> libhdfs++: Add test suite to simulate network issues
> ----------------------------------------------------
>
>                 Key: HDFS-9890
>                 URL: https://issues.apache.org/jira/browse/HDFS-9890
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
