airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Meickle (JIRA)" <>
Subject [jira] [Created] (AIRFLOW-5238) Outage caused by interaction between SFTPSensor, SSHHook, and paramiko ProxyCommand
Date Fri, 16 Aug 2019 15:09:00 GMT
James Meickle created AIRFLOW-5238:

             Summary: Outage caused by interaction between SFTPSensor, SSHHook, and paramiko
                 Key: AIRFLOW-5238
             Project: Apache Airflow
          Issue Type: Bug
          Components: hooks
    Affects Versions: 1.10.4
            Reporter: James Meickle

We had a major outage last night due to an extremely subtle bug that requires a bit of explanation.
Please bear with me...

We have an extensive DAG for moving financial data files between vendors. One of these vendor
SFTPs is only accessible via certain subnets, so we use an SSH ProxyCommand to pass through
a bastion first. The existence of this bastion is a difference from staging, where we directly
connect to a mock SFTP (AWS Transfer).

After a recent code deployment, our production Airflow cluster began exhibiting intermittent
connectivity issues to a specific vendor SFTP. This connectivity issue affected _all_ SFTP
connections to the vendor, regardless of specific operator or sensor class. Some connections
would work, but most would fail with:
paramiko.ssh_exception.SSHException: No existing session
We first looked at the diff of changes to the DAG code. There was no apparent "shared" or
"global" configuration that would affect all connections to the vendor; for example, we didn't
change any hook classes.

One of our engineers built a minimal test case that instantiated an SFTP operator and immediately
executed it. This was run on the Airflow production server, and therefore had the same user,
virtualenv, metadata db, connection, and so on. This test case did not reproduce the original

We implemented a remote debugger to attach to the failing operators. The intermittent failure
still occurred when stepping through a debugger, though the underlying error was revealed
to be a Paramiko socket timeout. We didn't get much clarity on what was specifically failing.

We tried going a layer lower and just executing the SSH {{ProxyCommand}} in a terminal. This
also ran as expected manually, connecting to the vendor SSH through the bastion and receiving
an SSH banner awaiting change to SFTP protocol.

However, after _repeatedly_ running the manual command, we noticed that in some cases we immediately
received a server EOF, just before getting an SSH banner. These failures would be batched;
the command would work from all instances, then fail several times in a row (only from a single
instance), and then resume working. We began to suspect that we were being throttled.

We began to investigate resource use, wondering whether there was a process leak. We did notice
several old SSH processes running on the workers. Since they were marked as defunct, we would
not expect them to be holding connections open; just in case, we cycled all of the worker
servers. This did not resolve the situation.

At this point we took a break to manually run our ETL commands for the evening, and to preemptively
ask our vendor whether they had recently instituted throttling on their SFTP server. In the
morning we were still impacted by this issue and resumed debugging.

One of our engineers tried inspecting the state of our Airflow _web_ server and noticed that
it had many open connections to the remote SFTP. This turned out to be the critical clue that
narrowed it down to two likely causes: a hook that was initiating a connection outside of
{{get_conn}}, or an operator that was misusing a hook.

I began tracing through our custom code and did not find either of those problems. However,
in examining Airflow core I was able to find the code path that led to this outage:
 - We deploy new DAG tasks, which includes a {{SFTPSensor}}.
 - The SFTPSensor instantiates an SFTPHook in {{__init__}} instead of {{execute}}: []
 - The SFTPHook is a {{SSHHook}} subclass and uses its {{__init__}}: []
 - In the {{__init__}} method of the {{SSHHook}}, if an SSH config file is present and there's
a matching {{ProxyCommand}}, a new {{paramiko.ProxyCommand}} will be created: []
 - In the {{__init__}} method of {{paramiko.ProxyCommand}}, the SSH {{ProxyCommand}} gets
_executed immediately_ rather than deferred until use: []

Taken together, every DAG parsing attempt, including on scheduler and web instances, was creating
multiple SSH connections resulting in our getting throttled (by either the bastion or the
vendor SFTP). We didn't see this on staging because we don't use an SSH {{ProxyCommand}} there.

This outage was the result of multiple intersecting coincidences, but it's not the first time
we've seen hook issues in Airflow. To mitigate the blast radius on similar issues, I think
Airflow should:

- Standardize and document that operators only instantiate hooks at execute-time.
- Fix the SFTPSensor's issues with the above
- Standardize and document that hooks can't make any connections outside of get_conn.
- Fix the SSHHook's use of ProxyCommand
- File an upstream bug against Paramiko to make their ProxyCommand behavior less surprising,
and to improve logging of it. (I will do this.)

This message was sent by Atlassian JIRA

View raw message