airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Davydov <>
Subject Re: Airflow Logging Improvements
Date Thu, 22 Jun 2017 18:38:18 GMT
I don't think Allison's PR fixes logging, but it's a step in the right
direction. The current approach creates an abstraction around reading logs,
whereas the final solution should define an interface for writings to logs
in addition to reading logs (which could indeed use something like for the writing
part). I agree we should move the logging towards something like log4j
(with context awareness of task id/dag id/execution date/attempt #). If
there are incompatibilities with this approach and the log4j solution (or
reasons why it would be difficult to port the PR over to the final model),
we should definitely address those concerns, but otherwise I still feel
this is a step in the right direction.

The concept of "attempt" is needed regardless of logging, the way retries
are stored/handled right now is not very sane.
- Old TI state is permanently deleted
- In the task logs you get "Try 1/6"... 2/6... 3/6... 1/6... 2/6 in logs
which doesn't make sense (if it's the Nth time the task is running it
should be logged as the Nth time). I recall other strange behaviors in
these log lines too (maybe something like Try 4/3).
- The "primary key" for a task instance run is not complete (which is what
Allison's logging change needs), you could say that TaskInstance should
only keep track of the latest TaskInstance run, but we still want to store
all tries for a task instance somewhere in the database, and we still need
to key this off of "attempt".

On Thu, Jun 22, 2017 at 11:00 AM, Allison Wang <>

> Hi Bolke,
> I agree that we should make logging configurable lbut I wouldn't think
> using handlers like python-elasticsearch-logger is a good idea over
> flushing logs into files. Here are some reasons:
>    1. Such handlers do not have the built-in backpressure-sensitive
>    protocol that can prevent overwhelm ElasticSearch.
>    2. Logs will be lost if ElasitcSearch cluster is down for reasons like
>    upgrading.
> In general, it's not a good practice for python logger talk directly to
> ElasticSearch. Flushing logs into files give us more flexibilities to use
> tools like Filebeat and Logstash to collect and index logs into
> ElasticSearch.
> Thanks,
> Allison
> On Thu, Jun 22, 2017 at 12:05 AM Bolke de Bruin <> wrote:
>> In the light of fixing logging, I would definitely appreciate written
>> design. Especially, as there have been multiple attempts to fix some issues
>> but these have been more like stop gap fixes.
>> In my opinion Airflow should not stipulate in a hard coded fashion where
>> and how logging takes place. It should behave more like ‘log4j’
>> configurations. So it should not just use “dag_id + task+id +
>> execution_date” and write this to an arbitrary location on the filesystem.
>> I could imagine a settings file “logging.conf” that setups something like
>> this:
>> [logger_scheduler]
>> level = INFO
>> handler = stderr
>> qualname = airflow.scheduler
>> formatter=scheduler_formatter
>> In airflow.cfg it should allow setting something like this:
>> [scheduler]
>> use_syslog = True
>> syslog_log_facility = LOG_LOCAL0
>> To allow logging to syslog so it can be moved to a centralised location
>> if required (syslog being a special case afaik).
>> Elasticsearch and any other backend can then just be a handler and we can
>> remove the custom stuff that is proposed in PR
>> incubator-airflow/pull/2380 <
>> incubator-airflow/pull/2380> by
>> python-elasticsearch-logger <
>> python-elasticsearch-logger> for example.
>> I then can be convinced to add something like “attempt”, but probably
>> there are more friendly ways to solve it at that time. In addition
>> ‘attempts' should then imho not be managed by the task or cli, but rather
>> by the executor as that is the process which “attempts” a task.
>> Bolke.
>> > On 22 Jun 2017, at 01:21, Dan Davydov <> wrote:
>> >
>> > Responding to some of Bolke's concerns in the github PR for this change:
>> >
>> > > Mmm still not convinced. Especially on elastic search it is just
>> easier to use the start_date to shard on.
>> > sharding on start_date isn't great because there is still some risk of
>> collisions and it means that we are coupling the primary key with
>> start_date unnecessarily (e.g. hypothetically you could allow two tasks to
>> run at the same in Airflow and in this case start_date would no longer be a
>> valid primary key), using monotonically increasing IDs for DB entries like
>> this is pretty standard practice.
>> >
>> > > In addition I'm very against the managing of log files this way. Log
>> files are already a mess and should be refactored to be consistent and to
>> be managed from one place.
>> > I agree about the logging mess, and there seem to have been efforts
>> attempting to fix this but they have all been abandoned so we decided to
>> move ahead with this change. I need to take a look at the PR first, but
>> this change should actually make logging less messy, since it should add an
>> abstraction for logging modules, and because you know exactly which try
>> numbers (and how many) ran on which workers from the file path. The log
>> folder structure already kind of mimicked the primary key of the
>> task_instance table (dag_id + task_id + execution_date), but really
>> try_number logically belongs in this key as well (at least for the key for
>> log files).
>> >
>> > > The docker packagers can already not package airflow correctly
>> without jumping through hoops. Arbitrarily naming it certainly does not
>> help here.
>> > If this is referring to the /<ATTEMPT #>/ in the path, I don't think
>> this is arbitrarily naming it. A log "unit" really should be a single task
>> run (not an arbitrary grouping of a variable number of multiple runs), and
>> each unit should have a unique key or location. One of the reasons we are
>> working on this effort is to actually make Airflow play nicer with
>> Kubernetes/Docker (since airflow workers should ideally be ephemeral), and
>> allowing a separate service to read and ship the logs is necessary in this
>> case since the logs will be destroyed along with the worker instance. I
>> think in the future we should also allow custom logging modules (e.g.
>> directly writing logs to some service).
>> >
>> >
>> > On Wed, Jun 21, 2017 at 3:11 PM, Allison Wang <
>> <>> wrote:
>> > Hi,
>> >
>> > I am in the process of making airflow logging backed by Elasticsearch
>> (more detail please check AIRFLOW-1325 <
>> jira/browse/AIRFLOW-1325>). Here are several more logging improvements
>> we are considering:
>> >
>> > 1. Log streaming. Auto-refresh the logs if tasks are running.
>> >
>> > 2. Separate logs by attempts.
>> >
>> > Instead of logging everything into one file, logs can be separated by
>> attempt number and displayed using tabs. Attempt number here is a
>> monotonically increasing number that represents each task instance run
>> (unlike try_number, clear task instance won't reset attempt number).
>> > try_number: n^th retry by the task instance. try_number should not be
>> greater than retries. Clear task will set try_number to 0.
>> > attempt: number of times current task instance got executed.
>> >
>> > 3. Collapsable logs. Collapse logs that are mainly for debugging
>> airflow internal and aren't really related to users' tasks (for example,
>> logs showed before "starting attempt 1 of 1")
>> >
>> > All suggestions are welcome.
>> >
>> > Thanks,
>> > Allison
>> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message