airflow-dev mailing list archives

From George Leslie-Waksman <geo...@cloverhealth.com.INVALID>
Subject Re: Task partitioning using Airflow
Date Wed, 09 Aug 2017 17:10:39 GMT
Airflow is best for situations where you want to run different tasks that
depend on each other or process data that arrives over time. If your goal
is to take a large dataset, split it up, and process chunks of it, there
are probably other tools better suited to your purpose.

Off the top of my head, you might consider Dask:
https://dask.pydata.org/en/latest/ or using Celery directly:
http://www.celeryproject.org/
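
For a concrete sense of the Dask route, here is a minimal sketch of that
fan-out: split the training data by a partition key and fit one model per
partition in parallel. The partition keys and the load_partition and
train_model functions below are illustrative placeholders for your own
code; only the dask delayed/compute calls are real Dask API.

    from dask import compute, delayed

    PARTITION_KEYS = ["north", "south", "east", "west"]  # illustrative keys

    def load_partition(key):
        # Placeholder: read just this partition's rows from your data store.
        return {"key": key, "rows": []}

    def train_model(data):
        # Placeholder: fit one model on a single partition's data.
        return ("model-for", data["key"])

    # Build a lazy task graph: one load -> train chain per partition key.
    tasks = [delayed(train_model)(delayed(load_partition)(k))
             for k in PARTITION_KEYS]

    # Run the graph. The default scheduler parallelizes locally; pointing
    # Dask at a distributed scheduler fans the same graph out across
    # machines.
    results = compute(*tasks)

Celery gives you the same fan-out shape, but with explicit task queues and
long-running workers instead of a lazy task graph.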

--George

On Wed, Aug 9, 2017 at 9:52 AM Ashish Rawat <ashish.rawat@myntra.com> wrote:

> Hi - Can anyone please provide some pointers for this use case with
> Airflow?
>
> --
> Regards,
> Ashish
>
>
>
> > On 03-Aug-2017, at 9:13 PM, Ashish Rawat <ashish.rawat@myntra.com> wrote:
> >
> > Hi,
> >
> > We have a use case where we are running some R/Python-based data
> > science models, which execute on a single node. The execution time of
> > the models is constantly increasing, and we are now planning to split
> > the model training by a partition key and distribute the workload over
> > multiple machines.
> >
> > Does Airflow provide some simple way to split a task into multiple
> > tasks, each of which works on a specific value of the key?
> >
> > --
> > Regards,
> > Ashish
> >
> >
> >
>
>
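
On the question quoted above: Airflow can express a per-key fan-out by
generating one task per partition key when the DAG file is parsed. A
minimal sketch, assuming the Airflow 1.x PythonOperator; the key list and
the training callable are illustrative placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    PARTITION_KEYS = ["north", "south", "east", "west"]  # illustrative keys

    def train_for_key(partition_key):
        # Placeholder: run the R/Python model training for one partition.
        print("training partition %s" % partition_key)

    dag = DAG(
        dag_id="partitioned_training",
        start_date=datetime(2017, 8, 1),
        schedule_interval="@daily",
    )

    # One task per partition key; with the CeleryExecutor these tasks are
    # picked up by separate workers, spreading the load across machines.
    for key in PARTITION_KEYS:
        PythonOperator(
            task_id="train_%s" % key,
            python_callable=train_for_key,
            op_kwargs={"partition_key": key},
            dag=dag,
        )

The caveat is that the set of keys must be known when the DAG file is
parsed; keys computed inside an upstream task cannot create new tasks at
run time.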
