airflow-dev mailing list archives

From Chris Riccomini <criccom...@apache.org>
Subject Re: Hadoop tasks - File Already Exists Exception
Date Mon, 16 May 2016 15:43:53 GMT
Hey Jelez,

Based on your stack trace, it sounds like you're using S3 as an HDFS
replacement for Hadoop. S3, by default, will allow you to overwrite a
file--your T2 shouldn't have an issue if it's using S3 directly:

http://stackoverflow.com/questions/9517198/can-i-update-an-existing-amazon-s3-object
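
For instance, a plain PUT through boto3 just replaces whatever is already at
the key -- this is only a sketch, and the bucket/key names are placeholders:

import boto3

s3 = boto3.client('s3')

# Writing to a key that already exists does not raise an error;
# S3 simply replaces the old object with the new one.
s3.put_object(Bucket='my-bucket', Key='exports/part-00000',
              Body=b'new contents')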

However, given that you're interacting with S3 through Hadoop, it looks to
me like it's Hadoop that's preventing you from overwriting.

I am not terribly familiar with Sqoop, but perhaps it has an "overwrite"
option? If not, then I can't really think of a way to handle this that's
idempotent, unless you couple the two operations together in a bash script,
as you described. Perhaps someone else has some ideas.

Cheers,
Chris

On Mon, May 16, 2016 at 8:25 AM, Raditchkov, Jelez (ETW) <
Jelez.Raditchkov@nike.com> wrote:

> Thank you Chris!
>
> I wanted to keep all the tasks within the DAG so it is transparent, which
> seems like "the right" way to do it. That is, I have one task for the
> cleanup and a separate one for executing sqoop.
>
> If I understand your response correctly, I have to make a bash or python
> wrapper script that deletes the S3 file and then runs sqoop, i.e. combine
> T1 and T2. That seems hacky to me, since those are different pieces of
> functionality handled by different types of operators. By this logic I
> could just combine all my tasks into a single script and have a DAG with a
> single task.
>
> Please advise if I am getting something wrong.
>
>
>
> -----Original Message-----
> From: Chris Riccomini [mailto:criccomini@apache.org]
> Sent: Monday, May 16, 2016 7:43 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Hadoop tasks - File Already Exists Exception
>
> Hey Jelez,
>
> The recommended way to handle this is to make your tasks idempotent. T2
> should overwrite the S3 file, not fail if it already exists.
>
> Cheers,
> Chris
>
> On Sun, May 15, 2016 at 11:42 AM, Raditchkov, Jelez (ETW) <
> Jelez.Raditchkov@nike.com> wrote:
>
> > I am running several dependent tasks:
> > T1 - delete the S3 folder
> > T2 - sqoop from the DB to the S3 folder
> >
> > The problem: if T2 fails in the middle, every retry then gets: Encountered
> > IOException running import job:
> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> > s3://...
> >
> > Is there a way to reattempt a group of tasks, not only T2? The way it
> > is now, the DAG fails because the S3 folder exists (it was created by
> > the failed T2 attempt), so the DAG can never succeed.
> >
> > Any suggestions?
> >
> > Thanks!
> >
> >
>
