pig-dev mailing list archives

From Hersh Shafer <Hersh.Sha...@amdocs.com>
Subject RE: Problem when running our code with tez
Date Thu, 27 Aug 2015 08:45:15 GMT
+Shiri

-----Original Message-----
From: Daniel Dai [mailto:daijy@hortonworks.com] 
Sent: Wednesday, August 26, 2015 1:57 AM
To: dev@tez.apache.org; dev@pig.apache.org
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez

JobID is vague in Tez; you should use dagId instead. However, I don't see a way you can get
the dagId within RecordWriter/OutputCommitter. A possible solution is to use conf.get("mapreduce.workflow.id")
+ conf.get("mapreduce.workflow.node.name"). Note both are Pig-specific configuration properties and
are only applicable if you run with Pig.
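
For illustration, here is a minimal sketch of that workaround, assuming Pig has set the two
workflow properties (the helper name stableDirName is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.JobContext;

    // Derive a directory name that is stable across all Tez vertices of
    // the same Pig job, using the Pig-set workflow properties instead of
    // the (vertex-suffixed) job ID.
    public static String stableDirName(JobContext context) {
        Configuration conf = context.getConfiguration();
        String workflowId = conf.get("mapreduce.workflow.id");        // set by Pig
        String nodeName   = conf.get("mapreduce.workflow.node.name"); // set by Pig
        if (workflowId == null || nodeName == null) {
            // Not running under Pig: fall back to the plain job ID.
            return context.getJobID().toString();
        }
        return workflowId + "_" + nodeName;
    }

Since TaskAttemptContext extends JobContext, the same helper can be called from both
RecordWriter.close() and OutputCommitter.commitJob().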

Daniel




On 8/25/15, 2:08 PM, "Hitesh Shah" <hitesh@apache.org> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question but should give you some 
>background info. When Pig uses Tez, it may end up running multiple DAGs 
>within the same YARN application, so the "jobId" (which, in the MR case, 
>maps to the application Id from YARN) may not be unique.
>Furthermore, there are cases where multiple vertices within the same 
>DAG could write to HDFS, hence both dagId and vertexId are required to 
>guarantee uniqueness when writing to a common location.
> 
>thanks
>-- Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <Shiri.Marron@amdocs.com> wrote:
>
>> Hi,
>> 
>> We are trying to run our existing workflows, which contain Pig 
>>scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are 
>>facing some problems when we run our code with Tez.
>> 
>> In our code, we write to and read from a temp directory that we 
>>create with a name based on the jobID:
>>    Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in 
>>close() we take the jobID from the TaskAttemptContext. That is, each 
>>task writes a file to this directory in the close() method, according 
>>to the jobID from the context.
>>    Part 2 - At the end of the whole job (after all the tasks have 
>>completed), our custom outputCommitter (which extends 
>>org.apache.hadoop.mapreduce.OutputCommitter) looks, in its commitJob(), 
>>for that job's directory and handles all the files under it; the jobID 
>>is taken from the JobContext: context.getJobID().toString()
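
For context, a rough sketch of the two call sites described above (the class name, base
path, and file layout are hypothetical, not the actual code from this thread):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Hypothetical helpers showing where each part reads the jobID.
    public final class TempDirPaths {

        private static final String BASE = "/tmp/ourapp"; // hypothetical

        // Part 1: used from RecordWriter.close(TaskAttemptContext).
        // Under Tez, getJobID() here carries the vertex suffix.
        public static Path taskFile(TaskAttemptContext context) {
            return new Path(BASE + "/" + context.getJobID().toString(),
                            context.getTaskAttemptID().toString());
        }

        // Part 2: used from OutputCommitter.commitJob(JobContext).
        // Here getJobID() is the plain job ID, so under Tez this
        // directory differs from the one the tasks wrote to in Part 1.
        public static Path jobDir(JobContext context) {
            return new Path(BASE + "/" + context.getJobID().toString());
        }
    }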
>> 
>> 
>> 
>> We noticed that when we use Tez, this mechanism doesn't work, since 
>>the jobID from the Tez task (part 1) is the original job id combined 
>>with the vertex id, for example 14404914675610 instead of 1440491467561. 
>>So the directory name in part 2 is different from the one in part 1.
>> 
>> 
>> We looked for a way to retrieve only the vertex id or only the job 
>>id, but didn't find one: in the configuration, the property 
>>mapreduce.job.id also had the vertex id appended, and no other property 
>>value was equal to the original job id.
>> 
>> Can you please advise how we can solve this issue? Is there a way to 
>>get the original jobID when we're in part 1?
>> 
>> Regards,
>> Shiri Marron
>> Amdocs
>> 


