Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Thu, 16 Mar 2017 17:42:41 +0000 (UTC)
From: "Daniel Dai (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13043729.1487273433000.40218.1489686161761@Atlassian.JIRA>
In-Reply-To: <JIRA.13043729.1487273433000@Atlassian.JIRA>
References: <JIRA.13043729.1487273433000@Atlassian.JIRA> <JIRA.13043729.1487273433913@jira-lw-us.apache.org>
Subject: [jira] [Updated] (HIVE-15947) Enhance Templeton service job
 operations reliability
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Thu, 16 Mar 2017 17:42:51 -0000


     [ https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated HIVE-15947:
------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.2.0
           Status: Resolved  (was: Patch Available)

+1ed on RB.

The precommit test run failed to publish its result. There is one unrelated failure: org.apache.hive.service.server.TestHS2HttpServer.testContextRootUrlRewrite. All other tests pass. Link: https://builds.apache.org/job/PreCommit-HIVE-Build/4180/

Patch pushed to master. Thanks Subramanyam, Kiran!

> Enhance Templeton service job operations reliability
> ----------------------------------------------------
>
>                 Key: HIVE-15947
>                 URL: https://issues.apache.org/jira/browse/HIVE-15947
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Subramanyam Pattipaka
>            Assignee: Subramanyam Pattipaka
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15947.10.patch, HIVE-15947.2.patch, HIVE-15947.3.patch, HIVE-15947.4.patch, HIVE-15947.6.patch, HIVE-15947.7.patch, HIVE-15947.8.patch, HIVE-15947.9.patch, HIVE-15947.patch
>
>
> Currently the Templeton service doesn't restrict the number of job operation requests. It simply accepts and tries to run all operations. If a large number of concurrent job submit requests arrive, the time to submit job operations can increase significantly. Templeton uses HDFS to store the staging files for a job. If HDFS storage can't respond to a large number of requests and throttles them, job submission can take a very long time, on the order of minutes.
> This behavior may not be suitable for all applications. Client applications may want either a predictable, low response time for successful requests, or a throttle response telling the client to wait for some time before re-requesting the job operation.
> In this JIRA, I am trying to address the following job operations:
> 1) Submit new Job
> 2) Get Job Status
> 3) List jobs
> These three operations have different complexity because they vary in their use of cluster resources such as YARN and HDFS.
> The idea is to introduce a new config, templeton.job.submit.exec.max-procs, which controls the maximum number of concurrent active job submissions within Templeton, and to use this config to keep response times under control. If a new job submission request sees that there are already templeton.job.submit.exec.max-procs jobs being submitted concurrently, the request will fail with HTTP error 503 with the reason
>    “Too many concurrent job submission requests received. Please wait for some time before retrying.”
>
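
The admission check described above could be sketched roughly as follows. This is an illustration only, not the code from the HIVE-15947 patch: the class JobRequestThrottle, its execute method, and the TooManyRequestsException type are hypothetical names, and a java.util.concurrent.Semaphore is assumed as the counting mechanism. A limit of 0 is treated as "no limit", matching the default described in this issue.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative sketch only; names are hypothetical, not taken from the patch. */
final class JobRequestThrottle {
    static final int HTTP_SERVICE_UNAVAILABLE = 503;
    static final String BUSY_MESSAGE =
        "Too many concurrent job submission requests received. "
        + "Please wait for some time before retrying.";

    /** Hypothetical exception carrying the HTTP status to send back. */
    static final class TooManyRequestsException extends RuntimeException {
        final int httpStatus = HTTP_SERVICE_UNAVAILABLE;

        TooManyRequestsException(String message) {
            super(message);
        }
    }

    // null means "no limit", mirroring a max-procs value of 0.
    private final Semaphore permits;

    JobRequestThrottle(int maxProcs) {
        this.permits = maxProcs > 0 ? new Semaphore(maxProcs) : null;
    }

    /** Runs the operation if a slot is free; otherwise rejects it immediately. */
    <T> T execute(Supplier<T> operation) {
        if (permits == null) {
            return operation.get();
        }
        // tryAcquire() never blocks: a full pool means an immediate rejection.
        if (!permits.tryAcquire()) {
            throw new TooManyRequestsException(BUSY_MESSAGE);
        }
        try {
            return operation.get();
        } finally {
            permits.release();
        }
    }
}
```

Rejecting immediately rather than queueing is what keeps response times predictable: a request either gets a slot at once or learns at once that it should back off.
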
> The client is expected to catch this response and retry after waiting for some time. The default value for the config templeton.job.submit.exec.max-procs is ‘0’, which means that by default job submission requests are always accepted. The throttling behavior needs to be enabled explicitly, based on requirements.
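
On the client side, catching the 503 response and retrying might look like the sketch below. The issue does not prescribe a wait strategy, so the doubling delay here is an assumption, and ThrottledSubmitRetry and submitWithRetry are hypothetical names. The submitOnce callback stands in for whatever HTTP call the client makes and is assumed to return the HTTP status code.

```java
import java.util.function.IntSupplier;

/** Illustrative client-side sketch; names and backoff policy are assumptions. */
final class ThrottledSubmitRetry {
    private static final int HTTP_SERVICE_UNAVAILABLE = 503;

    /**
     * Calls submitOnce until it returns something other than 503 or until
     * maxAttempts calls have been made. Returns the last status seen.
     */
    static int submitWithRetry(IntSupplier submitOnce, int maxAttempts, long initialDelayMillis) {
        long delay = initialDelayMillis;
        int status = submitOnce.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == HTTP_SERVICE_UNAVAILABLE; attempt++) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            delay *= 2; // assumed policy: wait twice as long after each rejection
            status = submitOnce.getAsInt();
        }
        return status;
    }
}
```

A caller that still sees 503 after the last attempt should treat the submission as not accepted.
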
> We can have similar behavior for the Status and List operations with the configs templeton.job.status.exec.max-procs and templeton.list.job.exec.max-procs, respectively.
> Once a job operation is started, it can take a long time. The client that requested the job operation may not be willing to wait indefinitely. This work introduces the configurations
> templeton.exec.job.submit.timeout
> templeton.exec.job.status.timeout
> templeton.exec.job.list.timeout
> to specify the maximum amount of time a job operation can execute. If a timeout happens, list and status job requests return to the client with the message
> "List job request got timed out. Please retry the operation after waiting for some time."
> If a submit job request times out, then:
>       i) The job submit request thread that receives the timeout checks whether a valid job id has been generated for the job request.
>       ii) If one has been generated, it issues a kill job request on the cancel thread pool. It doesn't wait for the kill operation to complete and returns to the client with a timeout message.
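
The timeout handling for submit requests could be sketched as below. This is a minimal illustration under stated assumptions, not the patch itself: SubmitWithTimeoutSketch, JobSubmitter, and JobTimeoutException are hypothetical names, the timeout message text is made up, and interrupting the submit worker on timeout is an added assumption. The submitter is assumed to publish the job id into the holder as soon as one exists, so the request thread can check for it after the timeout.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Illustrative sketch only; names are hypothetical, not taken from the patch. */
final class SubmitWithTimeoutSketch {
    /** Stand-in for the real submit call; sets the job id once it is known. */
    interface JobSubmitter {
        void submit(AtomicReference<String> jobIdHolder) throws Exception;
    }

    /** Hypothetical exception reported to the client on timeout. */
    static final class JobTimeoutException extends RuntimeException {
        JobTimeoutException(String message) {
            super(message);
        }
    }

    private final ExecutorService submitPool = Executors.newCachedThreadPool();
    private final ExecutorService cancelPool = Executors.newCachedThreadPool();

    String submit(JobSubmitter submitter, Consumer<String> killJob, long timeoutMillis) {
        AtomicReference<String> jobId = new AtomicReference<>();
        Future<Object> pending = submitPool.submit(() -> {
            submitter.submit(jobId);
            return null;
        });
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return jobId.get();
        } catch (TimeoutException e) {
            pending.cancel(true); // assumption: interrupt the submit worker
            String created = jobId.get();
            if (created != null) {
                // Best-effort kill on a separate pool; the result is not awaited.
                cancelPool.execute(() -> killJob.accept(created));
            }
            throw new JobTimeoutException("Submit job request timed out (hypothetical message text).");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for job submission", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Job submission failed", e.getCause());
        }
    }

    /** Stops both pools, giving queued kill requests a short time to finish. */
    void shutdown() {
        submitPool.shutdownNow();
        cancelPool.shutdown();
        try {
            cancelPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the kill is queued and never awaited, the client gets its timeout response promptly, which is also why a job can outlive that response for a while.
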
> Side effects of enabling timeouts for submit operations:
> 1) A job may remain active for some time after the client gets the timeout response, and a list operation from the client could potentially show the newly created job before it gets killed.
> 2) Killing the job is best effort, with no guarantees. This means a duplicate job may be created. One possible cause is a case where the job is created and the operation then times out, but the kill request fails because the resource manager is unavailable. When the resource manager restarts, it will restart the job that was created.
> Fixing this scenario is outside the scope of this JIRA. The job operation functionality should be enabled only if the above side effects are acceptable.

