hive-dev mailing list archives

From "Marcelo Vanzin (JIRA)" <>
Subject [jira] [Commented] (HIVE-8956) Hive hangs while some error/exception happens beyond job execution[Spark Branch]
Date Tue, 25 Nov 2014 19:28:13 GMT


Marcelo Vanzin commented on HIVE-8956:

This is ok if it unblocks something right now. For the code, I'd suggest using {{System.nanoTime()}}
to calculate durations, since it's monotonic. And use {{long}} instead of {{int}}.
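A minimal sketch of that suggestion, assuming a hypothetical helper class named {{MonotonicTimeout}} (it is not part of the patch or of Hive). It measures elapsed time with {{System.nanoTime()}} and keeps every duration in a {{long}}:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not taken from the HIVE-8956 patch.
final class MonotonicTimeout {
    private final long startNanos;   // reading of the monotonic clock at creation
    private final long timeoutNanos; // timeout length, kept as long to avoid int overflow

    MonotonicTimeout(long timeout, TimeUnit unit) {
        this.startNanos = System.nanoTime();
        this.timeoutNanos = unit.toNanos(timeout);
    }

    // Elapsed time is a difference of two nanoTime() readings, so it is not
    // affected by wall-clock adjustments the way currentTimeMillis() can be.
    long elapsedNanos() {
        return System.nanoTime() - startNanos;
    }

    boolean isExpired() {
        return elapsedNanos() >= timeoutNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        MonotonicTimeout timeout = new MonotonicTimeout(50, TimeUnit.MILLISECONDS);
        Thread.sleep(100);
        System.out.println("expired: " + timeout.isExpired());
    }
}
```

Comparing the difference against the timeout, rather than comparing two absolute {{nanoTime()}} readings, is what keeps the check valid even if the counter wraps.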

But I think a better approach is needed here. Currently the {{JobSubmitted}} message seems
to only be sent when you use Spark's async APIs to submit a Spark job. A remote client job
that does not use those APIs would never generate that message. Also, the backend uses a thread
pool to execute jobs - so if you're queueing up multiple jobs, you may hit this timeout.

I think we need more fine-grained remote client-level events for tracking job progress. For example,
adding {{JobReceived}} and {{JobStarted}} messages would be a good start ({{JobResult}} already
covers the "job finished" case). I think these two extra messages should be enough to cover
the problems described in this bug.
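As a rough illustration of how a client could use such events, here is a sketch under assumptions: the class {{JobProgressTracker}}, its {{State}} enum, the string job ids and the handler method names are all hypothetical and are not the actual remote client API. Only the message names {{JobReceived}}, {{JobStarted}} and {{JobResult}} come from the discussion above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side tracker, for illustration only.
final class JobProgressTracker {
    // Lifecycle in the order the messages are expected to arrive.
    enum State { SUBMITTED, RECEIVED, STARTED, FINISHED }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Called locally when the client sends the job to the remote context.
    void onSubmitted(String jobId) { states.put(jobId, State.SUBMITTED); }

    // Handlers for the proposed JobReceived / JobStarted messages
    // and for the existing JobResult message.
    void onJobReceived(String jobId) { advance(jobId, State.RECEIVED); }
    void onJobStarted(String jobId)  { advance(jobId, State.STARTED); }
    void onJobResult(String jobId)   { advance(jobId, State.FINISHED); }

    State stateOf(String jobId) { return states.get(jobId); }

    // True while the remote side has not acknowledged the job at all;
    // this is the only phase where a short client-side timeout is safe.
    boolean awaitingAck(String jobId) { return states.get(jobId) == State.SUBMITTED; }

    // Only move forward, so a late or reordered message cannot regress the state.
    // Messages for unknown job ids are ignored.
    private void advance(String jobId, State next) {
        states.computeIfPresent(jobId,
            (id, current) -> next.ordinal() > current.ordinal() ? next : current);
    }

    public static void main(String[] args) {
        JobProgressTracker tracker = new JobProgressTracker();
        tracker.onSubmitted("job-1");
        tracker.onJobReceived("job-1");
        System.out.println(tracker.stateOf("job-1"));
    }
}
```

With this split, a job that sits in the backend's thread pool queue is {{RECEIVED}} but not yet {{STARTED}}, so the client can tell "queued" apart from "never arrived" and apply a timeout only to the latter.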

> Hive hangs while some error/exception happens beyond job execution[Spark Branch]
> --------------------------------------------------------------------------------
>                 Key: HIVE-8956
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Rui Li
>              Labels: Spark-M3
>         Attachments: HIVE-8956.1-spark.patch
> The remote Spark client communicates with the remote Spark context asynchronously. If an error/exception
is thrown during job execution in the remote Spark context, it is wrapped and sent back
to the remote Spark client. But if an error/exception is thrown outside job execution, such as
when job serialization fails, the remote Spark client never learns what is going on in the remote Spark
context, and it hangs there.
> Setting a timeout on the remote Spark client side may not be a great idea, as we are not sure how
long the query will run in the Spark cluster. We need to find a way to check whether the job has failed (over its whole
life cycle) in the remote Spark context.

This message was sent by Atlassian JIRA
