hive-user mailing list archives

From Ryan Harris <Ryan.Har...@zionsbancorp.com>
Subject RE: Mappers spawning Hive queries
Date Mon, 18 Apr 2016 22:38:18 GMT
I'm not aware of any particular reason that this shouldn't inherently work, but for debugging
purposes I'd be wondering about the nested environment variables related to the hadoop job. The
bash shell where you are trying to launch the subsequent hive queries already has hadoop job
environment variables declared in its environment by the parent streaming job, and I can't
say for sure that there wouldn't be conflicts there. So while I don't know of any reason
that it definitely won't work, I do know that you are venturing into uncharted territory and
may uncover unexpected edge cases.
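
One way to test the environment-variable theory is to launch the nested hive process with a
scrubbed environment. The sketch below is illustrative only. Hadoop streaming exports the job
configuration into the task's environment with dots replaced by underscores (for example
`mapreduce_job_id`), but the prefix list and the helper name `scrubbed_env` are assumptions,
not a verified list of the variables that would conflict.

```python
# Assumed prefixes of variables injected by the parent streaming job.
# Adjust them after inspecting the real task environment (e.g. by logging it).
ASSUMED_JOB_PREFIXES = ("mapreduce_", "mapred_")


def scrubbed_env(env, prefixes=ASSUMED_JOB_PREFIXES):
    """Return a copy of env without variables whose names start with a prefix."""
    return {k: v for k, v in env.items() if not k.startswith(prefixes)}
```

The result could then be passed to the child process, e.g.
`subprocess.run(["hive", "-e", query], env=scrubbed_env(dict(os.environ)))`. If the nested
query still hangs with a clean environment, the inherited variables are probably not the cause.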


From: Shirish Tatikonda [mailto:shirish.tatikonda@gmail.com]
Sent: Monday, April 18, 2016 3:44 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

I am using Hive 1.2.1 with MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am in fact restructuring
the approach.

However, I would like to understand what is going on. In my test run, each hive query computes
a distinct on a toy table of 10 records -- so we are definitely not running into problems like
resource contention. Also, I increased the (streaming) mappers' task timeout (to 1 hr) so
that the shell script (i.e., the hive query) has ample time to finish. So, architecturally,
is there something that prevents us from spawning such nested MR jobs -- a streaming MR job
spawning multiple hive queries that in turn spawn MR jobs?

Shirish


On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris <Ryan.Harris@zionsbancorp.com> wrote:
My $0.02....

If you are running multiple concurrent queries on the data, you are probably doing it wrong
(or at least inefficiently), although this somewhat depends on what type of files are backing
your hive warehouse.

Let's assume that your data is NOT backed by ORC/parquet files, and that you are NOT using
Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest piece. So, with your workflow,
each hive query is going to need to read ALL of the source data to resolve the query. It
would be much more efficient if you could write a more complex query that reads the source
data once (instead of once per parallel operation). Additionally, from an efficiency
perspective, running queries in parallel might only improve performance if each of your
queries requires fewer map tasks than the total capacity of your cluster; otherwise
it would generally be more efficient to run your queries in series.
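
The capacity argument can be made concrete with a toy model. This is a rough sketch under
strong assumptions that are not from the thread: all map tasks take equal time, the cluster
runs a fixed number of map tasks at once, and scheduling overhead, reducers, and I/O
contention are ignored. The function names are made up for the illustration.

```python
import math


def waves(map_tasks, capacity):
    """Rounds needed to run map_tasks when capacity tasks can run at once."""
    return math.ceil(map_tasks / capacity)


def serial_waves(queries, capacity):
    # Queries run one after another, so each pays for its own rounds.
    return sum(waves(q, capacity) for q in queries)


def parallel_waves(queries, capacity):
    # Queries run together and share the available slots.
    return waves(sum(queries), capacity)


small = [20, 20, 20]     # each query needs fewer map tasks than the cluster holds
large = [200, 200, 200]  # each query alone already fills the cluster twice
for name, queries in (("small", small), ("large", large)):
    print(name, serial_waves(queries, 100), parallel_waves(queries, 100))
```

In this model the small queries finish in 1 round in parallel versus 3 in series, while the
large queries need 6 rounds either way, so parallelism buys nothing once each query
saturates the cluster on its own.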

If you schedule the work in series, and things get backed up, the job will still eventually
complete.  If you attempt to do TOO much work in parallel, all of the jobs will start timing
out and then everything will fail.

There may be a valid reason for approaching the problem the way that you are, but I'd encourage
you to look at restructuring your approach to the problem to save you more headaches down
the road.

From: Shirish Tatikonda [mailto:shirish.tatikonda@gmail.com]
Sent: Monday, April 18, 2016 2:00 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

Hi John,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel. Each entity is associated
with its own data set. The processing involves a few hive queries to do joins and aggregations,
which is followed by some code in Python. My thought process is to put the hive queries and
python invocation in a shell script, and invoke the shell script on multiple entities in parallel
through a streaming mapreduce job.
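
For readers trying to picture the setup, here is a minimal sketch of such a mapper, written
in Python rather than as the shell script actually used. The table `entity_events`, the
column names, and the tab-separated input layout are hypothetical. `hive -e` and `--hivevar`
are standard Hive CLI options. Passing `stdin=subprocess.DEVNULL` is only a precaution so the
child does not share the mapper's input stream; it is not a diagnosed cause of the hang.

```python
import subprocess
import sys


def parse_entity(line):
    """Take the first tab-separated field of a streaming input line as the entity id."""
    return line.rstrip("\n").split("\t", 1)[0]


def hive_command(entity):
    """Build the argv for one nested hive query (table and columns are hypothetical)."""
    query = (
        "SELECT DISTINCT col FROM entity_events "
        "WHERE entity_id = '${hivevar:entity}'"
    )
    return ["hive", "--hivevar", f"entity={entity}", "-e", query]


def main(stdin=sys.stdin):
    for line in stdin:
        if not line.strip():
            continue
        entity = parse_entity(line)
        result = subprocess.run(
            hive_command(entity),
            stdin=subprocess.DEVNULL,  # keep the child off the mapper's input
            capture_output=True,
            text=True,
        )
        # Forward hive's log output to the task's stderr for debugging.
        sys.stderr.write(result.stderr)
        # Emit one record per entity: id and hive exit code.
        print(f"{entity}\t{result.returncode}")
```

To use it as a streaming mapper, call `main()` at the bottom of the script. Forwarding the
child's stderr to the task log makes it easier to see where a nested query stalls.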

Shirish


On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfranke@gmail.com> wrote:
Just out of curiosity, what is the use case behind this?

How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatikonda@gmail.com> wrote:
>
> Hello,
>
> I am trying to run multiple hive queries in parallel by submitting them through a map-reduce
> job.
> More specifically, I have a map-only hadoop streaming job where each mapper runs a shell
> script that does two things -- 1) parses input lines obtained via streaming; and 2) submits
> a very simple hive query (via hive -e ...) with parameters computed from step-1.
>
> Now, when I run the streaming job, the mappers seem to be stuck and I don't know what
> is going on. When I looked on resource manager web UI, I don't see any new MR Jobs (triggered
> from the hive query). I am trying to understand this behavior.
>
> This may be a bad idea to begin with, and there may be better ways to accomplish the
> same task. However, I would like to understand the behavior of such a MR job.
>
> Any thoughts?
>
> Thank you,
> Shirish
>

________________________________
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain
information that is privileged and exempt from disclosure under applicable law. If you are
neither the intended recipient nor responsible for delivering the message to the intended
recipient, please note that any dissemination, distribution, copying or the taking of any
action in reliance upon the message is strictly prohibited. If you have received this communication
in error, please notify the sender immediately. Thank you.

