hive-user mailing list archives

From Ryan Harris
Subject RE: Mappers spawning Hive queries
Date Mon, 18 Apr 2016 20:31:43 GMT
My $0.02....

If you are running multiple concurrent queries on the data, you are probably doing it wrong
(or at least inefficiently), although this depends somewhat on what type of files are backing
your Hive warehouse.

Let's assume that your data is NOT backed by ORC/Parquet files, and that you are NOT using
Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest part. With your workflow, each Hive
query is going to need to read ALL of the source data to resolve the query. It would be much
more efficient if you could write one more complex query that reads the source data 1 time
(instead of once for each parallel operation you are running). Additionally, from an efficiency
perspective, running queries in parallel is only likely to improve performance if each of your
queries requires fewer map tasks than the total capacity of your cluster. Otherwise it would
generally be more efficient to run your queries in series.
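
To make the "read the source data 1 time" idea a bit more concrete, here is a minimal sketch
in Python that assembles a Hive multi-table insert: a single FROM clause feeding several
INSERT ... SELECT branches, which Hive can typically serve from one pass over the source. The
table names, columns, and the Output / build_multi_insert helpers are hypothetical and only
for illustration. Whether your joins and aggregations can actually be folded into one
statement like this depends on your data.

```python
from typing import List, NamedTuple


class Output(NamedTuple):
    """One branch of the multi-table insert (all names are hypothetical)."""
    dest_table: str
    select_list: str
    where: str
    group_by: str = ""


def build_multi_insert(source_table: str, outputs: List[Output]) -> str:
    """Build one HiveQL statement with a shared FROM clause and one
    INSERT OVERWRITE ... SELECT branch per output."""
    if not outputs:
        raise ValueError("at least one output is required")
    lines = [f"FROM {source_table}"]
    for out in outputs:
        clause = (
            f"INSERT OVERWRITE TABLE {out.dest_table} "
            f"SELECT {out.select_list} WHERE {out.where}"
        )
        if out.group_by:
            clause += f" GROUP BY {out.group_by}"
        lines.append(clause)
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    # Hypothetical example: two per-entity aggregations over the same source table.
    print(build_multi_insert("events", [
        Output("entity_a_counts", "user_id, COUNT(*)", "entity = 'a'", "user_id"),
        Output("entity_b_counts", "user_id, COUNT(*)", "entity = 'b'", "user_id"),
    ]))
```

The resulting string could be passed to a single hive -e call instead of launching one query
per entity.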

If you schedule the work in series, and things get backed up, the job will still eventually
complete.  If you attempt to do TOO much work in parallel, all of the jobs will start timing
out and then everything will fail.
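
If you do end up driving the queries from a client-side script, one way to avoid
oversubscribing the cluster is to cap how many queries run at once. Below is a minimal sketch,
assuming the hive CLI is on the PATH. The hive -e call is the only Hive-specific part;
run_bounded, max_parallel, and the injectable runner are names I made up for illustration.
With max_parallel=1 the queries run strictly in series.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def run_hive(query: str) -> int:
    """Run one query through the Hive CLI and return its exit code.
    Assumes a `hive` executable is available on the PATH."""
    return subprocess.run(["hive", "-e", query]).returncode


def run_bounded(
    queries: Iterable[str],
    max_parallel: int = 1,
    runner: Callable[[str], int] = run_hive,
) -> List[Tuple[str, int]]:
    """Run the queries with at most `max_parallel` in flight at a time.
    Returns (query, exit_code) pairs in the original submission order."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    queries = list(queries)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so results line up with queries.
        codes = list(pool.map(runner, queries))
    return list(zip(queries, codes))
```

Passing a different runner makes it possible to try the scheduling logic without a cluster,
and raising max_parallel only makes sense while each query needs fewer map tasks than the
cluster has capacity for.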

There may be a valid reason for approaching the problem the way that you are, but I'd encourage
you to look at restructuring your approach to the problem to save you more headaches down
the road.

From: Shirish Tatikonda
Sent: Monday, April 18, 2016 2:00 PM
Subject: Re: Mappers spawning Hive queries

Hi John,

1) The use case is that I have to process multiple entities in parallel. Each entity is associated
with its own data set. The processing involves a few Hive queries to do joins and aggregations,
followed by some code in Python. My thought process is to put the Hive queries and the Python
invocation in a shell script, and invoke the shell script on multiple entities in parallel
through a streaming MapReduce job.

2) The shell script is invoked in the mappers of a Hadoop streaming job.


On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke wrote:
Just out of curiosity, what is the use case behind this?

How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda wrote:
> Hello,
> I am trying to run multiple Hive queries in parallel by submitting them through a map-reduce
> job. More specifically, I have a map-only Hadoop streaming job where each mapper runs a shell
> script that does two things: 1) parses input lines obtained via streaming; and 2) submits
> a very simple Hive query (via hive -e ...) with parameters computed from step 1.
> Now, when I run the streaming job, the mappers seem to be stuck and I don't know what
> is going on. When I look at the resource manager web UI, I don't see any new MR jobs (triggered
> from the Hive query). I am trying to understand this behavior.
> This may be a bad idea to begin with, and there may be better ways to accomplish the
> same task. However, I would like to understand the behavior of such an MR job.
> Any thoughts?
> Thank you,
> Shirish
