hive-user mailing list archives

From Shirish Tatikonda <shirish.tatiko...@gmail.com>
Subject Re: Mappers spawning Hive queries
Date Mon, 18 Apr 2016 21:43:31 GMT
I am using Hive 1.2.1 with the MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am
in fact restructuring it.

However, I would like to understand what is going on. In my test run, each
hive query computes *distinct* on a toy table of 10 records -- so we
are definitely not running into problems like resource contention. Also, I
increased the (streaming) mappers' task timeout value (to 1hr) to give the
shell script (i.e., the hive query) ample time to finish. So,
architecturally, is there something that prevents us from spawning such
nested MR jobs -- a streaming MR job spawning multiple hive queries that in
turn spawn MR jobs?
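
For reference, a minimal sketch of the kind of streaming mapper described in this thread, written in Python 3 rather than shell. Everything specific here is an assumption for illustration and not the actual script: the tab-separated input layout (entity, then table name), the hypothetical column name `id`, and the 600-second per-query timeout. The `reporter:status:` lines written to stderr are the Hadoop streaming convention for reporting task status, and the child's stdin is redirected so that `hive` cannot read from the mapper's own input stream.

```python
import subprocess
import sys


def parse_line(line):
    """Split one tab-separated input line into (entity, table).

    The two-field layout is an assumption made for this sketch.
    """
    entity, table = line.rstrip("\n").split("\t")[:2]
    return entity, table


def build_hive_command(table, column="id"):
    """Build the argv for a `hive -e` call; `column` is a hypothetical name."""
    query = "SELECT DISTINCT {0} FROM {1}".format(column, table)
    return ["hive", "-e", query]


def run_query(argv, timeout_s=600):
    """Run one query; return (exit code, stdout), or (None, "") on timeout."""
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # child must not consume the mapper's input
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout.decode()


def map_stream(lines, out=sys.stdout, err=sys.stderr, runner=run_query):
    """For each input line, run one hive query and emit `entity<TAB>status`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entity, table = parse_line(line)
        # Status line in the Hadoop streaming reporter format.
        err.write("reporter:status:running hive for {0}\n".format(entity))
        err.flush()
        code, _ = runner(build_hive_command(table))
        status = "timeout" if code is None else str(code)
        out.write("{0}\t{1}\n".format(entity, status))
```

As a mapper, this would be driven by calling `map_stream(sys.stdin)`. Emitting the exit code (or `timeout`) per entity makes it possible to tell from the job output whether a nested query failed, hung, or never started.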

Shirish


On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris <Ryan.Harris@zionsbancorp.com>
wrote:

> My $0.02....
>
>
>
> If you are running multiple concurrent queries on the data, you are
> probably doing it wrong (or at least inefficiently)....although this
> somewhat depends on what type of files are backing your hive warehouse...
>
>
>
> Let's assume that your data is NOT backed by ORC/parquet files, and that
> you are NOT using Tez/Spark as your execution engine....
>
>
>
> Generally with HDFS, data I/O is going to be the slowest piece....so, with
> your workflow, each hive query is going to need to read ALL of the source
> data to resolve the query.  It would be much more efficient if you could
> write a more complex query that reads the source data once (instead
> of once per parallel operation)....Additionally, from
> an efficiency perspective, running queries in parallel might only
> improve performance if each of your queries requires fewer map tasks than
> the total capacity of your cluster....otherwise it would generally be more
> efficient to run your queries in series.
>
>
>
> If you schedule the work in series, and things get backed up, the job will
> still eventually complete.  If you attempt to do TOO much work in parallel,
> all of the jobs will start timing out and then everything will fail.
>
>
>
> There may be a valid reason for approaching the problem the way that you
> are, but I'd encourage you to look at restructuring your approach to the
> problem to save you more headaches down the road.
>
>
>
> *From:* Shirish Tatikonda [mailto:shirish.tatikonda@gmail.com]
> *Sent:* Monday, April 18, 2016 2:00 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Mappers spawning Hive queries
>
>
>
> Hi Jörn,
>
>
>
> 2) The shell script is invoked in the mappers of a Hadoop streaming job.
>
>
>
> 1) The use case is that I have to process multiple entities in parallel.
> Each entity is associated with its own data set. The processing involves a
> few hive queries to do joins and aggregations, which is followed by some
> code in Python. My thought process is to put the hive queries and python
> invocation in a shell script, and invoke the shell script on multiple
> entities in parallel through a streaming mapreduce job.
>
>
>
> Shirish
>
>
>
>
>
> On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfranke@gmail.com>
> wrote:
>
> Just out of curiosity, what is the use case behind this?
>
> How do you call the shell script?
>
>
> > On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatikonda@gmail.com>
> > wrote:
> >
> > Hello,
> >
> > I am trying to run multiple hive queries in parallel by submitting them
> > through a map-reduce job.
> > More specifically, I have a map-only hadoop streaming job where each
> > mapper runs a shell script that does two things -- 1) parses input lines
> > obtained via streaming; and 2) submits a very simple hive query (via hive
> > -e ...) with parameters computed from step-1.
> >
> > Now, when I run the streaming job, the mappers seem to be stuck and I
> > don't know what is going on. When I looked on resource manager web UI, I
> > don't see any new MR Jobs (triggered from the hive query). I am trying to
> > understand this behavior.
> >
> > This may be a bad idea to begin with, and there may be better ways to
> > accomplish the same task. However, I would like to understand the behavior
> > of such a MR job.
> >
> > Any thoughts?
> >
> > Thank you,
> > Shirish
> >
>
>
