hive-user mailing list archives

From "Ryabin, Thomas" <Tom.Rya...@McKesson.com>
Subject RE: How to make the query compiler not determine the number of reducers?
Date Mon, 30 Apr 2012 16:53:51 GMT
The query I am executing is:

select test_udf(name, store) from employees join stores;

 

My goal for this query is to run every combination of employees.name and
stores.store through my test_udf, and have Hadoop spread the computation
among the reducers. So if I have 5 rows in the "stores" table and 3 rows
in the "employees" table then there would be 15 combinations, and if I
had 3 reducers then ideally each reducer would get 5 combinations.

 

I created the tables with these commands:

create external table employees(row_key string, name string)
stored by 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
with serdeproperties ("cassandra.columns.mapping" = ":key,name",
"cassandra.ks.name" = "test",
"cassandra.cf.name" = "employees");

 

create external table stores(row_key string, store string)
stored by 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
with serdeproperties ("cassandra.columns.mapping" = ":key,store",
"cassandra.ks.name" = "test",
"cassandra.cf.name" = "stores");

 

I am using Cassandra as the storage backend. I have tried adding an ON
clause to my query, like so:

select test_udf(name, store) from employees join stores on
(employees.name = stores.store);

 

and in this case Hive creates 3 reduce tasks, but nothing gets done
because there are no matching keys. Is there a way to accomplish what I
am trying to do by using "distribute by", "cluster by", and/or bucketed
tables, or something else?

 

Thanks,

Thomas
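
One possible approach to the question above, offered as an untested sketch rather than a confirmed answer: compute the cross product in a subquery, redistribute its rows with DISTRIBUTE BY, and apply the UDF in the outer query so that it is evaluated after the redistribution. The alias t and the reducer count of 3 are illustrative assumptions, and test_udf is assumed to be registered already. The cross product itself may still run through a single reducer, hash partitioning does not guarantee an exactly even split (for example 5 rows per reducer), and Hive rejects Cartesian products when hive.mapred.mode is set to strict.

```sql
-- Request 3 reducers for stages whose reducer count is not fixed at compile time.
SET mapred.reduce.tasks=3;

-- Inner query: build every (name, store) combination and spread the rows
-- across reducers by hashing both columns.
-- Outer query: apply the UDF to the redistributed rows.
SELECT test_udf(t.name, t.store)
FROM (
  SELECT employees.name AS name, stores.store AS store
  FROM employees JOIN stores
  DISTRIBUTE BY name, store
) t;
```

Whether the UDF really ends up in the reduce phase of the second job depends on the plan Hive generates, so it is worth checking the output of EXPLAIN before relying on this.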

 

 

From: Bejoy KS [mailto:bejoy_ks@yahoo.com] 
Sent: Monday, April 30, 2012 10:15 AM
To: user@hive.apache.org
Subject: Re: How to make the query compiler not determine the number of
reducers?

 

Thomas,

Not necessarily. Increasing the number of map tasks may not have any
effect on the number of reduce tasks. We may be able to help if you
could provide some details, such as:
- the query you are executing
- the output of DESCRIBE FORMATTED for the tables involved in the query

Regards
Bejoy KS

Sent from handheld, please excuse typos.

  _____  

From: "Ryabin, Thomas" <Tom.Ryabin@McKesson.com> 

Date: Mon, 30 Apr 2012 10:06:01 -0400

To: <user@hive.apache.org>

ReplyTo: user@hive.apache.org 

Subject: RE: How to make the query compiler not determine the number of
reducers?

 

I tried using this to set the number of reduce tasks to 2, but it
doesn't work for me. In my case the Hive query always creates 8 map
tasks and 1 reduce task. Could the number of reduce tasks be limited by
the number of map tasks, so that if I wanted 2 reduce tasks I would need
to increase the number of map tasks to 16 in my case?

 

-Thomas

 

From: Bejoy KS [mailto:bejoy_ks@yahoo.com] 
Sent: Saturday, April 28, 2012 1:43 AM
To: user@hive.apache.org
Subject: Re: How to make the query compiler not determine the number of
reducers?

 

Hi Thomas,
Hive automatically sets the number of reducers for you, but you can
easily override it from the CLI. Before executing your query, run:
hive> SET mapred.reduce.tasks=n;

where n is the required number of reducers.

Regards
Bejoy KS

Sent from handheld, please excuse typos.
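
As a short sketch of the settings involved: besides mapred.reduce.tasks, Hive has two properties that influence its automatic reducer estimate. The numeric values below are illustrative assumptions, not recommendations. None of these settings is expected to help when Hive fixes the reducer count at compile time, which is what the "determined at compile time" message in the original question suggests.

```sql
-- Force a fixed number of reducers for later queries in this session.
-- A value of -1 hands the decision back to Hive.
SET mapred.reduce.tasks=3;

-- Alternatively, keep the count automatic and tune the estimate.
-- Target input size per reducer, in bytes (illustrative value):
SET hive.exec.reducers.bytes.per.reducer=256000000;
-- Upper bound on the number of reducers (illustrative value):
SET hive.exec.reducers.max=8;
```

Comments are kept on their own lines because a trailing comment after a SET statement may be read as part of the value by the CLI.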

  _____  

From: "Ryabin, Thomas" <Tom.Ryabin@McKesson.com> 

Date: Fri, 27 Apr 2012 16:48:25 -0400

To: <user@hive.apache.org>

ReplyTo: user@hive.apache.org 

Subject: How to make the query compiler not determine the number of
reducers?

 

Hi,

 

When I run a query that uses a custom UDF I made, one of the lines it
prints out is:

Number of reduce tasks determined at compile time: 1

 

And this causes the MapReduce job to have only 1 reducer. Is there a way
to make it so the compiler does not determine the number of reduce tasks
to create, so I can specify the number myself?

 

The query in question is:

select test_udf(name, store) from employees join stores;

 

Thanks,

Thomas
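
To see where the single reducer comes from, the plan can be inspected without running the job. This is a minimal sketch that assumes test_udf has already been registered in the session. A join with no join key gives Hive nothing to partition on, which is a likely reason the reducer count for that stage is fixed at 1.

```sql
-- Print the stage plan for the query instead of executing it.
EXPLAIN
SELECT test_udf(name, store) FROM employees JOIN stores;
```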

