hive-user mailing list archives

From Gabriel Balan <gabriel.ba...@oracle.com>
Subject Re: Implementing a custom StorageHandler
Date Wed, 06 Jul 2016 22:08:53 GMT
Hi

> What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?

There are two sets of APIs: the old (in the "mapred" package) and the new (in the "mapreduce"
package).

The old was deprecated, as the new was meant to replace it. But then the old API got undeprecated,
and now they're both maintained.

When building a /hadoop/ job, pick all bits and pieces (input format, mapper, reducer) from
the same API.

When dealing with /hive/, you want the /mapred/ API.
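
Here's a rough sketch (my own illustration, not from Hive's code; the class name and key/value types are made up) of the old-API interface a storage handler's input format implements:

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MyInputFormat implements InputFormat<NullWritable, Text> {

        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            // numSplits is only a hint; return however many splits make sense
            // for the underlying data source.
            return new InputSplit[0];
        }

        @Override
        public RecordReader<NullWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // Return a reader that yields one key/value pair at a time from the split.
            throw new UnsupportedOperationException("sketch only");
        }
    }

Your HiveStorageHandler's getInputFormatClass() would return a class like this. The new-API org.apache.hadoop.mapreduce.InputFormat is an abstract class with a different getSplits signature (no numSplits hint), so the two are not interchangeable.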

> How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?

The numSplits arg is a hint; you can return a different number of splits.

The value of the numSplits arg comes from the conf, and you can set it with -D, -conf, or
through JobConf.setNumMapTasks(int n) <https://hadoop.apache.org/docs/r2.6.2/api/src-html/org/apache/hadoop/mapred/JobConf.html#line.1335>:

> Set the number of map tasks for this job.
>
> /Note/: This is only a /hint/ to the framework. The actual number of spawned map tasks
depends on the number of |InputSplit| <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputSplit.html>s
generated by the job's |InputFormat.getSplits(JobConf, int)| <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html#getSplits%28org.apache.hadoop.mapred.JobConf,%20int%29>.
A custom |InputFormat| <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html>
is typically used to accurately control the number of map tasks for the job.
>
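
For example (my own snippet, not from the docs), setting the hint programmatically:

    import org.apache.hadoop.mapred.JobConf;

    public class SplitHintExample {
        public static void main(String[] args) {
            JobConf job = new JobConf();
            job.setNumMapTasks(16);   // same as -D mapred.map.tasks=16
                                      // (mapreduce.job.maps in newer Hadoop)
            // The framework then calls yourInputFormat.getSplits(job, 16);
            // the split array you actually return determines the real number of map tasks.
        }
    }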
Heads up #1. For mapred file input formats, the user specifies the number of splits. For mapreduce
file input formats, the user can control the number of splits by specifying the lower and
upper bounds on the size of splits.
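
If it helps, here's what that looks like with the new API (again just a sketch of mine, assuming Hadoop 2.x and the mapreduce.lib.input.FileInputFormat helpers):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
            // Equivalent configuration keys:
            //   mapreduce.input.fileinputformat.split.minsize
            //   mapreduce.input.fileinputformat.split.maxsize
        }
    }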

Heads up #2. Most FileInputFormat implementations will give you 1 or more splits for each
file in the input set. Hive will try to use a /Combine/ input format, which combines small
files/splits into larger splits.
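
For instance (standard Hive settings, not specific to your handler; check the defaults in your version), from the Hive CLI:

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    set mapred.max.split.size=256000000;   -- upper bound on a combined split, in bytes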

hth

Gabriel Balan

On 6/27/2016 6:59 PM, Long, Andrew wrote:
>
> Hello everyone,
>
> I’m in the process of implementing a custom StorageHandler and I had some questions.
>
> 1) What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?
>
> 2) How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?
>
> 3) Is there a way to enforce a maximum number of splits?  What would happen if I ignored
numSplits and just returned an array of splits that was the actual maximum number of splits?
>
> 4) How is InputSplit.getLocations() used?  If I’m accessing non-HDFS resources, what should
I return?  Currently I’m just returning an empty array.
>
> Thanks for your time,
>
> Andrew Long
>

-- 
The statements and opinions expressed here are my own and do not necessarily represent those
of Oracle Corporation.

