hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: mapred.map.tasks getting set, but not sure where
Date Fri, 04 Nov 2011 18:11:48 GMT
Could it just be that Cassandra has changed the way its splits are generated? Were the Cassandra
client libs changed at any point? Have you looked at its input format's sources?
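For context on why a split count could end up pinned at exactly 20: the old-API getSplits(JobConf, int numSplits) contract passes the input format a hint derived from mapred.map.tasks, and an input format is free to treat that hint as a cap. Below is a minimal self-contained sketch of that cap behavior (plain Java, no Hadoop dependency; cappedSplits and splitsBySize are illustrative stand-ins, not Cassandra's actual ColumnFamilyInputFormat code):

```java
// Illustrative sketch only: mimics how an InputFormat's getSplits(conf, numSplits)
// could cap the split count using the mapred.map.tasks hint. This is NOT
// Cassandra's actual split logic; it just shows the symptom described below.
public class SplitCapSketch {
    // Number of splits implied purely by the requested split size.
    static long splitsBySize(long totalUnits, long splitSize) {
        return (totalUnits + splitSize - 1) / splitSize; // ceiling division
    }

    // Hypothetical getSplits behavior: honor the requested split size for
    // boundaries, but never return more splits than the numSplits hint.
    static long cappedSplits(long totalUnits, long splitSize, int numSplitsHint) {
        return Math.min(splitsBySize(totalUnits, splitSize), numSplitsHint);
    }

    public static void main(String[] args) {
        // ~4000 natural splits, but a hint of 20 caps the count at 20.
        // Changing the split size changes the boundaries, not the count.
        System.out.println(cappedSplits(4_000_000, 1_000, 20)); // prints 20
        System.out.println(cappedSplits(4_000_000, 500, 20));   // prints 20
    }
}
```

If the input format does something like this, fiddling with the split size would change split boundaries while the total stays pinned at the hint value, which matches the behavior reported below.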

On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:

> Plain Java MR, using the Cassandra inputFormat to read out of Cassandra.
> Perhaps somebody hacked the inputFormat code on me...
> But what's weird is that the parameter mapred.map.tasks didn't appear in
> the job confs before at all.  Now it does, with a value of 20 (happens to
> be the # of machines in the cluster), and that's without the jobs or the
> mapred-site.xml files themselves changing.
> The inputSplitSize is set specifically in the jobs, and has not been
> changed (except I subsequently fiddled with it a little to see if it
> affected the fact that I was getting 20 splits, and it didn't affect
> that...just the split size, not the number).
> After I submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20",
> followed by a list of the input splits...it sort of looks like a hack, but I
> can't find where it is.
> On Fri, Nov 4, 2011 at 11:58 AM, Harsh J <harsh@cloudera.com> wrote:
>> Brendan,
>> Are these jobs (whose split behavior has changed) via Hive/etc. or plain
>> Java MR?
>> In case it's the former, do you have users running newer versions of them?
>> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
>>> Hi,
>>> On my cluster of 20 machines, I used to run jobs (via
>>> "hadoop jar ...") that would spawn around 4000 map tasks.  Now when I run
>>> the same jobs, that number is 20; and I notice that in the job
>>> configuration, the parameter mapred.map.tasks is set to 20, whereas it
>>> never used to be present at all in the configuration file.
>>> Changing the input split size in the job doesn't affect this--I get the
>>> size split I ask for, but the *number* of input splits is still capped at
>>> 20--i.e., the job isn't reading all of my data.
>>> The mystery to me is where this parameter could be getting set.  It is not
>>> present in the mapred-site.xml file in <hadoop home>/conf on any machine in
>>> the cluster, and it is not being set in the job (I'm running out of the
>>> same jar I always did; no updates).
>>> Is there *anywhere* else this parameter could possibly be getting set?
>>> I've stopped and restarted map-reduce on the cluster with no effect...it's
>>> getting re-read in from somewhere, but I can't figure out where.
>>> Thanks a lot,
>>> Brendan
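One way a property can appear in a job conf without any site file changing: Hadoop's Configuration layers resources in order, and whichever resource sets a key last wins, so a client library can inject a default at job-submission time. Here is a self-contained sketch of that last-writer-wins layering (plain Java, java.util only; LayeredConf is a stand-in for illustration, not Hadoop's real Configuration class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Hadoop's layered Configuration (not the real class):
// resources are applied in order, later resources overwrite earlier ones,
// and we remember which resource last set each key.
public class LayeredConf {
    private final Map<String, String> values = new LinkedHashMap<>();
    private final Map<String, String> sources = new LinkedHashMap<>();

    public void addResource(String resourceName, Map<String, String> props) {
        for (Map.Entry<String, String> e : props.entrySet()) {
            values.put(e.getKey(), e.getValue());
            sources.put(e.getKey(), resourceName); // last writer wins
        }
    }

    public String get(String key) { return values.get(key); }

    // Which resource supplied the current value of this key?
    public String getPropertySource(String key) { return sources.get(key); }

    public static void main(String[] args) {
        LayeredConf conf = new LayeredConf();
        // mapred-site.xml never mentions mapred.map.tasks...
        conf.addResource("mapred-site.xml", Map.of("mapred.reduce.tasks", "10"));
        // ...but a client library can inject it when the job is built.
        conf.addResource("client-library-defaults", Map.of("mapred.map.tasks", "20"));

        System.out.println(conf.get("mapred.map.tasks"));               // prints 20
        System.out.println(conf.getPropertySource("mapred.map.tasks")); // prints client-library-defaults
    }
}
```

On a real cluster, Configuration.getPropertySources("mapred.map.tasks") (present in later Hadoop releases; check it exists in your version) reports where a live job conf got its value, which is one way to track down the injector.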
