crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-486) Join with custom Writable PType registered using Writables.registerComparable NPEs during shuffle
Date Wed, 07 Jan 2015 23:29:35 GMT


Josh Wills commented on CRUNCH-486:

Wow, that was an awesome JIRA report, thanks! Fixing the small secondary issue should be
no problem; I'll try the mapreduce.job.output.key.comparator.class trick and see if it allows
us to hack around this, if only for Hadoop versions 2.0 through 2.4.

> Join with custom Writable PType registered using Writables.registerComparable NPEs during shuffle
> -------------------------------------------------------------------------------------------------
>                 Key: CRUNCH-486
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Brandon Vargo
>            Assignee: Josh Wills
>            Priority: Minor
> When joining two PTables on a key that is a custom writable PType, the shuffler will
fail with the following NullPointerException under Hadoop2 if the custom type has been registered
using Writables.registerComparable. This happens regardless of whether a specific integer
code is provided or the default hashCode()-based value is used.
> {noformat}
> org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError:
Error while doing final merge 
> 	at
> 	at
> 	at org.apache.hadoop.mapred.YarnChild$
> 	at Method)
> 	at
> 	at
> 	at org.apache.hadoop.mapred.YarnChild.main(
> Caused by: java.lang.NullPointerException
> 	at java.lang.Class.isAssignableFrom(Native Method)
> 	at org.apache.crunch.types.writable.TupleWritable$Comparator.compareField(
> 	at org.apache.crunch.types.writable.TupleWritable$
> 	at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(
> 	at org.apache.hadoop.util.PriorityQueue.upHeap(
> 	at org.apache.hadoop.util.PriorityQueue.put(
> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(
> 	at org.apache.hadoop.mapred.Merger.merge(
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.finalMerge(
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.close(
> 	at
> 	... 6 more
> {noformat}
> It appears that the Writables.WRITABLE_CODES entries are not deserialized from the configuration
during the shuffle phase of a join until TupleWritable.setConf() is called. However, because
TupleWritable.Comparator is registered as a raw comparator for TupleWritable, the shuffler
uses the comparator without instantiating or configuring a TupleWritable instance. As a result,
the type codes for the custom types are not available when the comparator starts to run.
> HADOOP-10686 made WritableComparator implement Configurable, but this was not released
until Hadoop 2.5. If I build Crunch against Hadoop 2.5 and copy TupleWritable's setConf()
function to TupleWritable.Comparator, then the shuffle works as expected. However, since Crunch
currently targets Hadoop 2.2, this does not work for the current version of Crunch.
> As a workaround, it appears that if the {{mapreduce.job.output.key.comparator.class}}
property is set in the configuration, then the instance is created in JobConf.getOutputKeyComparator()
using ReflectionUtils instead of using the WritableComparator registration. ReflectionUtils
will pass the configuration to anything that implements Configurable, so setting {{mapreduce.job.output.key.comparator.class}}
to TupleWritable.Comparator and implementing Configurable might work for Hadoop versions older
than 2.5. I have yet to try this, though, and I have not looked into Hadoop1 to see if this
would also work there.
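
The mechanism the workaround relies on can be sketched without Hadoop on the classpath. Everything below is a hypothetical stand-in, not Crunch or Hadoop code: `SketchConfigurable` plays the role of Hadoop's Configurable, a plain `Map<String, String>` plays the role of the job configuration, `SketchReflection.newInstance` mimics the behavior the report attributes to ReflectionUtils, and the `sketch.writable.codes` property and its `code=className` encoding are invented for illustration. The point is only that a comparator created through a configuration-aware factory sees the type codes, while one created bare does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configurable, using a plain map as the "configuration".
interface SketchConfigurable {
    void setConf(Map<String, String> conf);
}

// Hypothetical stand-in for a raw comparator that needs type codes from the configuration.
class SketchTupleComparator implements SketchConfigurable {
    // Invented property name and "code=className,code=className" encoding.
    static final String CODES_KEY = "sketch.writable.codes";

    private final Map<Integer, String> codes = new HashMap<>();

    @Override
    public void setConf(Map<String, String> conf) {
        codes.clear();
        String encoded = conf.get(CODES_KEY);
        if (encoded == null || encoded.isEmpty()) {
            return;
        }
        for (String entry : encoded.split(",")) {
            String[] kv = entry.trim().split("=", 2);
            codes.put(Integer.parseInt(kv[0]), kv[1]);
        }
    }

    // Returns null when the code is unknown, which is the situation that leads to the NPE.
    String classNameFor(int code) {
        return codes.get(code);
    }
}

class SketchReflection {
    // Instantiate via the no-arg constructor, then hand over the configuration
    // if (and only if) the instance opts in by implementing SketchConfigurable.
    static <T> T newInstance(Class<T> cls, Map<String, String> conf) {
        try {
            T instance = cls.getDeclaredConstructor().newInstance();
            if (instance instanceof SketchConfigurable) {
                ((SketchConfigurable) instance).setConf(conf);
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + cls.getName(), e);
        }
    }
}
```

A comparator obtained with `SketchReflection.newInstance(SketchTupleComparator.class, conf)` can resolve registered codes, whereas `new SketchTupleComparator()` cannot; whether the real JobConf path behaves the same way on Hadoop 2.0 through 2.4 is exactly what remains to be tried.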
> If the shuffle is able to register the type codes via either method above, then there
is one small secondary issue that I hit: Writables.registerComparable checks if the type code
is already present in the map; if the type code is already in use, then it throws an exception,
even if the class being registered is the same as the existing class. With the type codes
being initialized during the shuffle phase, any later call to registerComparable for the same
type code and class will fail. I currently have my registerComparable call in a static initialization
block for my PType, so it is called whenever my writable type is first used under Crunch;
in this case, it happens when the reduce phase starts. Checking to see if the class being
registered and the existing class are equal inside of registerComparable before throwing an
error, similar to the check in Guava's AbstractBiMap, prevents this exception from being thrown.
> The above was happening using 0.11.0-hadoop2 on Hadoop 2.5.0 (CDH 5.2). The modifications
I mention above were made on top of {{d4f23c4}} and also tested on CDH 5.2.
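
For the secondary issue described above, a minimal sketch of a registration check that tolerates re-registering the same class under the same code might look like the following. `SketchTypeCodeRegistry` is a hypothetical class written for illustration; it is not the actual Writables implementation, and the real registerComparable signature and exception type may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry illustrating an idempotent type-code registration.
class SketchTypeCodeRegistry {
    private final Map<Integer, Class<?>> codes = new HashMap<>();

    synchronized void registerComparable(Class<?> clazz, int code) {
        Class<?> existing = codes.get(code);
        if (existing == null) {
            // First registration for this code.
            codes.put(code, clazz);
            return;
        }
        if (existing.equals(clazz)) {
            // Same code, same class: treat as a no-op instead of an error.
            return;
        }
        // Same code, different class: a genuine conflict.
        throw new IllegalArgumentException(
            "Type code " + code + " is already used by " + existing.getName());
    }

    synchronized Class<?> lookup(int code) {
        return codes.get(code);
    }
}
```

With this shape, a static initializer that registers a type after the codes have already been loaded from the configuration would succeed, while a conflicting registration would still fail loudly.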

This message was sent by Atlassian JIRA
