hadoop-mapreduce-user mailing list archives

From: Harsh J <harsh@cloudera.com>
Subject: Re: Deserialization issue.
Date: Mon, 30 Jul 2012 15:53:21 GMT
Btw, do speak to the Gora folks about fixing or at least documenting this
flaw. I can imagine others hitting the same issue :)

On Mon, Jul 30, 2012 at 9:22 PM, Harsh J <harsh@cloudera.com> wrote:
> I've mostly done it with logging, but this JIRA may interest you if
> you still wish to attach a remote debugger to tasks:
> https://issues.apache.org/jira/browse/MAPREDUCE-2637
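>
> By logging I just mean the usual pattern; a rough sketch (the mapper
> class here is only illustrative, not from your job):
>
> public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
>   private static final Log LOG = LogFactory.getLog(MyMapper.class);
>
>   @Override
>   protected void map(LongWritable key, Text value, Context context)
>       throws IOException, InterruptedException {
>     // Anything logged here lands in the task attempt's syslog/stdout
>     // files, viewable from the JT web UI or under userlogs/ on the node.
>     LOG.info("map input: " + key + " => " + value);
>     context.write(value, new IntWritable(1));
>   }
> }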
>
> On Mon, Jul 30, 2012 at 7:28 PM, Sriram Ramachandrasekaran
> <sri.rams85@gmail.com> wrote:
>> Harsh,
>> I was waiting to try it on my cluster before I came back to report if it
>> worked or not.
>> I tried it and it works; the site-wide configuration did the trick.
>> IOUtils.conf.addResource("job.xml") does the same thing as
>> GoraMapReduceUtils.setIOSerializations(), so it did not help.
>>
>> Thanks for the help. I would still like to know what a better way to
>> debug distributed MapReduce jobs would be. I know I can debug
>> stand-alone jobs quite easily, but I would like to know how folks
>> debug distributed MapReduce jobs.
>>
>> Thanks again!
>> -Sriram
>>
>>
>> On Sat, Jul 28, 2012 at 6:20 AM, Sriram Ramachandrasekaran
>> <sri.rams85@gmail.com> wrote:
>>>
>>> Aah! I always thought about setting io.serializations at the job
>>> level; I never thought of this. Will try this site-wide thing. Thanks
>>> again.
>>>
>>> On 28 Jul 2012 06:16, "Harsh J" <harsh@cloudera.com> wrote:
>>>>
>>>> Ah, that may be because the core-site.xml has the io.serializations
>>>> property fully defined for Gora as well? You can use that as an
>>>> alternative fix: supply a core-site.xml across the tasktrackers that
>>>> also carries the serialization classes Gora requires. I had failed to
>>>> think of that as a solution.
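>>>>
>>>> Roughly, the equivalent in code would be something like this (an
>>>> untested sketch; the Gora serialization class names are from memory,
>>>> so please double-check them against the Gora source). The same
>>>> comma-separated value is what you'd carry in core-site.xml on every
>>>> tasktracker:
>>>>
>>>> Configuration conf = job.getConfiguration();
>>>> // Keep the stock Writable serialization and append Gora's own,
>>>> // rather than replacing the list wholesale.
>>>> conf.setStrings("io.serializations",
>>>>     "org.apache.hadoop.io.serializer.WritableSerialization",
>>>>     "org.apache.gora.mapreduce.StringSerialization",
>>>>     "org.apache.gora.mapreduce.PersistentSerialization");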
>>>>
>>>> On Sat, Jul 28, 2012 at 6:04 AM, Sriram Ramachandrasekaran
>>>> <sri.rams85@gmail.com> wrote:
>>>> > Okay. But this issue didn't present itself when run in standalone
>>>> > mode. :)
>>>> >
>>>> > On 28 Jul 2012 06:02, "Harsh J" <harsh@cloudera.com> wrote:
>>>> >>
>>>> >> I find it easier to run jobs via MRUnit (http://mrunit.apache.org,
>>>> >> TDD) first, or via LocalJobRunner, for debug purposes.
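>>>> >>
>>>> >> For instance, a minimal sketch (untested) of forcing a job through
>>>> >> the LocalJobRunner so everything runs in one JVM you can step
>>>> >> through:
>>>> >>
>>>> >> Configuration conf = new Configuration();
>>>> >> // MR1: run the job in-process instead of submitting to a JobTracker.
>>>> >> conf.set("mapred.job.tracker", "local");
>>>> >> // Use the local filesystem for inputs/outputs while debugging.
>>>> >> conf.set("fs.default.name", "file:///");
>>>> >> Job job = new Job(conf, "debug-run");
>>>> >> // ... set mapper, reducer, input and output paths as usual, then:
>>>> >> job.waitForCompletion(true);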
>>>> >>
>>>> >> On Sat, Jul 28, 2012 at 5:53 AM, Sriram Ramachandrasekaran
>>>> >> <sri.rams85@gmail.com> wrote:
>>>> >> > Hello Harsh,
>>>> >> > Thanks for your investigations. While we were debugging, I saw
>>>> >> > the exact same thing. As you pointed out, we suspected it to be a
>>>> >> > problem, so we set the job conf object directly on Gora's query
>>>> >> > object. It goes something like this:
>>>> >> > query.setConf(job.getConfiguration())
>>>> >> >
>>>> >> > And then I saw that it was not getting into creating a new object
>>>> >> > at getOrCreateConf().
>>>> >> >
>>>> >> > OTOH, I've not tried the job.xml thing. I should give it a try,
>>>> >> > and I shall keep the thread posted.
>>>> >> >
>>>> >> > I would also like to hear about standard practices for debugging
>>>> >> > distributed MR tasks.
>>>> >> >
>>>> >> > -----
>>>> >> > Reply from a handheld device. Please excuse typos and lack of
>>>> >> > formatting.
>>>> >> >
>>>> >> > On 28 Jul 2012 03:30, "Harsh J" <harsh@cloudera.com> wrote:
>>>> >> >>
>>>> >> >> Hi Sriram,
>>>> >> >>
>>>> >> >> I suspect the following in Gora to somehow be causing this issue:
>>>> >> >>
>>>> >> >> IOUtils source:
>>>> >> >> http://svn.apache.org/viewvc/gora/trunk/gora-core/src/main/java/org/apache/gora/util/IOUtils.java?view=markup
>>>> >> >>
>>>> >> >> QueryBase source:
>>>> >> >> http://svn.apache.org/viewvc/gora/trunk/gora-core/src/main/java/org/apache/gora/query/impl/QueryBase.java?view=markup
>>>> >> >>
>>>> >> >> Notice that IOUtils.deserialize(…) calls expect a proper
>>>> >> >> Configuration
>>>> >> >> object. If not passed (i.e., if null), they call the following.
>>>> >> >>
>>>> >> >> private static Configuration getOrCreateConf(Configuration conf) {
>>>> >> >>   if (conf == null) {
>>>> >> >>     if (IOUtils.conf == null) {
>>>> >> >>       IOUtils.conf = new Configuration();
>>>> >> >>     }
>>>> >> >>   }
>>>> >> >>   return conf != null ? conf : IOUtils.conf;
>>>> >> >> }
>>>> >> >>
>>>> >> >> Now QueryBase has, in its readFields method, some
>>>> >> >> IOUtils.deserialize(…) calls that seem to pass a null for the
>>>> >> >> configuration object. The IOUtils.deserialize(…) method hence
>>>> >> >> calls the above method and initializes a whole new Configuration
>>>> >> >> object, as the passed conf object is null.
>>>> >> >>
>>>> >> >> If it does that, it would not be loading the "job.xml" file
>>>> >> >> contents, which is the job's config file (that's something the
>>>> >> >> map task's config alone loads, and not a file that's loaded by
>>>> >> >> default). Hence, custom serializers will disappear the moment it
>>>> >> >> begins using this new Configuration object.
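>>>> >> >>
>>>> >> >> In other words, roughly what happens inside the task JVM (a
>>>> >> >> sketch):
>>>> >> >>
>>>> >> >> Configuration fresh = new Configuration();
>>>> >> >> // Loads only core-default.xml and core-site.xml, never job.xml,
>>>> >> >> // so io.serializations falls back to the stock serializers and
>>>> >> >> // the Gora entries set on the job are lost.
>>>> >> >> String serializers = fresh.get("io.serializations");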
>>>> >> >>
>>>> >> >> This is what you'll want to investigate and fix, or notify the
>>>> >> >> Gora devs about (why QueryBase#readFields uses a null object, and
>>>> >> >> whether it can reuse some already-set conf object). As a cheap
>>>> >> >> hack fix, maybe doing the following will make it work in an MR
>>>> >> >> environment?
>>>> >> >>
>>>> >> >> IOUtils.conf = new Configuration();
>>>> >> >> IOUtils.conf.addResource("job.xml");
>>>> >> >>
>>>> >> >> I haven't tried the above, but let us know how we can be of
>>>> >> >> further assistance. An ideal fix would be to only use the
>>>> >> >> MapTask's provided Configuration object everywhere, somehow, and
>>>> >> >> never re-create one.
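>>>> >> >>
>>>> >> >> As a sketch of that pattern (class and field names here are only
>>>> >> >> illustrative, not Gora's actual ones): make the query Configurable
>>>> >> >> so ReflectionUtils.newInstance() hands it the task's conf, and use
>>>> >> >> that conf inside readFields() instead of null.
>>>> >> >>
>>>> >> >> public class MyQuery implements Writable, Configurable {
>>>> >> >>   private Configuration conf;
>>>> >> >>
>>>> >> >>   // ReflectionUtils.newInstance() calls setConf() on Configurable
>>>> >> >>   // instances, so the task's job.xml-backed conf arrives here.
>>>> >> >>   public void setConf(Configuration conf) { this.conf = conf; }
>>>> >> >>   public Configuration getConf() { return conf; }
>>>> >> >>
>>>> >> >>   public void readFields(DataInput in) throws IOException {
>>>> >> >>     // Hand this.conf (never a fresh Configuration) to whatever
>>>> >> >>     // IOUtils.deserialize(...) calls are made in here.
>>>> >> >>   }
>>>> >> >>   public void write(DataOutput out) throws IOException { /* ... */ }
>>>> >> >> }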
>>>> >> >>
>>>> >> >> P.s. If you want a thread ref link to share with the other devs
>>>> >> >> over at Gora, here it is: http://search-hadoop.com/m/BXZA4dTUFC
>>>> >> >>
>>>> >> >> On Fri, Jul 27, 2012 at 1:24 PM, Sriram Ramachandrasekaran
>>>> >> >> <sri.rams85@gmail.com> wrote:
>>>> >> >> > Hello,
>>>> >> >> > I have an MR job that talks to HBase. I use Gora to talk to
>>>> >> >> > HBase. Gora also provides a couple of classes which can be
>>>> >> >> > extended to write Mappers and Reducers, if the Mappers need
>>>> >> >> > input from an HBase store and the Reducers need to write it
>>>> >> >> > out to an HBase store. This is the reason why I use Gora.
>>>> >> >> >
>>>> >> >> > Now, when I run my MR job, I get an exception as below.
>>>> >> >> > (https://issues.apache.org/jira/browse/HADOOP-3093)
>>>> >> >> > java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
>>>> >> >> >   at org.apache.gora.mapreduce.GoraInputFormat.setConf(GoraInputFormat.java:115)
>>>> >> >> >   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
>>>> >> >> >   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>>>> >> >> >   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:723)
>>>> >> >> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>> >> >> >   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>> >> >> >   at java.security.AccessController.doPrivileged(Native Method)
>>>> >> >> >   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> >> >> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>>> >> >> >   at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>> >> >> > Caused by: java.io.IOException: java.lang.NullPointerException
>>>> >> >> >   at org.apache.gora.util.IOUtils.loadFromConf(IOUtils.java:483)
>>>> >> >> >   at org.apache.gora.mapreduce.GoraInputFormat.getQuery(GoraInputFormat.java:125)
>>>> >> >> >   at org.apache.gora.mapreduce.GoraInputFormat.setConf(GoraInputFormat.java:112)
>>>> >> >> >   ... 9 more
>>>> >> >> > Caused by: java.lang.NullPointerException
>>>> >> >> >   at org.apache.hadoop.io.serializer.SerializationFactory.getDeserializer(SerializationFactory.java:77)
>>>> >> >> >   at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:205)
>>>> >> >> >   at org.apache.gora.query.impl.QueryBase.readFields(QueryBase.java:234)
>>>> >> >> >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>> >> >> >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>> >> >> >   at org.apache.hadoop.io.DefaultStringifier.fromString(DefaultStringifier.java:75)
>>>> >> >> >   at org.apache.hadoop.io.DefaultStringifier.load(DefaultStringifier.java:133)
>>>> >> >> >   at org.apache.gora.util.IOUtils.loadFromConf(IOUtils.java:480)
>>>> >> >> >   ... 11 more
>>>> >> >> >
>>>> >> >> > I tried the following things to work through this issue.
>>>> >> >> > 0. The stack trace indicates that, when setting up a new
>>>> >> >> > Mapper, it is unable to deserialize something. (I could not
>>>> >> >> > understand where it fails.)
>>>> >> >> > 1. I looked around the forums and realized that serialization
>>>> >> >> > options are not getting passed, so I tried setting the
>>>> >> >> > io.serializations config on the job.
>>>> >> >> >    1.1. I am not setting "io.serializations" myself; I use
>>>> >> >> > GoraMapReduceUtils.setIOSerializations() to do it. I verified
>>>> >> >> > that the confs are getting the proper serializers.
>>>> >> >> > 2. I checked the job XML to see whether these confs had got
>>>> >> >> > through; they had. But it failed again.
>>>> >> >> > 3. I tried starting the Hadoop job runner with debug options
>>>> >> >> > turned on and in suspend mode (-Xdebug, suspend=y), and I also
>>>> >> >> > set the VM options for the mapred child tasks via
>>>> >> >> > mapred.child.java.opts, to see if I could debug the VM that
>>>> >> >> > gets spawned newly. Although I get a message on my stdout
>>>> >> >> > saying "opening port X and waiting", when I try to attach a
>>>> >> >> > remote debugger to that port, it does not work.
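>>>> >> >> >
>>>> >> >> > (For reference, the child opts I mean are roughly of this form;
>>>> >> >> > this is only a sketch with an illustrative port, in case someone
>>>> >> >> > spots what I am doing wrong:)
>>>> >> >> >
>>>> >> >> > // In the job driver, before submission (illustrative values):
>>>> >> >> > conf.set("mapred.child.java.opts",
>>>> >> >> >     "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,"
>>>> >> >> >     + "suspend=y,address=8000");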
>>>> >> >> >
>>>> >> >> > I understand that when SerializationFactory tries to
>>>> >> >> > deserialize 'something', it does not find an appropriate
>>>> >> >> > unmarshaller and so it fails. But I would like to know a way
>>>> >> >> > to find that 'something', and I would like to get some idea of
>>>> >> >> > how (pseudo-)distributed MR jobs should generally be debugged.
>>>> >> >> > I tried searching but did not find anything useful.
>>>> >> >> >
>>>> >> >> > Any help/pointers would be greatly useful.
>>>> >> >> >
>>>> >> >> > Thanks!
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > It's just about how deep your longing is!
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Harsh J
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Harsh J
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>
>>
>>
>>
>> --
>> It's just about how deep your longing is!
>>
>
>
>
> --
> Harsh J



-- 
Harsh J
