hadoop-common-user mailing list archives

From Greg Roelofs <roel...@yahoo-inc.com>
Subject Re: Thread safety issues with JNI/native code from map tasks?
Date Fri, 28 Jan 2011 23:50:57 GMT
Keith Wiley wrote:

> (1) Speculative execution would occur on a completely different
> node, so there definitely isn't any thread cross-talk (in memory).
> So long as they don't rely on reading/writing temp files from
> HDFS I don't see how they could have any effect on one another.

Good point.

> (2) I am also getting seg faults when I run in noncluster
> standalone mode, which is a single nonspeculated thread... I
> presume.

That's the same as "pseudo-distributed mode"?

> Can you explain your thoughts on speculative execution w.r.t. the
> problems I'm having?

Thoughts?  You expect me to have thoughts, too??


I had not fully thought through the spec ex idea; it was the only thing
I could think of that might put two (otherwise independent) JNI-using tasks
onto the same node.  But as you point out above, it wouldn't...

Does your .so depend on any other potentially thread-unsafe .so that other
(non-Hadoop) processes might be using?  System libraries like zlib are safe
(else they wouldn't make very good system libraries), but maybe some other
research library or something?  (That's a long shot, but I'm pretty much
grasping at straws here.)

> Yes, not thread safe, but what difference could that make if I
> don't use the library in a multi-threaded fashion?  One map task,
> one node, one Java thread calling JNI and using the native code?
> How do thread safety issues factor into this?  I admit, it's
> my theory that threads might be involved somehow, but I don't
> understand how; I'm just shooting in the dark since I can't
> solve this problem any other way yet.

Since you can reproduce it in standalone mode, can you enable core dumps
so you can see the backtrace of the code that segfaults?  Knowing what
specifically broke and how it got there is always a big help.

Btw, keep in mind that there are memory-related bugs that don't show up
until there's something big in memory that pushes the code in question
up into a region with different data patterns in it (most frequently zero
vs. non-zero, but others are possible).  IOW, maybe the code is dependent
on uninitialized memory, but you were getting lucky when you ran it outside
of Hadoop.  Have you run it through valgrind or Purify or similar?

