hadoop-user mailing list archives

From Christoph Schmitz <christoph.schm...@1und1.de>
Subject RE: mortbay, huge files and the ulimit
Date Wed, 29 Aug 2012 14:23:49 GMT
Hi Elmar,

I don't know about the technicalities of your problem, but why don't you use a reduce-side
join in the first place? (I.e., a sort-merge join instead of a hash join.)

<short-explanation-skip-if-you-already-know>
Assuming you want to self-join a table of the form (FROM, TO) in order to compute all paths
of length two, you'd emit each (FROM, TO) tuple twice, once with FROM as key and once with
TO as key, plus a tag indicating which is which:

(1, 47), (1, 12) and (47, 15) become the following key-value pairs, as the reducers will see them:

((1, F), (1, 12))
((1, F), (1, 47))
((12, T), (1, 12))
((15, T), (47, 15)) 
((47, F), (47, 15)) 
((47, T), (1, 47))

F and T indicate whether the key was a FROM or a TO value.

You'll make Hadoop group by the first component of the key (in practice via a custom grouping comparator), so that both "47" tuples that need to be joined arrive in the same reduce call. From those, you build the resulting
(1, 47, 15) tuple; a rough code sketch follows below.

(A detailed description is in Tom White's O'Reilly book, "Hadoop: The Definitive Guide".)
</short-explanation-skip-if-you-already-know>
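
In code, a minimal sketch of this could look like the following (untested, assuming clean comma-separated (FROM,TO) lines and the org.apache.hadoop.mapreduce API; all class names are made up for illustration). The partitioner and the grouping comparator implement the "group by the first component of the key" part, and the tag is carried in the value as well, simply for convenience in the reducer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PathJoin {

  // Emit each (FROM,TO) edge twice: keyed on FROM (tag F) and keyed on TO (tag T).
  public static class EdgeMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] edge = line.toString().split(",");
      String from = edge[0].trim(), to = edge[1].trim();
      ctx.write(new Text(from + "\tF"), new Text("F:" + from + "," + to));
      ctx.write(new Text(to + "\tT"), new Text("T:" + from + "," + to));
    }
  }

  // Partition on the join value alone, so (47, F) and (47, T) go to the same reducer.
  public static class JoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      String joinValue = key.toString().split("\t")[0];
      return (joinValue.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the join value alone, so one reduce() call sees both tags for "47".
  public static class JoinGroupingComparator extends WritableComparator {
    public JoinGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return a.toString().split("\t")[0].compareTo(b.toString().split("\t")[0]);
    }
  }

  // F-tagged edges (k, c) sort first within a group, so only they are buffered;
  // T-tagged edges (a, k) are streamed against them to emit paths (a, k, c).
  public static class PathReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String k = key.toString().split("\t")[0];
      List<String> outgoing = new ArrayList<String>();
      for (Text v : values) {
        String[] tagged = v.toString().split(":", 2);
        String[] edge = tagged[1].split(",");
        if ("F".equals(tagged[0])) {
          outgoing.add(edge[1]);                       // remember c from (k, c)
        } else {
          for (String c : outgoing) {                  // (a, k) joins every (k, c)
            ctx.write(new Text(edge[0] + "," + k + "," + c), NullWritable.get());
          }
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "length-2 path self-join");
    job.setJarByClass(PathJoin.class);
    job.setMapperClass(EdgeMapper.class);
    job.setPartitionerClass(JoinPartitioner.class);
    job.setGroupingComparatorClass(JoinGroupingComparator.class);
    job.setReducerClass(PathReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because "47\tF" sorts before "47\tT", the reducer only ever buffers the outgoing edges of a single join value, so its memory use is bounded by the largest out-degree rather than by the size of the whole file.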

Hope this helps (and kind regards to the Kassel gang ;-),

Christoph

-----Original Message-----
From: Björn-Elmar Macek [mailto:macek@cs.uni-kassel.de]
Sent: Wednesday, 29 August 2012 15:54
To: user@hadoop.apache.org
Subject: mortbay, huge files and the ulimit

Hi there,

I am currently running a job in which I self-join a 63 gigabyte CSV file
on 20 physically distinct nodes with 15 GB each.

While the mapping works just fine and is cheap, the reducer does the
main work: it holds a hashmap with elements to join against and finds
join tuples for every incoming key-value pair.

The job works perfectly on small files of around 2 gigabytes, but becomes
"unstable" as the file size grows; this shows up in the tasktracker
logs as:

ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
     at org.mortbay.jetty.Response.resetBuffer(Response.java:1023)
     at org.mortbay.jetty.Response.sendError(Response.java:240)
     at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3945)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
     at org.mortbay.jetty.Server.handle(Server.java:326)
     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


While this is not a problem at the beginning of the reduce phase, where
it happens only rarely and on a few nodes, it becomes critical as the
reduce progress increases. As far as I can tell from what I have read,
the reason is memory or file-handle exhaustion. I addressed the memory
problem by continuously purging outdated elements from the map every
5 million processed key-value pairs, and I set mapred.child.ulimit to
100000000 (ulimit in the shell tells me it is 400000000).
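
For illustration only (this is not my actual code, just the shape of it, and the notion of "outdated" is reduced to a placeholder criterion), the purging pattern looks roughly like this:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative only: join state is held across reduce() calls in instance fields
// and trimmed periodically so the heap does not fill up over the whole reduce task.
public class HashJoinReducer extends Reducer<Text, Text, Text, Text> {
  private static final long PURGE_INTERVAL = 5000000L;          // pairs between purges
  private final Map<String, Long> joinState = new HashMap<String, Long>();
  private long processed = 0;

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // ... look up join partners in joinState and emit matches via ctx.write() ...
      joinState.put(key.toString(), processed);                 // remember when last seen
      if (++processed % PURGE_INTERVAL == 0) {
        purge(processed - PURGE_INTERVAL);
      }
    }
  }

  // Placeholder criterion for "outdated": drop entries untouched since the last purge.
  private void purge(long threshold) {
    Iterator<Map.Entry<String, Long>> it = joinState.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() < threshold) {
        it.remove();
      }
    }
  }
}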

Still, I keep running into those mortbay errors, and I am starting to
wonder whether Hadoop can handle this job with this algorithm at all.
By naive math it should: I explicitly assigned 10 GB of memory to each
JVM on each node and set mapred.child.java.opts to "-Xmx10240m
-XX:+UseCompressedOops -XX:-UseGCOverheadLimit" (it is a 64-bit
environment, and the large data structures otherwise make the GC throw
exceptions). That naively gives 18 slave machines with 10 GB each, i.e.
180 GB overall, three times as much as needed, I would think. So if the
Partitioner distributes the data roughly equally across all nodes, I
should not run into any errors, should I?
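
For reference, these are the relevant settings as they could be set programmatically on the job Configuration (they can equally go into mapred-site.xml); the property names are the pre-YARN Hadoop 1.x ones, and the class name here is just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobMemorySettings {
  public static Job newJob(String name) throws Exception {
    Configuration conf = new Configuration();
    // Heap size and GC flags for every map/reduce child JVM.
    conf.set("mapred.child.java.opts",
        "-Xmx10240m -XX:+UseCompressedOops -XX:-UseGCOverheadLimit");
    // Virtual-memory ulimit for child processes, in kilobytes.
    conf.set("mapred.child.ulimit", "100000000");
    return new Job(conf, name);
  }
}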

Can anybody help me with this issue?

Best regards,
Elmar

