hadoop-user mailing list archives

From Björn-Elmar Macek <ma...@cs.uni-kassel.de>
Subject Re: mortbay, huge files and the ulimit
Date Wed, 29 Aug 2012 14:32:06 GMT
Hi Christoph,

Oh, it's so nice to hear from you! :) Greetings back from the gang! :)

Well, I implemented it as you said: it is a merge join.
The join attributes determine the partition the values go to, and the
comparators sort my data, so I can do a merge join and only need to keep
one set of data in which to look up possible merge partners. I just use
HashMaps to get to the possible join partners faster.
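
For the archive, this is roughly what that plumbing looks like (just a sketch, not the
actual job code; the composite key layout "<joinAttribute>\t<tag>" and all class names
are made up for illustration):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Only the join attribute decides the partition; the tag part of the key is
    // ignored, so all tuples sharing a join attribute reach the same reducer.
    class JoinKeyPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text key, Text value, int numPartitions) {
        String joinAttr = key.toString().split("\t", 2)[0];
        return (joinAttr.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // Grouping comparator: reduce() is called once per join attribute, so both
    // sides of the join arrive in the same call; the default Text sort order
    // already decides which tagged side comes first within the group.
    class JoinGroupingComparator extends WritableComparator {
      JoinGroupingComparator() { super(Text.class, true); }
      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        String ka = ((Text) a).toString().split("\t", 2)[0];
        String kb = ((Text) b).toString().split("\t", 2)[0];
        return ka.compareTo(kb);
      }
    }

    // In the driver: job.setPartitionerClass(JoinKeyPartitioner.class);
    //                job.setGroupingComparatorClass(JoinGroupingComparator.class);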

Thank you for the tips nevertheless.

Best regards,
Elmar
On 29.08.2012 16:23, Christoph Schmitz wrote:
> Hi Elmar,
>
> I don't know about the technicalities of your problem, but why don't you use a
> reduce-side join in the first place? (I.e., a sort-merge join instead of a hash join.)
>
> <short-explanation-skip-if-you-already-know>
> Assuming you want to self-join a table of the form (FROM, TO) in order to compute all
> paths of length two, you'd emit each (FROM, TO) tuple twice, once with FROM as key and
> once with TO as key, plus a tag indicating which is which:
>
> (1, 47), (1, 12) and (47, 15) become, in the Reducer:
>
> ((1, F), (1, 12))
> ((1, F), (1, 47))
> ((12, T), (1, 12))
> ((15, T), (47, 15))
> ((47, F), (47, 15))
> ((47, T), (1, 47))
>
> F, T indicate if the key was a FROM or a TO value.
>
> You'll make Hadoop group by the first component of the key, which will allow you to see
> both of the "47" tuples that would be joined in one reducer. From those, you build the
> resulting (1, 47, 15) tuple.
>
> (Detailed description in the O'Reilly/Tom White book.)
> </short-explanation-skip-if-you-already-know>
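>
> In code, the map side of that scheme could look roughly like this (a sketch only,
> assuming tab-separated "FROM\tTO" input lines; class and variable names are made up):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     // Emit every (FROM, TO) tuple twice: keyed by its FROM value (tag "F")
>     // and by its TO value (tag "T"); the reducer then groups on the first
>     // key component and joins the two tagged sides.
>     public class SelfJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
>       @Override
>       protected void map(LongWritable offset, Text line, Context ctx)
>           throws IOException, InterruptedException {
>         String[] cols = line.toString().split("\t");   // cols[0] = FROM, cols[1] = TO
>         ctx.write(new Text(cols[0] + "\tF"), line);    // key: FROM value, tagged F
>         ctx.write(new Text(cols[1] + "\tT"), line);    // key: TO value, tagged T
>       }
>     }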
>
> Hope this helps (and kind regards to the Kassel gang ;-),
>
> Christoph
>
> -----Original Message-----
> From: Björn-Elmar Macek [mailto:macek@cs.uni-kassel.de]
> Sent: Wednesday, 29 August 2012 15:54
> To: user@hadoop.apache.org
> Subject: mortbay, huge files and the ulimit
>
> Hi there,
>
> I am currently running a job where I self-join a 63 GB CSV file
> on 20 physically distinct nodes with 15 GB each:
>
> While the mapping works just fine and is low cost, the reducer does the
> main work: it holds a hashmap with elements to join with and finds join
> tuples for every incoming key-value pair.
>
> The job works perfectly on small files of around 2 GB, but starts to
> get "unstable" as the file size goes up: this becomes evident from a
> look into the TaskTracker's logs, which say:
>
> ERROR org.mortbay.log: /mapOutput
> java.lang.IllegalStateException: Committed
>       at org.mortbay.jetty.Response.resetBuffer(Response.java:1023)
>       at org.mortbay.jetty.Response.sendError(Response.java:240)
>       at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3945)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>       at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>       at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>       at org.mortbay.jetty.Server.handle(Server.java:326)
>       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>       at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>       at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>       at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>
>
> And while it is no problem at the beginning of the reduce process, where
> this happens only on a few nodes and rarely, it becomes critical as the
> progress rises. The reason for this (afaik from reading articles) is
> that there are memory or file handle problems. I addressed the memory
> problem by continuously purging the map of outdated elements every 5
> million processed key-value pairs, and I set mapred.child.ulimit to
> 100000000 (ulimit in the shell tells me it is 400000000).
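>
> The purge is basically a counter in the reducer; schematically it looks like the
> sketch below (the actual staleness test depends on the data, so "isOutdated" here
> is only a placeholder):
>
>     import java.util.HashMap;
>     import java.util.Iterator;
>     import java.util.Map;
>
>     // Buffer of potential join partners that is purged every 5 million
>     // processed key-value pairs to keep the reducer's memory bounded.
>     class PurgingJoinBuffer {
>       private final Map<String, String> buffer = new HashMap<String, String>();
>       private long processed = 0;
>
>       void put(String key, String value) {
>         buffer.put(key, value);
>         if (++processed % 5000000L == 0) {
>           Iterator<Map.Entry<String, String>> it = buffer.entrySet().iterator();
>           while (it.hasNext()) {
>             if (isOutdated(it.next())) it.remove();   // drop entries that can no longer match
>           }
>         }
>       }
>
>       private boolean isOutdated(Map.Entry<String, String> e) {
>         return false;  // placeholder: the real test depends on the key order of the job
>       }
>     }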
>
> Anyway, I am still running into those mortbay errors, and I am starting
> to wonder whether Hadoop can manage the job with this algorithm at all.
> By pure naive math it should:
> I explicitly assigned 10 GB of memory to each JVM on each node and set
> mapred.child.java.opts to "-Xmx10240m -XX:+UseCompressedOops
> -XX:-UseGCOverheadLimit" (it's a 64-bit environment, and large data
> structures cause the GC to throw exceptions). This would naively give
> 18 slave machines with 10 GB each, an overall memory of 180 GB - three
> times as much as needed... I would think. So if the Partitioner
> distributes the data roughly equally across all nodes, I should not run
> into any errors, should I?
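>
> For reference, this is how I set those values in the driver (they could just as
> well go into mapred-site.xml; the job name here is made up):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.mapreduce.Job;
>
>     public class JobMemorySettings {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         // 10 GB heap per child JVM plus the 64-bit flags quoted above
>         conf.set("mapred.child.java.opts",
>             "-Xmx10240m -XX:+UseCompressedOops -XX:-UseGCOverheadLimit");
>         conf.set("mapred.child.ulimit", "100000000");  // limit for the child task processes
>         Job job = new Job(conf, "csv-self-join");
>         // ... set mapper, reducer, input/output paths, then job.waitForCompletion(true)
>       }
>     }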
>
> Can anybody help me with this issue?
>
> Best regards,
> Elmar
>
>

