hadoop-common-dev mailing list archives

From Ankur Goel <ankur.g...@corp.aol.com>
Subject Re: Multithreaded reduce
Date Thu, 11 Sep 2008 07:06:30 GMT
This is exactly the case: my threads share the same output collector. I 
don't create multiple instances of the output collector myself; I use 
the one that is received in reduce().
Here is the stack trace:

Exception in thread "pool-2-thread-1" java.lang.RuntimeException: Error while collecting output to HDFS
    at com.aol.urlDB.dbfacade.DiscWriteTask.run(DiscWriteTask.java:75)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /UrlStats/_temporary/_task_200809111042_0004_r_000000_0/URL/USER_ID/USER_ID_0 for DFSClient_task_200809111042_0004_r_000000_0 on client 10.146.163.143 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:1010)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:967)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:269)
    at sun.reflect.GeneratedMethodAccessor195.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2189)
    at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:479)
    at org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:138)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:508)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:408)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:111)
    at org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:46)
    ... 3 more
Exception in thread "pool-2-thread-2" java.lang.RuntimeException: Error while collecting output to HDFS
...
...
...
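
For reference, here is a minimal sketch of the arrangement that was reported 
to work in the original message quoted below: the DB work stays on a 
multi-threaded pool, while every collect() call goes through a single writer 
thread. It does not use Hadoop itself. Collector is a hypothetical stand-in 
for the output collector, uploadToDb() is a placeholder for the MySQL upload, 
and the class and method names are made up for illustration. It assumes 
Java 8 or later for the lambda syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SingleWriterSketch {

    /** Hypothetical stand-in for the output collector; not the Hadoop interface. */
    interface Collector<K, V> {
        void collect(K key, V value) throws Exception;
    }

    /** Hypothetical placeholder for the DB upload; returns the value to emit. */
    static String uploadToDb(String record) {
        return record + ":stored";
    }

    /** Runs DB work on dbThreads threads, but calls out.collect() from one thread only. */
    static void process(List<String> records, Collector<String, String> out, int dbThreads)
            throws Exception {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            List<Future<Void>> writes = new ArrayList<>();
            for (String record : records) {
                // First pool: the DB I/O, which can safely run in parallel.
                Callable<String> dbTask = () -> uploadToDb(record);
                Future<String> dbResult = dbPool.submit(dbTask);

                // Second "pool": exactly one thread, so collect() is never concurrent.
                Callable<Void> writeTask = () -> {
                    out.collect(record, dbResult.get());
                    return null;
                };
                writes.add(writer.submit(writeTask));
            }
            // Wait for every write and surface any failure to the caller.
            for (Future<Void> write : writes) {
                write.get();
            }
        } finally {
            dbPool.shutdown();
            writer.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        process(Arrays.asList("a", "b", "c"),
                (key, value) -> System.out.println(key + " -> " + value),
                4);
    }
}
```

The trade-off is that the writes are serialized, so this only helps when the 
DB upload, not the collect() call, is the slow part.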

Thanks
-Ankur

Lohit wrote:
> I might be wrong, but my guess is this: the exception might be coming from the underlying
> DFS layer. Output creates a file, and in your case there might be multiple create requests. Can
> your threads share the output collector?
>
> Sent from my iPhone
>
> On Sep 8, 2008, at 12:51 AM, "Goel, Ankur" <ankur.goel@corp.aol.com> wrote:
>
> Hi Folks,
>
> I have a setup where I am using a thread-pool
> implementation (provided by the java.util.concurrent package) in the reduce
> phase to do database I/O to speed up the database upload. The DB here is
> MySQL. I decided to go for additional parallelism via threads because:
>
> 1. It considerably speeds up the upload while consuming fewer cluster
> resources (i.e. fewer reducers are required).
>
> 2. The upload speed is not limited by the reduce task capacity of the
> cluster but by the DB's ability to handle max connections simultaneously
> and effectively.
>
>
>
> Each reduce task has 2 thread pools: one that does the DB I/O and whose
> tasks return a java.util.concurrent.FutureTask, and another that fetches
> the result from this future task to do disc I/O, i.e.
> outputCollector.collect(...).
>
>
>
> When multiple threads from the second pool try to do disc I/O, I get
> an AlreadyBeingCreatedException in the logs. If I set the second pool to
> have only 1 thread, then things work fine!
>
>
>
> It looks like the output collector was never designed to be used from
> multiple threads.
>
>
>
> Any thoughts on this?
>
>
>
> Thanks
>
> -Ankur
>
>
>
>