hadoop-common-user mailing list archives

From Håvard Wahl Kongsgård <haavard.kongsga...@gmail.com>
Subject Re: Hadoop scripting when to use dfs -put
Date Wed, 15 Feb 2012 12:13:15 GMT
Sorry for cross-posting again. There is still something strange with
the dfs client and Python. With the very simple code below I get no
errors, but also no output in /tmp/bio_sci/

I could use FUSE, but this issue should be of general interest to
Hadoop/Python users. Can anyone replicate this?

import os

def multi_tree(value):
    # create an empty file in HDFS for each value; all output is discarded
    os.system("hadoop dfs -touchz /tmp/bio_sci/" + str(value)
              + " > /dev/null 2> /dev/null")

def mapper(key, value):
    v = value.split(" ")[0]
    yield multi_tree(v), 1

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
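
For debugging, a variant of multi_tree that does not throw the client's
output away might show what is going on. This is just a sketch:
subprocess.check_call raises CalledProcessError if the hadoop command
exits non-zero, so a failure would at least show up in the task logs:

import subprocess

def multi_tree(value):
    # raises CalledProcessError if "hadoop dfs -touchz" fails,
    # instead of hiding all output and errors in /dev/null
    subprocess.check_call(["hadoop", "dfs", "-touchz",
                           "/tmp/bio_sci/" + str(value)])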

-Håvard


On Tue, Feb 14, 2012 at 3:01 PM, Harsh J <harsh@cloudera.com> wrote:
> For the sake of http://xkcd.com/979/, and since this was cross-posted,
> Håvard managed to solve this specific issue via Joey's response at
> https://groups.google.com/a/cloudera.org/group/cdh-user/msg/c55760868efa32e2
>
> 2012/2/14 Håvard Wahl Kongsgård <haavard.kongsgaard@gmail.com>:
>> My environment heap size varies from 18GB to 2GB;
>> in mapred-site.xml, mapred.child.java.opts = -Xmx512M.
>>
>> System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of Hadoop
>>
>>
>> This is from the task log:
>> Original exception was:
>> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
>>        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
>>        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>>        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:264)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
>>        at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
>>        at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
>>        at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)
>>
>>
>> I don't have a recursive loop, a while loop, or anything like that.
>>
>> Here is my dumbo code. multi_tree() is just a simple function whose
>> error handling is a bare try/except that just does "pass".
>>
>> def mapper(key, value):
>>   v = value.split(" ")[0]
>>   yield multi_tree(v),1
>>
>>
>> if __name__ == "__main__":
>>   import dumbo
>>   dumbo.run(mapper)
>>
>>
>> -Håvard
>>
>>
>> On Mon, Feb 13, 2012 at 8:52 PM, Rohit <rohit@hortonworks.com> wrote:
>>> Hi,
>>>
>>> What threw the heap error? Was it the Java VM, or the shell environment?
>>>
>>> It would be good to look at free RAM on your system before and after you
>>> run the script as well, to see if your system is running low on memory.
>>>
>>> Are you using a recursive loop in your script?
>>>
>>> Thanks,
>>> Rohit
>>>
>>> Rohit Bakhshi
>>> www.hortonworks.com (http://www.hortonworks.com/)
>>>
>>> On Monday, February 13, 2012 at 10:39 AM, Håvard Wahl Kongsgård wrote:
>>>
>>>> Hi, I originally posted this on the dumbo forum, but it's more a
>>>> general Hadoop scripting issue.
>>>>
>>>> When testing a simple script that created some local files and then
>>>> copied them to HDFS with
>>>> os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json"),
>>>> the tasks fail with an out-of-heap-memory error. The files are tiny,
>>>> and I have tried increasing the heap size. When I skip the hadoop dfs
>>>> -put, the tasks do not fail.
>>>>
>>>> Is it wrong to use hadoop dfs -put inside a script that is itself run
>>>> by Hadoop? Should I instead transfer the files at the end with a
>>>> combiner, or simply mount HDFS locally and write to it directly? Any
>>>> general suggestions?
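>>>>
>>>> Roughly, the mapper does something like this (only a sketch; the file
>>>> name and JSON contents are placeholders):
>>>>
>>>> import json, os
>>>>
>>>> def mapper(key, value):
>>>>     local = "/home/havard/bio_sci/file.json"
>>>>     with open(local, "w") as f:
>>>>         json.dump({"key": key, "value": value}, f)  # tiny local file
>>>>     # copy it into HDFS; os.system only returns the shell exit status
>>>>     os.system("hadoop dfs -put " + local + " /tmp/bio_sci/file.json")
>>>>     yield key, 1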
>>>>
>>>>
>>>> --
>>>> Håvard Wahl Kongsgård
>>>> NTNU
>>>>
>>>> http://havard.security-review.net/
>>>
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> NTNU
>>
>> http://havard.security-review.net/
>
>
>
> --
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about



-- 
Håvard Wahl Kongsgård
NTNU

http://havard.security-review.net/
