hadoop-common-user mailing list archives

From Xiong Deng <dbigb...@gmail.com>
Subject Problem with Python + Hadoop: how to link .so outside Python?
Date Thu, 01 Sep 2011 12:01:19 GMT

I have successfully installed scipy for Python 2.7 on my local Linux machine,
and I want to package this Python 2.7 (with scipy) for Hadoop and run my
Python MapReduce scripts like this:

${HADOOP_HOME}/bin/hadoop streaming \
    -input "${input}" \
    -output "${output}" \
    -mapper "python27/bin/python27.sh rp_extractMap.py" \
    -reducer "python27/bin/python27.sh rp_extractReduce.py" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -file rp_extractMap.py \
    -file rp_extractReduce.py \
    -file shitu_conf.py \
    -cacheArchive "/share/python27.tar.gz#python27" \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat \
    -inputformat org.apache.hadoop.mapred.CombineTextInputFormat \
    -jobconf mapred.max.split.size="512000000" \
    -jobconf mapred.job.name="[reserve_price][rp_extract]" \
    -jobconf mapred.job.priority=HIGH \
    -jobconf mapred.job.map.capacity=1000 \
    -jobconf mapred.job.reduce.capacity=200 \
    -jobconf mapred.reduce.tasks=200 \
    -jobconf num.key.fields.for.partition=2

I have to do this because the Hadoop server has its own Python installation,
a very old version that may not support some of my Python scripts, and I do
not have the privileges to install the scipy library on that server. So I
have to use the -cacheArchive option to ship my own Python 2.7 with scipy.

However, I found that some of the .so files in scipy are linked against other
dynamic libraries outside Python 2.7. For example:

$ ldd
        liblapack.so => /usr/local/atlas/lib/liblapack.so
        libatlas.so => /usr/local/atlas/lib/libatlas.so (0x0000002a95df3000)
        libgfortran.so.3 => /home/work/local/gcc-4.6.1/lib64/libgfortran.so.3 (0x0000002a9668d000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a968b6000)
        libgcc_s.so.1 => /home/work/local/gcc-4.6.1/lib64/libgcc_s.so.1
        libquadmath.so.0 => /home/work/local/gcc-4.6.1/lib64/libquadmath.so.0 (0x0000002a96b51000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96c87000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96ebb000)
        /lib64/ld-linux-x86-64.so.2 (0x000000552aaaa000)

So, my question is: how can I include these libraries? Should I search for
all the linked .so and .a files on my local Linux machine and pack them
together with Python 2.7? If so, how can I get a full list of the libraries
needed, and how can I make the packed Python 2.7 know where to find them?
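
One possible starting point, offered only as a rough sketch: parse the text
that ldd prints, keep the libraries that resolve outside the standard system
directories, and put the directory holding the copied libraries first on
LD_LIBRARY_PATH before the interpreter starts. Everything named here is an
assumption and not part of my setup: the SYSTEM_PREFIXES list, the helper
names, and the python27/lib directory are hypothetical, and the parser
assumes each ldd entry sits on a single line.

```python
import os
import re

# Hypothetical list of directories assumed to exist on every cluster node;
# libraries resolved under these prefixes are not bundled.
SYSTEM_PREFIXES = ("/lib/", "/lib64/", "/usr/lib/", "/usr/lib64/")

# Matches ldd lines such as "libm.so.6 => /lib64/tls/libm.so.6 (0x...)".
# Lines without "=>" (the loader itself) or with "not found" do not match.
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)")


def parse_ldd(output):
    """Return {library name: resolved path} from the text printed by ldd."""
    libs = {}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


def libs_to_bundle(libs, system_prefixes=SYSTEM_PREFIXES):
    """Keep only libraries that live outside the assumed system directories."""
    return {name: path for name, path in libs.items()
            if not path.startswith(system_prefixes)}


def env_with_lib_dir(lib_dir, base_env=None):
    """Copy an environment and put lib_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    old = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = lib_dir if not old else lib_dir + ":" + old
    return env
```

Running parse_ldd over the ldd output of every .so in the packed tree and
merging the results through libs_to_bundle would give a candidate list of
files to copy into the archive. The same LD_LIBRARY_PATH setting could
instead be exported by the python27.sh wrapper, since the dynamic linker
reads that variable when the process starts.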

