hadoop-user mailing list archives

From Andrei <faithlessfri...@gmail.com>
Subject Re: How to import custom Python module in MapReduce job?
Date Mon, 12 Aug 2013 10:01:33 GMT
Hi Binglin,

thanks for your explanation, now it makes sense. However, I'm not sure how
to implement the suggested method.

First of all, I found out that the `-cacheArchive` option is deprecated, so I
had to use `-archives` instead. I put my `lib.py` into a directory `lib` and
then zipped it to `lib.zip`. After that I uploaded the archive to HDFS and
linked it in the call to the Streaming API as follows:

  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
      -files main.py
      -archives hdfs://hdfs-namenode/user/me/lib.jar
      -mapper "./main.py map" -reducer "./main.py reduce"
      -combiner "./main.py combine"
      -input input -output output

But the script failed, and from the logs I can see that lib.jar hasn't been
unpacked. What am I missing?
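For reference, a minimal sketch of what the archive route relies on (the file name and module contents below are illustrative, not from the thread): Python's zipimport machinery can load a module directly from a zip file placed on `sys.path`, provided the `.py` file sits at the root of the archive. Note also that `-archives` accepts a URI fragment (e.g. `.../lib.zip#lib`) naming the symlink created in the task's working directory.

```python
import os
import sys
import tempfile
import zipfile

# Assumed layout: lib.py at the root of lib.zip. We build a throwaway
# archive here to stand in for the one shipped via -archives.
workdir = tempfile.mkdtemp()
zpath = os.path.join(workdir, "lib.zip")
with zipfile.ZipFile(zpath, "w") as z:
    z.writestr("lib.py", "def greet():\n    return 'from lib.zip'\n")

# Putting the zip itself on sys.path makes its root importable.
sys.path.insert(0, zpath)
import lib

print(lib.greet())  # -> from lib.zip
```

If `lib.py` is nested inside a `lib/` folder within the zip (as when zipping a directory), the path added to `sys.path` must point at that inner folder instead of the zip root.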

On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang <decstery@gmail.com> wrote:

> Hi,
> The problem seems to be caused by symlinks: Hadoop uses a file cache, so
> every file is in fact a symlink.
> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
> [root@master01 tmp]# ./main.py
> Traceback (most recent call last):
>   File "./main.py", line 3, in ?
>     import lib
> ImportError: No module named lib
> This should be a Python bug: when importing, it can't handle symlinks.
> You can try putting lib.py in a directory and using -cacheArchive,
> so the symlink actually points to a directory; Python may handle that
> case well.
> Thanks,
> Binglin
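The suggestion above can be sketched as follows (a minimal illustration with made-up names, not the thread's actual cluster layout): imports do resolve through a symlink that points at a directory containing the module.

```python
import os
import sys
import tempfile

# Emulate what -cacheArchive would produce: a real directory holding
# lib.py, reached through a symlink in the task's working directory.
base = tempfile.mkdtemp()
real_dir = os.path.join(base, "real_lib_dir")
os.mkdir(real_dir)
with open(os.path.join(real_dir, "lib.py"), "w") as f:
    f.write("def hello():\n    return 'hello from lib'\n")

link = os.path.join(base, "lib_link")
os.symlink(real_dir, link)  # symlink -> directory, not -> file

sys.path.insert(0, link)    # import resolves through the dir symlink
import lib

print(lib.hello())  # -> hello from lib
```

This is POSIX-only (it needs `os.symlink`), but it shows why a directory-valued symlink sidesteps the per-file symlink issue described above.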
> On Mon, Aug 12, 2013 at 2:50 PM, Andrei <faithlessfriend@gmail.com> wrote:
>> (cross-posted from StackOverflow:
>> http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208 )
>> I have a MapReduce job defined in file main.py, which imports module
>> lib from file lib.py. I use Hadoop Streaming to submit this job to the
>> Hadoop cluster as follows:
>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>     -files lib.py,main.py
>>     -mapper "./main.py map" -reducer "./main.py reduce"
>>     -input input -output output
>> In my understanding, this should put both main.py and lib.py into the
>> distributed cache folder on each compute node and thus make module lib
>> available to main. But that doesn't happen: from the log file I see that
>> the files really are copied to the same directory, yet main can't import
>> lib, throwing an ImportError.
>> Adding the current directory to the path didn't work:
>> import os
>> import sys
>> sys.path.append(os.path.realpath(__file__))
>> import lib  # ImportError
>> though loading the module manually did the trick:
>> import imp
>> lib = imp.load_source('lib', 'lib.py')
>> But that's not what I want. So why can the Python interpreter see other
>> .py files in the same directory but not import them? Note that I have
>> already tried adding an empty __init__.py file to the same directory,
>> with no effect.
