From Karl Anderson <...@monkey.org>
Subject shipping streaming libraries with cacheArchive
Date Wed, 08 Oct 2008 22:15:46 GMT
Has anybody been able to ship a Hadoop streaming library using
cacheArchive?  I can see my unjarred archive from my mapper,
but I can't import the Python files inside it.

As a test, I'm jarring up a test directory and putting it on the HDFS:

   [root@domU-12-31-39-00-64-E2 ~]# ls jar_test
   __init__.py  __init__.pyc  bar.py  foo.py  foo.pyc
   [root@domU-12-31-39-00-64-E2 ~]# jar cvf jar_test.jar -C jar_test .
   [root@domU-12-31-39-00-64-E2 ~]# hadoop dfs -put jar_test.jar  
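
Because "-C jar_test ." stores the files at the root of the archive rather than under a jar_test/ prefix, the package name comes only from whatever directory name the archive is unpacked or linked under. A rough way to confirm that layout is sketched below; it relies on a jar being an ordinary zip file, and the helper names top_level_entries and has_root_init are made up for illustration.

```python
import zipfile


def top_level_entries(archive_path):
    """Return the sorted first path components stored in a zip or jar file."""
    archive = zipfile.ZipFile(archive_path)
    try:
        names = archive.namelist()
    finally:
        archive.close()
    # "META-INF/MANIFEST.MF" -> "META-INF", "foo.py" -> "foo.py"
    return sorted(set(name.split("/")[0] for name in names if name))


def has_root_init(archive_path):
    """True if __init__.py sits at the archive root, not in a subdirectory."""
    return "__init__.py" in top_level_entries(archive_path)
```

If has_root_init("jar_test.jar") is true, the extracted directory itself is the package, so it has to be reachable under the name jar_test from some directory on sys.path.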

My test module is importable.

   [root@domU-12-31-39-00-64-E2 ~]# python
   Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
   [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import jar_test.foo

I include "-cacheArchive hdfs:///user/root/jar_test.jar#jar_test" in
my Hadoop streaming invocation.

My mapper is able to read the linked, extracted jar_test directory.
It prints "['foo.py', '.jar_test.jar.crc', 'jar_test.jar',
'__init__.py', 'META-INF', 'bar.py']" to the mapper output.

   #!/usr/bin/env python

   import sys
   import os

   #import jar_test.foo

   if __name__ == "__main__":
       for line in sys.stdin:
           print os.listdir('jar_test')

However, when I uncomment the import line, my mapper dies with
"ImportError: No module named jar_test.foo".
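
A diagnostic along the following lines might help narrow this down. It is only a sketch, and it assumes (unverified) that the failure is a search-path problem: Python puts the directory of the script, computed after following symlinks, at sys.path[0], and that directory is not necessarily the task's working directory where the jar_test link lives. The helper names describe_import_context and ensure_on_path are hypothetical.

```python
import os
import sys


def describe_import_context(package_dir):
    """Return text lines describing where Python would look for package_dir."""
    cwd = os.getcwd()
    init_file = os.path.join(package_dir, "__init__.py")
    return [
        "cwd: %s" % cwd,
        "start of sys.path: %r" % (sys.path[:1],),
        "cwd on sys.path: %s" % (cwd in sys.path or "" in sys.path),
        "%s is a directory: %s" % (package_dir, os.path.isdir(package_dir)),
        "__init__.py present: %s" % os.path.isfile(init_file),
    ]


def ensure_on_path(directory):
    """Put directory at the front of sys.path unless it is already listed.

    Returns True if sys.path was changed, False otherwise.
    """
    directory = os.path.abspath(directory)
    if directory in sys.path:
        return False
    sys.path.insert(0, directory)
    return True
```

In the mapper, the lines from describe_import_context('jar_test') could be written to sys.stderr so they land in the task logs, and ensure_on_path(os.getcwd()) could be called just before the import to see whether the ImportError goes away.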

Any clues?

Karl Anderson

