hadoop-common-user mailing list archives

From lohit <lohit...@yahoo.com>
Subject Re: Global Variables via DFS
Date Wed, 25 Jun 2008 16:43:22 GMT
As Steve mentioned, you can open an HDFS file from within your map/reduce task.
Also, instead of using DistributedFileSystem directly, you would actually use FileSystem. This is
what I do:

FileSystem fs = FileSystem.get(new Configuration());
FSDataInputStream file = fs.open(new Path("/user/foo/jambajuice"));
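Putting that together, a map task can pull the whole ~20kB file into a byte array once, for example from its configure() method. A minimal sketch (the class name, helper name, and path are placeholders, not from this thread):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class GlobalBytes {

    // Read a small shared HDFS file fully into memory.  Only sensible
    // for small files (the poster's is ~20kB), since the whole thing
    // is buffered in one byte array.
    public static byte[] readGlobalBytes(Configuration conf, String hdfsPath)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);
        int len = (int) fs.getFileStatus(path).getLen();
        byte[] buffer = new byte[len];
        FSDataInputStream in = fs.open(path);
        try {
            // positioned read: fill the buffer starting at offset 0
            in.readFully(0, buffer);
        } finally {
            IOUtils.closeStream(in);
        }
        return buffer;
    }
}
```

Each task would call readGlobalBytes(job, "/user/foo/jambajuice") once and keep the result in a field, rather than re-opening the file per record.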

----- Original Message ----
From: Steve Loughran <stevel@apache.org>
To: core-user@hadoop.apache.org
Sent: Wednesday, June 25, 2008 9:15:55 AM
Subject: Re: Global Variables via DFS

javaxtreme wrote:
> Hello all,
> I am having a bit of a problem with a seemingly simple problem. I would like
> to have some global variable which is a byte array that all of my map tasks
> have access to. The best way that I currently know of to do this is to have
> a file sitting on the DFS and load that into each map task (note: the global
> variable is very small ~20kB). My problem is that I can't seem to load any
> file from the Hadoop DFS into my program via the API. I know that the
> DistributedFileSystem class has to come into play, but for the life of me I
> can't get it to work. 
> I noticed there is an initialize() method within the DistributedFileSystem
> class, and I thought that I would need to call that, however I'm unsure what
> the URI parameter ought to be. I tried "localhost:50070" which stalled the
> system and threw a connectionTimeout error. I went on to just attempt to
> call DistributedFileSystem.open() but again my program failed this time with
> a NullPointerException. I'm assuming that is stemming from the fact that my
> DFS object is not "initialized".
> Does anyone have any information on how exactly one programmatically goes
> about loading in a file from the DFS? I would greatly appreciate any help.

If the data changes, this sounds more like the kind of data that a 
distributed hash table or tuple space should be looking after...sharing 
facts between nodes

1. what is the rate of change of the data?
2. what are your requirements for consistency?

If the data is static, then yes, a shared file works.  Here's my code 
fragments to work with one. You grab the URI from the configuration, 
then initialise the DFS with both the URI and the configuration.

     public static DistributedFileSystem
         createFileSystem(ManagedConfiguration conf)
             throws SmartFrogRuntimeException {
         // Grab the filesystem URL from the configuration; the exact
         // property key is an assumption here (fs.default.name is the
         // stock Hadoop one).
         String filesystemURL = conf.get("fs.default.name");
         URI uri = null;
         try {
             uri = new URI(filesystemURL);
         } catch (URISyntaxException e) {
             throw (SmartFrogRuntimeException) SmartFrogRuntimeException
                     .forward(ERROR_INVALID_FILESYSTEM_URI + filesystemURL, e);
         }
         DistributedFileSystem dfs = new DistributedFileSystem();
         try {
             dfs.initialize(uri, conf);
         } catch (IOException e) {
             throw (SmartFrogRuntimeException) SmartFrogRuntimeException
                     .forward(ERROR_FAILED_TO_INITIALISE_FILESYSTEM, e);
         }
         return dfs;
     }
As to what URLs work, try "localhost:9000"; this works on machines 
where I've brought a DFS up on that port. Use netstat to verify your 
chosen port is live.
