hadoop-common-user mailing list archives

From "Michel Tourn" <mic...@yahoo-inc.com>
Subject RE: URIs and hadoop
Date Thu, 31 Aug 2006 18:49:55 GMT
On URIs:

I had to learn more about URIs while looking at WebDAV code...
I am starting to like them.

Below, the scheme "file:" is for local files.
Hadoop Paths would use the scheme "hdfs:".
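
As a rough illustration of how these forms unify, here is a sketch using plain java.net.URI (nothing Hadoop-specific; the dispatch and messages are made up for the example, and "hdfs:" / "socket:" are just the naming conventions from this thread):

```java
import java.net.URI;

public class SchemeDemo {
    // Dispatch on the URI scheme, as the thread suggests for
    // file: vs. hdfs: vs. socket:// endpoints.
    static String describe(String s) {
        URI uri = URI.create(s);
        switch (uri.getScheme()) {
            case "file":   return "local file at " + uri.getPath();
            case "hdfs":   return "HDFS path " + uri.getPath();
            case "socket": return "socket endpoint " + uri.getHost() + ":" + uri.getPort();
            default:       return "unknown scheme: " + uri.getScheme();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("file:/unix/tmp/out"));
        System.out.println(describe("hdfs:/user/data/part-00000"));
        System.out.println(describe("socket://localhost:123"));
    }
}
```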

Some developers like named pipes.
You can write to an existing named pipe from Java.
But this is not well supported in Java on Windows
(Cygwin named pipes only work between Cygwin applications).

So I also added support for a socket endpoint.
To connect the two ends:
run nc -l -p 123 and pass -mapsideoutput socket://localhost:123
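
The nc listener and the socket:// output can be mimicked in plain Java. A self-contained sketch (a ServerSocket stands in for `nc -l`, a client socket stands in for the job's map-side output; the key/value lines are invented for the illustration, and none of this is actual Hadoop code):

```java
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class SocketEndpointDemo {
    // Send records over a local TCP connection and collect what the
    // listener receives. Port 0 asks the OS for any free port.
    static List<String> roundTrip(List<String> records) throws Exception {
        List<String> received = new ArrayList<>();
        try (ServerSocket server = new ServerSocket(0)) {   // the "nc -l" side
            Thread listener = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream(), "UTF-8"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        received.add(line);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            listener.start();

            // The "-mapsideoutput socket://localhost:<port>" side.
            try (Socket sock = new Socket("localhost", server.getLocalPort());
                 Writer out = new BufferedWriter(
                     new OutputStreamWriter(sock.getOutputStream(), "UTF-8"))) {
                for (String r : records) {
                    out.write(r + "\n");
                }
            }
            listener.join();
        }
        return received;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(java.util.Arrays.asList("key1\tvalue1", "key2\tvalue2")));
    }
}
```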

All these variations unify well using standard URI syntax.

The reason you may want to use a socket or named pipe as
your map output: to do a huge streaming computation,
get all your part-0000k files out of HDFS and
process them on-the-fly, in global order,
"from the comfort of your home".
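
A local stand-in for that pattern might look like the sketch below: consume numbered part files in sequence, stopping at the first missing index. The directory layout and naming are illustrative; in the thread the parts would stream out of HDFS rather than a local directory.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlobalOrderDemo {
    // Read part-00000, part-00001, ... in order; the first gap in the
    // numbering ends the sequence, so lines come back in global order.
    static List<String> readInOrder(Path dir) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int k = 0; ; k++) {
            Path part = dir.resolve(String.format("part-%05d", k));
            if (!Files.exists(part)) break;
            lines.addAll(Files.readAllLines(part));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Fabricate a couple of part files to demonstrate the traversal.
        Path dir = Files.createTempDirectory("parts");
        Files.write(dir.resolve("part-00000"), Arrays.asList("a", "b"));
        Files.write(dir.resolve("part-00001"), Arrays.asList("c"));
        readInOrder(dir).forEach(System.out::println);
    }
}
```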

-Michel

-----------
With hadoopStreaming syntax:

  -input "+/in/part-00000 | /in/part-00001 | .. "

To specify a single side-effect output file:
  
  -mapsideoutput [file:/C:/win|file:/unix|socket://host:port]

  If the jobtracker is local, it makes sense to use a local file.
  This currently requires -reducer NONE and num.map.tasks=1.


> -----Original Message-----
> From: Eric Baldeschwieler [mailto:eric14@yahoo-inc.com]
> Sent: Thursday, August 31, 2006 11:19 AM
> To: hadoop-user@lucene.apache.org
> Cc: Owen O'Malley
> Subject: Re: MapReduce: specify a *DFS* path for mapred.jar property
> 
> Interesting thread.
> 
> This relates to HADOOP-288.
> 
> Also the thread I started last week on using URLs in general for
> input arguments.  Seems like we should just take a URL for the jar,
> which could be file: or hdfs:
> 
> Thoughts?
> 
> On Aug 31, 2006, at 10:54 AM, Doug Cutting wrote:
> 
> > Frédéric Bertin wrote:
> >>> This should run clientside, since it depends on the username,
> >>> which is different on the server.
> >> then, what about passing the username as a parameter to the
> >> JobSubmissionProtocol.submitJob(...) ? This avoids loading the
> >> whole JobConf clientside just to set the username.
> >
> > That sounds like a reasonable change to me.
> >
> >>>> Why not move it into the JobSubmissionProtocol (JobTracker's
> >>>> submitJob method)?
> >>>
> >>> These could probably run on the server.  They're currently run on
> >>> the client in an attempt to return errors as quickly as possible
> >>> when jobs are misconfigured.
> >> Is it really quicker to make all those checks remotely than
> >> to ask the JobTracker to make them locally? (just a
> >> question, I really have no idea of the answer)
> >
> > We'd need to be careful that this is not a synchronized method on
> > the server, so it doesn't interfere with other server activities.
> > Also, checking the input and output has to be much faster than the
> > RPC timeout, which it should be, since this just checks for the
> > existence of directories, not of individual files.
> >
> > Doug
> 


