hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: the question about the common pc?
Date Mon, 23 Feb 2009 11:14:04 GMT
Tim Wintle wrote:
> On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
>> I've been doing MapReduce work over small in-memory datasets 
>> using Erlang,  which works very well in such a context.
> 
> I've got some (mainly python) scripts (that will probably be run with
> hadoop streaming eventually) that I run over multiple cpus/cores on a
> single machine by opening the appropriate number of named pipes and
> using tee and awk to split the workload
> 
> something like
> 
>   mkfifo mypipe1
>   mkfifo mypipe2
>   awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1&
>   awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2&
>   ./get_lots_of_data | tee mypipe1 > mypipe2
> 
> (wait until it's done... or send a signal from the "get_lots_of_data"
> process on completion if it's a cronjob)
> 
>   sort -m map_out* | ./reducer > reduce_out
> 
> works around the global interpreter lock in python quite nicely and
> doesn't need people that write the scripts (who may not be programmers)
> to understand multiple processes etc, just stdin and stdout.
> 

Dumbo provides Python support under Hadoop:
  http://wiki.github.com/klbostee/dumbo
  https://issues.apache.org/jira/browse/HADOOP-4304

As well as that, given that Hadoop requires Java 1.6+, there's no reason why
it couldn't support the javax.script engine: JavaScript would work
without extra JAR files, and Groovy and Jython once their JARs were on
the classpath. Some work would probably be needed to make it easier to
use these languages, and then there are the tests...
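
For illustration only, here is a minimal sketch of what the javax.script route
could look like. It is not anything Hadoop ships: the class name
ScriptedMapSketch and the helper mapLine are made up for this example, and the
script is just a stand-in for a per-line map function. It looks up an engine
by name, binds one input line to a variable, and evaluates the script. Whether
a "JavaScript" engine is found depends on the JDK in use, so the lookup is
null-checked rather than assumed.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical example class; not part of Hadoop.
public class ScriptedMapSketch {

    /**
     * Evaluates `script` with the variable `line` bound to one input line,
     * using the javax.script engine registered under `engineName`.
     * Returns null when no such engine is available on this JVM.
     */
    static Object mapLine(String engineName, String script, String line) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return null; // engine not bundled and its JAR is not on the classpath
        }
        engine.put("line", line);
        try {
            return engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("script failed: " + script, e);
        }
    }

    public static void main(String[] args) {
        // List whatever engines this JVM can see (Groovy, Jython, etc. show up
        // here only once their JARs are on the classpath).
        for (ScriptEngineFactory f : new ScriptEngineManager().getEngineFactories()) {
            System.out.println(f.getEngineName() + " -> " + f.getNames());
        }

        // Stand-in "map" step: upper-case one line of input.
        Object mapped = mapLine("JavaScript", "String(line).toUpperCase()", "hello hadoop");
        System.out.println(mapped == null
                ? "no JavaScript engine available on this JVM"
                : "mapped: " + mapped);
    }
}
```

Swapping the engine name for "groovy" or "python" would pick up Groovy or
Jython in the same way, provided their script-engine JARs are on the
classpath; wiring this into a real Mapper is the part that would need the
extra work and tests.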
