hadoop-common-user mailing list archives

From "Patterson, Josh" <jpatters...@tva.gov>
Subject RE: RecordReader design heuristic
Date Wed, 18 Mar 2009 17:00:27 GMT
Yeah, the mapper sitting on a dfs block is pretty cool.

Also, yes, we are about to start crunching on a lot of energy smart grid
data. TVA is sorta like "Switzerland" for smart grid power generation
and transmission data across the nation. Right now we have about 12TB,
and this is slated to grow to around 30TB by the end of 2010 (possibly
more, depending on how many more PMUs come online). I am very interested
in Mahout and have read up on it; it has many algorithms that I am
familiar with from grad school. I will be doing some very simple MR jobs
early on like finding the average frequency for a range of data, and
I've been selling various groups internally on what CAN be done with
good data mining and tools like Hadoop/Mahout. Our production cluster
won't be online for a few more weeks, but that part is already rolling,
so I've moved on to designing the first jobs to find quality
"results/benefits" that I can "sell" in order to campaign for the more
ambitious projects I have drawn up. I know time series data lends itself
to many machine learning applications, so, yes, I would be very
interested in talking with anyone who wants to share notes on Hadoop and
machine learning. I believe Mahout can be a tremendous resource for us,
and I definitely plan on running and contributing to it.
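The "average frequency for a range of data" job mentioned above is the classic sum-and-count MapReduce pattern. Here is a minimal sketch of the map and reduce logic, written as plain Java rather than against the Hadoop API so it stands alone; the "timestamp,frequency" record layout and the per-hour bucketing are assumptions, since the actual PMU format isn't described in the thread:

```java
import java.util.*;

public class AverageFrequencySketch {
    // Map step: parse one PMU record (assumed layout: "timestamp,frequency")
    // and emit (hourBucket, frequency). In a real Hadoop job this logic would
    // live in Mapper.map() and emit via the output collector.
    static Map.Entry<String, Double> map(String record) {
        String[] fields = record.split(",");
        long ts = Long.parseLong(fields[0]);
        double freq = Double.parseDouble(fields[1]);
        String hourBucket = Long.toString(ts / 3600);
        return new AbstractMap.SimpleEntry<>(hourBucket, freq);
    }

    // Reduce step: average all frequency values grouped under one key.
    // In a real job this would live in Reducer.reduce().
    static double reduce(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    // Driver that simulates the shuffle: group mapped pairs by key,
    // then reduce each group to its average.
    static Map<String, Double> run(List<String> records) {
        Map<String, List<Double>> grouped = new HashMap<>();
        for (String r : records) {
            Map.Entry<String, Double> kv = map(r);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        Map<String, Double> averages = new HashMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet())
            averages.put(e.getKey(), reduce(e.getValue()));
        return averages;
    }
}
```

The map and reduce bodies port directly into Hadoop's Mapper and Reducer classes; only the grouping in run() would be replaced by the framework's shuffle.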

Josh Patterson

-----Original Message-----
From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
Sent: Wednesday, March 18, 2009 12:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,
It seemed like you had a conceptual wire crossed and I'm glad to help 
out. The neat thing about Hadoop mappers is - since they are given a 
replicated HDFS block to munch on - the job scheduler has <replication 
factor> node choices for where to run each mapper. This means mappers 
can almost always read their input from local storage.
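That scheduling preference can be illustrated concretely: each input split reports the hosts holding a replica of its block (what InputSplit.getLocations() returns in Hadoop), and the scheduler favors one of those hosts. A toy sketch of that choice, with made-up hostnames and a deliberately simplified tie-break rule; Hadoop's real scheduler is considerably more involved:

```java
import java.util.*;

public class LocalityChoiceSketch {
    // Given the hosts that hold replicas of a split's block and the nodes
    // that currently have a free map slot, prefer a node that already has
    // the data locally; fall back to a remote read only when none is free.
    static String chooseNode(List<String> replicaHosts, Set<String> freeNodes) {
        for (String host : replicaHosts) {
            if (freeNodes.contains(host)) return host; // data-local read
        }
        return freeNodes.iterator().next(); // remote (rack or off-rack) read
    }
}
```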

On another note, I notice you are processing what looks to be large 
quantities of vector data. If you have any interest in clustering this 
data you might want to look at the Mahout project 
(http://lucene.apache.org/mahout/). We have a number of Hadoop-ready 
clustering algorithms, including a new non-parametric Dirichlet Process 
Clustering implementation that I committed recently. We are pulling it 
all together for a 0.1 release, and I would be very interested in 
helping you apply these algorithms if you have an interest.


Patterson, Josh wrote:
> Jeff,
> ok, that makes more sense. I was under the mistaken impression that it
> was creating and destroying mappers for each input record. I don't know
> why I had that in my head. My design suddenly became a lot clearer, and
> this provides a much cleaner abstraction. Thanks for your help!
> Josh Patterson
