Subject: Re: RecordReader design heuristic
From: Tom White <tom@cloudera.com>
To: core-user@hadoop.apache.org
Date: Wed, 18 Mar 2009 17:47:23 +0000

Josh,

Sounds like your file format is similar to Hadoop's SequenceFile (although
SequenceFile keeps its metadata at the head of the file), so it might be
worth considering that format going forward. SequenceFile also supports
block compression, which can be useful (and the files can still be split).

The approach you describe for splitting your existing files sounds good.

Tom

On Wed, Mar 18, 2009 at 5:39 PM, Patterson, Josh wrote:
> Hi Tom,
> Yeah, I'm assuming the splits are going to be about a single dfs block
> size (64M here). Each file I'm working with is around 1.5GB in size, and
> has a sort of File Allocation Table at the very end which tells you the
> block sizes inside the file, and then some other info. Once I pull that
> info out of the tail end of the file, I can calculate which "internal
> blocks" lie inside the split's byte range, pull those out and push the
> individual data points up to the mapper, as well as deal with any block
> that falls over the split range (right now I'm assuming I'll use the
> same idea as the line-oriented reader, and just read all blocks that
> fall over the end point of the split, unless it's the first split
> section). I guess the only hit I'm going to take here is having to ask
> the dfs for a quick read into the last 16 bytes of the whole file, where
> my file info is stored. Splitting this file format doesn't seem to be so
> bad; it's just a matter of finding which multiples of the "internal file
> block" size fit inside the split range, and getting that multiple factor
> beforehand.
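As a rough sketch of the tail read and split arithmetic described above,
using Hadoop's FileSystem and FSDataInputStream APIs. The 16-byte trailer
layout assumed here (two longs: internal block size and internal block
count) and the class name are made up for illustration; they are not the
actual TVA format.

// Sketch: read a hypothetical 16-byte trailer from the end of an HDFS file
// and work out which internal blocks start inside a split's byte range.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TailReadSketch {
  public static void main(String[] args) throws IOException {
    Path file = new Path(args[0]);
    long splitStart = Long.parseLong(args[1]);
    long splitLength = Long.parseLong(args[2]);

    Configuration conf = new Configuration();
    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);
    long fileLen = status.getLen();

    FSDataInputStream in = fs.open(file);
    try {
      // Single seek into the last 16 bytes of the file for the metadata.
      in.seek(fileLen - 16);
      long internalBlockSize = in.readLong();   // assumed trailer field
      long internalBlockCount = in.readLong();  // assumed trailer field

      // First internal block whose start falls at or after splitStart,
      // and the last internal block that starts before the split's end
      // (records crossing the end boundary get read past the split).
      long firstBlock =
          (splitStart + internalBlockSize - 1) / internalBlockSize;
      long lastBlock = Math.min(internalBlockCount - 1,
          (splitStart + splitLength - 1) / internalBlockSize);

      System.out.println("internal blocks " + firstBlock + ".." + lastBlock);
    } finally {
      in.close();
    }
  }
}

Doing this read once up front (for example, when the InputFormat computes
splits) rather than in every RecordReader would avoid repeating the extra
round trip for each map task.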
>
> After I get some mechanics of the process down, and I show the team some
> valid results, I may be able to talk them into going to another format
> that works better with MR. If anyone has any ideas on which file formats
> work best for storing and processing large amounts of time series points
> with MR, I'm all ears. We're moving towards a new philosophy wrt big
> data, so it's a good time for us to examine best practices going forward.
>
> Josh Patterson
> TVA
>
> -----Original Message-----
> From: Tom White [mailto:tom@cloudera.com]
> Sent: Wednesday, March 18, 2009 1:21 PM
> To: core-user@hadoop.apache.org
> Subject: Re: RecordReader design heuristic
>
> Hi Josh,
>
> The other aspect to think about when writing your own record reader is
> input splits. As Jeff mentioned, you really want mappers to be
> processing about one HDFS block's worth of data. If your inputs are
> significantly smaller, the overhead of creating mappers will be high
> and your jobs will be inefficient. On the other hand, if your inputs
> are significantly larger, then you need to split them, otherwise each
> mapper will take a very long time processing each split. Some file
> formats are inherently splittable, meaning you can re-align with
> record boundaries from an arbitrary point in the file. Examples
> include line-oriented text (split at newlines) and bzip2 (which has a
> unique block marker). If your format is splittable, then you will be
> able to take advantage of this to make MR processing more efficient.
>
> Cheers,
> Tom
>
> On Wed, Mar 18, 2009 at 5:00 PM, Patterson, Josh wrote:
>> Jeff,
>> Yeah, the mapper sitting on a dfs block is pretty cool.
>>
>> Also, yes, we are about to start crunching on a lot of energy smart
>> grid data. TVA is sorta like "Switzerland" for smart grid power
>> generation and transmission data across the nation. Right now we have
>> about 12TB, and this is slated to be around 30TB by the end of 2010
>> (possibly more, depending on how many more PMUs come online). I am
>> very interested in Mahout and have read up on it; it has many
>> algorithms that I am familiar with from grad school. I will be doing
>> some very simple MR jobs early on, like finding the average frequency
>> for a range of data, and I've been selling various groups internally
>> on what CAN be done with good data mining and tools like
>> Hadoop/Mahout. Our production cluster won't be online for a few more
>> weeks, but that part is already rolling, so I've moved on to focus on
>> designing the first jobs to find quality "results/benefits" that I can
>> "sell" in order to campaign for more ambitious projects I have drawn
>> up. I know time series data lends itself to many machine learning
>> applications, so, yes, I would be very interested in talking to anyone
>> who wants to talk or share notes on Hadoop and machine learning. I
>> believe Mahout can be a tremendous resource for us and definitely plan
>> on running and contributing to it.
>>
>> Josh Patterson
>> TVA
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:jdog@windwardsolutions.com]
>> Sent: Wednesday, March 18, 2009 12:02 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: RecordReader design heuristic
>>
>> Hi Josh,
>> It seemed like you had a conceptual wire crossed and I'm glad to help
>> out. The neat thing about Hadoop mappers is - since they are given a
>> replicated HDFS block to munch on - the job scheduler has <replication
>> factor> number of node choices where it can run each mapper. This
>> means mappers are always reading from local storage.
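For anyone curious where those node choices come from, a minimal sketch
that lists the replica locations of each HDFS block of a file with
FileSystem.getFileBlockLocations; the class name is made up for the
example.

// Sketch: print the datanodes holding replicas of each HDFS block.
// These hosts are the candidates for scheduling a data-local map task.
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHostsSketch {
  public static void main(String[] args) throws IOException {
    Path file = new Path(args[0]);
    Configuration conf = new Configuration();
    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per HDFS block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " len " + block.getLength()
          + " hosts " + Arrays.toString(block.getHosts()));
    }
  }
}

With the default replication factor of 3, each block typically shows three
candidate hosts.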
>>
>> On another note, I notice you are processing what looks to be large
>> quantities of vector data. If you have any interest in clustering this
>> data, you might want to look at the Mahout project
>> (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready
>> clustering algorithms, including a new non-parametric Dirichlet
>> Process Clustering implementation that I committed recently. We are
>> pulling it all together for a 0.1 release, and I would be very
>> interested in helping you to apply these algorithms if you have an
>> interest.
>>
>> Jeff
>>
>>
>> Patterson, Josh wrote:
>>> Jeff,
>>> OK, that makes more sense. I was under the mis-impression that it was
>>> creating and destroying mappers for each input record. I don't know
>>> why I had that in my head. My design suddenly became a lot clearer,
>>> and this provides a much cleaner abstraction. Thanks for your help!
>>>
>>> Josh Patterson
>>> TVA
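As a footnote on the lifecycle Josh describes, a minimal sketch using the
org.apache.hadoop.mapred API of the day: the framework creates one mapper
task per input split, calls configure() once, calls map() once per record
handed up by the RecordReader, and calls close() once at the end. The
class name and the comma-separated record layout are made up for the
example.

// Sketch of the mapper lifecycle: one task per split, not one per record.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FrequencyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  private long records = 0;

  public void configure(JobConf job) {
    // Called once, when the task starts working on its split.
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter)
      throws IOException {
    // Called once per record delivered by the RecordReader.
    records++;
    String[] fields = value.toString().split(",");  // assumed record layout
    output.collect(new Text(fields[0]),
        new DoubleWritable(Double.parseDouble(fields[1])));
  }

  public void close() throws IOException {
    // Called once, after the last record of the split.
    System.err.println("processed " + records + " records for this split");
  }
}

configure() and close() are the natural places for per-split setup and
teardown, which is why per-record creation would be wasteful.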