hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: custom implementation of InputFormat/RecordReader/InputSplit?
Date Mon, 19 Nov 2007 17:10:54 GMT
On Mon, Nov 19, 2007 at 06:43:40PM +0200, Eugeny N Dzhurinsky wrote:
>Hello, gentlemen!
>I would like to implement a custom data provider which will create records
>to start map jobs with. For example, I would like to create a thread which
>will extract some data from a storage (e.g. a relational database) and start a
>new job, which will take a single record and start map/reduce processing. Each
>such record will produce a lot of results, which will be processed by the
>reduce task later.
>The question is - how to implement such interfaces? As far as I learned, I
>would need to implement the interfaces InputSplit, RecordReader and
>InputFormat. However, after looking at the sources and javadocs, I found that all
>operations seem to be file-based, and such a file could be split between
>several hosts, which isn't my case. I would deal with a single stream that I need to
>parse before starting a job.

Yes Eugeny, you are on the right track. As you have noted, you would need custom InputFormat/InputSplit/RecordReader
implementations, and we currently have only file-based ones in our source tree.

One option is clearly to pre-fetch the records from the database and store them on HDFS or any
other filesystem (e.g. the local filesystem).
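
For that pre-fetch route, a minimal sketch could look like the following. This assumes a JDBC source; the JDBC URL, credentials, table and column names are placeholders I made up for illustration:

import java.sql.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Dumps rows from a relational table into a SequenceFile on HDFS so that an
 * ordinary SequenceFileInputFormat-based job can process them afterwards.
 * The JDBC URL, table and column names are illustrative placeholders.
 */
public class DbToHdfsDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);                 // e.g. /user/me/records.seq

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, LongWritable.class, Text.class);
    Connection conn =
        DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "password");
    try {
      ResultSet rs = conn.createStatement().executeQuery("SELECT id, payload FROM records");
      LongWritable key = new LongWritable();
      Text value = new Text();
      while (rs.next()) {                         // one SequenceFile record per row
        key.set(rs.getLong(1));
        value.set(rs.getString(2));
        writer.append(key, value);
      }
    } finally {
      writer.close();
      conn.close();
    }
  }
}

The job then simply runs SequenceFileInputFormat over that output path.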

However, it really shouldn't be very complicated to process data directly from the database.

I'd imagine that your custom InputFormat would return as many InputSplit(s) as the number
of maps you want to spawn (e.g. COUNT(*)/no_desired_records_per_map), and your RecordReader.next(k, v)
would work with the InputSplit to track the record range (rows) it is responsible for,
be cognizant of the 'record' type, fetch each record from the database and feed it to the map. You
need to ensure that your InputSplit tracks the record range (rows) you want each map to process
and that it implements the Writable interface for serialization.
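
To make that concrete, here is a rough sketch against the org.apache.hadoop.mapred API. The table name, the SELECT statements, the LongWritable/Text key-value types and the "mydb.url" configuration key are all placeholders I picked for illustration (it also assumes the JDBC driver is already registered):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.*;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

/** One split = one contiguous range of rows.  It must be Writable so the
 *  framework can serialize it and ship it to the tasktracker running the map. */
class RowRangeSplit implements InputSplit {
  long startRow;
  long numRows;

  public RowRangeSplit() { }                                 // needed for deserialization
  public RowRangeSplit(long start, long count) { startRow = start; numRows = count; }

  public long getLength() { return numRows; }                // rough size hint, not bytes
  public String[] getLocations() { return new String[0]; }   // no locality hints for a database
  public void write(DataOutput out) throws IOException { out.writeLong(startRow); out.writeLong(numRows); }
  public void readFields(DataInput in) throws IOException { startRow = in.readLong(); numRows = in.readLong(); }
}

/** Pulls the rows of one RowRangeSplit out of the database and feeds them
 *  to the map as (row id, payload) pairs. */
class DbRecordReader implements RecordReader<LongWritable, Text> {
  private Connection conn;
  private ResultSet rs;
  private long total;
  private long consumed = 0;

  DbRecordReader(RowRangeSplit split, JobConf job) throws IOException {
    total = split.numRows;
    try {
      // "mydb.url" is a made-up config key, not anything the framework defines.
      conn = DriverManager.getConnection(job.get("mydb.url"));
      rs = conn.createStatement().executeQuery("SELECT id, payload FROM records ORDER BY id "
          + "LIMIT " + split.numRows + " OFFSET " + split.startRow);
    } catch (SQLException e) { throw new IOException(e.toString()); }
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    try {
      if (!rs.next()) { return false; }
      key.set(rs.getLong(1));            // row id -> map input key
      value.set(rs.getString(2));        // payload column -> map input value
      consumed++;
      return true;
    } catch (SQLException e) { throw new IOException(e.toString()); }
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return consumed; }
  public float getProgress() { return total == 0 ? 1.0f : consumed / (float) total; }
  public void close() throws IOException {
    try { conn.close(); } catch (SQLException e) { throw new IOException(e.toString()); }
  }
}

/** Carves the table into numSplits row ranges based on COUNT(*). */
public class DbInputFormat implements InputFormat<LongWritable, Text> {
  public void validateInput(JobConf job) throws IOException { }

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    long totalRows = countRows(job);
    long perSplit = (totalRows + numSplits - 1) / numSplits;
    InputSplit[] splits = new InputSplit[numSplits];
    for (int i = 0; i < numSplits; i++) {
      long start = i * perSplit;
      splits[i] = new RowRangeSplit(start, Math.max(0, Math.min(perSplit, totalRows - start)));
    }
    return splits;
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new DbRecordReader((RowRangeSplit) split, job);
  }

  private long countRows(JobConf job) throws IOException {
    try {
      Connection conn = DriverManager.getConnection(job.get("mydb.url"));
      ResultSet rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM records");
      rs.next();
      long n = rs.getLong(1);
      conn.close();
      return n;
    } catch (SQLException e) { throw new IOException(e.toString()); }
  }
}

The job would then set this via JobConf.setInputFormat(DbInputFormat.class). Keeping the split down to just the row range keeps the serialized InputSplit tiny, while each map opens its own JDBC connection in the RecordReader.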

