hadoop-common-user mailing list archives

From Eugeny N Dzhurinsky <b...@redwerk.com>
Subject Re: custom implementation of InputFormat/RecordReader/InputSplit?
Date Mon, 19 Nov 2007 18:45:58 GMT
On Mon, Nov 19, 2007 at 10:40:54PM +0530, Arun C Murthy wrote:
> Yes Eugene, you are on the right track. As you have noted you would need
> custom InputFormat/InputSplit/RecordReader implementations and we seem to
> have only file-based ones in our source tree.
> One option is clearly to pre-fetch the records from the database and store
> it on HDFS or any other filesystem (e.g. local filesystem).
> However, it really shouldn't be very complicated to process data directly
> from the database. 
> I'd imagine that your custom InputFormat would return as many InputSplit(s)
> as the number of maps you want to spawn (for e.g.
> COUNT(*)/no_desired_records_per_map) and your RecordReader.next(k, v) would
> work with the InputSplit to track the record-ranges (rows) it is responsible
> for, be cognizant of the 'record' type, fetch it from the database and feed
> it to the map. You need to ensure that your InputSplit tracks the
> record-ranges (rows) you want each map to process and that it implements the
> Writable interface for serialization.
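The quoted suggestion can be sketched as follows. This is a minimal sketch, not Hadoop API code: the class name and the row-range fields are assumptions, and a real implementation would implement Hadoop's InputSplit interface (which extends Writable) rather than just providing the serialization methods shown here.

```java
import java.io.*;

// Hypothetical split for a database-backed InputFormat: it records the
// row range that one map task is responsible for. In a real Hadoop job
// this class would implement org.apache.hadoop.mapred.InputSplit; only
// the Writable-style serialization is sketched here.
public class RowRangeSplit {
    private long startRow;
    private long numRows;

    // No-arg constructor is required so the framework can
    // instantiate the split before calling readFields()
    public RowRangeSplit() {}

    public RowRangeSplit(long startRow, long numRows) {
        this.startRow = startRow;
        this.numRows = numRows;
    }

    public long getStartRow() { return startRow; }
    public long getNumRows()  { return numRows; }

    // Writable.write: serialize the row range for shipping to a task
    public void write(DataOutput out) throws IOException {
        out.writeLong(startRow);
        out.writeLong(numRows);
    }

    // Writable.readFields: restore the row range on the task side
    public void readFields(DataInput in) throws IOException {
        startRow = in.readLong();
        numRows  = in.readLong();
    }
}
```

A matching RecordReader would then open a database cursor positioned at getStartRow() and hand out getNumRows() rows via next(k, v).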

Thank you for the prompt response, it is really helpful - now I know I'm not
trying to do this task with the wrong tool. However, there are a few things I
probably forgot to mention:

1) I cannot know the number of records in advance. In fact the input is
something like an endless loop: the code which populates records from the
database into a stream is a bit complicated, and there could be cases when it
takes a few hours until new data is prepared by a third-party application for
processing, so the producer thread (which fetches the records and passes them
to the Hadoop handlers) will just block and wait for the data.

2) I would like to maintain a fixed number of jobs at a time and not spawn a
new one until one of the running jobs ends - in other words, I would like to
have a job pool of fixed size (something similar to ThreadPoolExecutor from
the java.util.concurrent package). I assume it would not be hard to implement
such logic on top of Hadoop, but if there is something within Hadoop that
would ease this task, that would be great.
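On the client side, the fixed-size pool idea can be sketched with a plain java.util.concurrent fixed thread pool, where each task stands in for a blocking job submission (e.g. JobClient.runJob). The class name, pool size, and task body below are assumptions for illustration, not anything Hadoop provides.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical driver that keeps at most MAX_CONCURRENT "jobs" running;
// extra submissions queue up until a worker thread becomes free.
public class FixedJobPool {
    static final int MAX_CONCURRENT = 3;

    public static int runJobs(int totalJobs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_CONCURRENT);
        final AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < totalJobs; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    // placeholder for a blocking job submission,
                    // e.g. JobClient.runJob(conf), in a real driver
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed=" + runJobs(10));
    }
}
```

Because each worker thread blocks inside its submitted task until the job finishes, the pool size directly bounds how many jobs run concurrently.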

Thank you in advance!

Eugene N Dzhurinsky
