hadoop-common-user mailing list archives

From "Wasim Bari" <wasimb...@msn.com>
Subject Re: Reading a subset of records from hdfs
Date Thu, 10 Sep 2009 07:57:11 GMT
    I have a similar kind of question.

Is it possible for a job to start reading a file (i.e. start a split) from a 
specific position in the file rather than from the beginning? The idea is: I 
have some information in a file, and the first part of it can only be read 
sequentially, not in parallel, so I read that part with isSplitable = false. 
Once I have this data, the next part of the search can be done in parallel. 
So it would be better to start searching from that point onward in parallel 
with multiple mappers (isSplitable = true).
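Outside Hadoop itself, the core of the question, skip a sequential header, then carve the rest of the file into parallel byte ranges, can be sketched in plain Java. The class and method names below (OffsetSplits, planSplits, readSplit) are illustrative, not Hadoop API; in Hadoop one would instead subclass FileInputFormat so that getSplits() only emits splits past the known header offset.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class OffsetSplits {

    // Divide the byte range [start, end) into n roughly equal splits,
    // mimicking what a custom getSplits() could do for the data that
    // comes after the sequentially-read header.
    static List<long[]> planSplits(long start, long end, int n) {
        List<long[]> splits = new ArrayList<>();
        long chunk = (end - start + n - 1) / n;   // ceiling division
        for (long s = start; s < end; s += chunk) {
            splits.add(new long[] { s, Math.min(s + chunk, end) });
        }
        return splits;
    }

    // Each "mapper" then seeks to its split start and reads only its range,
    // never touching the header bytes before 'start'.
    static byte[] readSplit(String file, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);                      // jump past the header
            byte[] buf = new byte[(int) (end - start)];
            raf.readFully(buf);
            return buf;
        }
    }
}
```

Each split here is a byte range, so as in real MapReduce a record-aware reader would still need to resync to the next record boundary at the start of its range.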


From: "Amandeep Khurana" <amansk@gmail.com>
Sent: Thursday, September 10, 2009 9:49 AM
To: <common-user@hadoop.apache.org>
Subject: Re: Reading a subset of records from hdfs

> Why not just use a higher number of mappers? Why split into multiple
> jobs? Is there a particular case where you think this would be useful?
> On 9/9/09, Rakhi Khatwani <rkhatwani@gmail.com> wrote:
>> Hi,
>>        Suppose i have a hdfs file with 10,000 entries. and i want my job 
>> to
>> process 100 records at one time (to minimize loss of data during job
>> crashes/ network errors etc). so if a job can read a subset of records 
>> from
>> a fine in HDFS, i can combine with chaining to achieve my objective.  for
>> example i have job1 which reads 1-100 lines of input from hdfs, and job 2
>> which reads from 101-200 lines of input...etc.
>>  is there a way in which you can configure a job 2 read only a subset of
>> records from a file in HDFS.
>> Regards,
>> Raakhi
> -- 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
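Stripped of the MapReduce machinery, the per-job windowing Rakhi describes ("job 2 reads lines 101-200") is a skip-then-take over the input. The helper below is an illustrative plain-Java sketch, not a Hadoop API; within Hadoop itself, NLineInputFormat, which assigns a fixed number of input lines to each mapper, is probably the closest built-in fit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecordWindow {

    // Return lines [from, to] of a text file, 1-based and inclusive,
    // the way "job 2 reads lines 101-200" would. Lines before 'from'
    // are read and discarded; reading stops once 'to' is reached.
    static List<String> readLines(Path file, int from, int to) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(file)) {
            String line;
            int n = 0;
            while ((line = r.readLine()) != null && n < to) {
                n++;
                if (n >= from) out.add(line);
            }
        }
        return out;
    }
}
```

The obvious cost, which applies to the chained-jobs plan too, is that every window still has to scan all the lines before it, so job N pays for skipping (N-1)*100 records unless byte offsets of the window starts are recorded up front.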
