hadoop-common-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: randomly pick rows from data files
Date Wed, 08 Sep 2010 03:14:17 GMT
Copy FileInputFormat and make your own subclass of LineRecordReader. I 
did this same thing to make a nice CSV input reader. Yours will drop 
every Nth line.
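
As a rough illustration of the every-Nth-line idea, here is a minimal
Java sketch that leaves Hadoop out entirely. It assumes the goal is to
keep one line in every N and skip the rest. The names NthLineSampler
and keepEveryNth are made up for this example; in a real job the same
counter would sit inside your own record reader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: keeps lines N, 2N, 3N, ...
class NthLineSampler {

    /** Returns every n-th line (1-based counting) from the input. */
    static List<String> keepEveryNth(Iterable<String> lines, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> kept = new ArrayList<>();
        long lineNo = 0;
        for (String line : lines) {
            lineNo++;
            // Keep the line only when its number is a multiple of n.
            if (lineNo % n == 0) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Note that this is systematic rather than random sampling, so it only
gives a fair sample if the row order in the files carries no pattern.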

This would be a very handy tool if you could pull N unique randomly 
chosen sample sets with no correlation, giving a value from 1 to N.


John Clarke wrote:
> Hi,
> I have a few large text files, about 3 GB of data in total, with millions of rows
> of data. Each row only has one value.
> I want to randomly pick 20000 lines and output these as the result.
> My first thought was to have many mappers and one reducer and assign a
> random number as the key and let the sorter sort based on this key. The
> reducer would then output the first X (20k in this case) and exit.
> Is there a better way? I believe the above will work but it seems quite
> inefficient.
> Thanks,
> John
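
The random-key idea quoted above can be sketched in plain Java without
a full sort. This is only an illustration under my own assumptions, not
Hadoop code: the class RandomKeySampler and its sample method are
hypothetical names. Each line gets a random key, and a bounded heap
keeps the k lines with the smallest keys, which is the same result as
sorting by key and taking the first k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical helper, not part of Hadoop.
class RandomKeySampler {

    /** A line paired with the random key assigned to it. */
    private static final class Keyed {
        final double key;
        final String line;

        Keyed(double key, String line) {
            this.key = key;
            this.line = line;
        }
    }

    /** Returns up to k lines chosen uniformly at random, in no particular order. */
    static List<String> sample(Iterable<String> lines, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        // Max-heap on the key: the head is the largest key currently kept.
        PriorityQueue<Keyed> heap = new PriorityQueue<>(
                Math.max(1, k), (a, b) -> Double.compare(b.key, a.key));
        for (String line : lines) {
            double key = rng.nextDouble();
            if (heap.size() < k) {
                heap.add(new Keyed(key, line));
            } else if (k > 0 && key < heap.peek().key) {
                // The new key beats the worst kept key, so swap it in.
                heap.poll();
                heap.add(new Keyed(key, line));
            }
        }
        List<String> out = new ArrayList<>(heap.size());
        for (Keyed e : heap) {
            out.add(e.line);
        }
        return out;
    }
}
```

In a MapReduce job, each mapper could keep its own k smallest keys this
way and emit only those, so the single reducer merges at most k records
per mapper instead of sorting every row.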
