hadoop-common-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: randomly pick rows from data files
Date Wed, 08 Sep 2010 03:14:17 GMT
Copy FileInputFormat and write your own subclass of LineRecordReader. I
did the same thing to make a nice CSV input reader. Yours will drop
every Nth line.
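
As a rough illustration of the filtering part only, here is a minimal
sketch in plain Java. It is not Hadoop code: SkippingLineReader is a
hypothetical stand-in for the LineRecordReader subclass, and it wraps
a BufferedReader instead of an input split. It assumes "drop every Nth
line" means lines N, 2N, 3N, ... are skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a LineRecordReader subclass (no Hadoop types).
final class SkippingLineReader {
    private final BufferedReader in;
    private final int n;        // every n-th line is dropped
    private long lineNo = 0;    // 1-based count of lines read so far

    SkippingLineReader(BufferedReader in, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.in = in;
        this.n = n;
    }

    // Returns the next line that survives the filter, or null at end of input.
    String nextLine() {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (lineNo % n != 0) {
                    return line;    // not a multiple of n: keep it
                }
                // multiple of n: skip this line and keep reading
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Flipping the test to `lineNo % n == 0` would keep only every Nth line
instead, which may be closer to what a sampler wants.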

This would be a very handy tool if you could pull N unique, randomly
chosen sample sets with no correlation, giving a value from 1 to N.

Lance

John Clarke wrote:
> Hi,
>
> I have a few large text files (~3 GB of data in total) with millions of rows
> of data. Each row only has one value.
>
> I want to randomly pick 20000 lines and output these as the result.
>
> My first thought was to have many mappers and one reducer and assign a
> random number as the key and let the sorter sort based on this key. The
> reducer would then output the first X (20k in this case) and exit.
>
> Is there a better way? I believe the above will work but it seems quite
> inefficient.
>
> Thanks,
> John
>
>    
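
For reference, the random-key idea in the quoted message can be
sketched in a single process. This is only an illustration in plain
Java, not a MapReduce job: RandomKeySample and its sample method are
hypothetical names, and the whole input is held in memory, which would
not suit 3 GB of data. Each line gets a random key (the "map" step),
the lines are ordered by key (the "sort" step), and the first x are
returned (the "reduce" step).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical single-process stand-in for the mappers/sorter/reducer pipeline.
final class RandomKeySample {
    // Returns up to x lines chosen by sorting on a random key per line.
    static List<String> sample(List<String> lines, int x, Random rng) {
        int total = lines.size();
        double[] keys = new double[total];
        Integer[] order = new Integer[total];
        for (int i = 0; i < total; i++) {
            keys[i] = rng.nextDouble();   // "map": attach a random key
            order[i] = i;
        }
        // "sort": order the line indices by their random keys
        Arrays.sort(order, Comparator.comparingDouble(i -> keys[i]));
        // "reduce": emit the first x lines and stop
        int limit = Math.min(Math.max(x, 0), total);
        List<String> out = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            out.add(lines.get(order[i]));
        }
        return out;
    }
}
```

With John's numbers, x would be 20000. The sort touches every row even
though only the first x are kept, which is the inefficiency the
question is worried about.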
