hadoop-common-user mailing list archives

From John Clarke <clarke...@gmail.com>
Subject randomly pick rows from data files
Date Mon, 06 Sep 2010 12:24:45 GMT

I have a few large text files, about 3 GB in total, containing millions of
rows. Each row holds a single value.

I want to randomly pick 20000 lines and output these as the result.

My first thought was to have many mappers and one reducer, assign a random
number to each row as its key, and let the sorter sort on this key. The
reducer would then output the first X rows (20k in this case) and exit.
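
To make the idea concrete, here is a rough local sketch in plain Python. It is not Hadoop code: the function names (map_phase, reduce_phase, sample_lines) are made up for illustration, the input splits are simulated as in-memory lists, and heapq.nsmallest stands in for the framework's sort followed by "take the first X".

```python
import heapq
import random
from typing import Iterable, Iterator, List, Optional, Tuple


def map_phase(lines: Iterable[str], rng: random.Random) -> Iterator[Tuple[float, str]]:
    """Mapper stand-in: tag every input line with a random key."""
    for line in lines:
        yield rng.random(), line


def reduce_phase(keyed: Iterable[Tuple[float, str]], sample_size: int) -> List[str]:
    """Single-reducer stand-in: keep the records with the smallest keys."""
    # Equivalent to sorting by key and taking the first sample_size records.
    smallest = heapq.nsmallest(sample_size, keyed, key=lambda kv: kv[0])
    return [line for _, line in smallest]


def sample_lines(splits: Iterable[Iterable[str]], sample_size: int,
                 seed: Optional[int] = None) -> List[str]:
    """Run the simulated map and reduce phases over all input splits."""
    rng = random.Random(seed)
    keyed = (pair for split in splits for pair in map_phase(split, rng))
    return reduce_phase(keyed, sample_size)


if __name__ == "__main__":
    # Two hypothetical input splits standing in for the large text files.
    splits = [[f"row{i}" for i in range(500)],
              [f"row{i}" for i in range(500, 1000)]]
    for line in sample_lines(splits, 20):
        print(line)
```

Because every row gets an independent random key, the X rows with the smallest keys form a uniform random sample without replacement. If fewer than X rows exist, all of them are returned.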

Is there a better way? I believe the above will work but it seems quite

