hive-user mailing list archives

From Leo Alekseyev <dnqu...@gmail.com>
Subject Is it possible to speed up select ... limit query?
Date Wed, 06 Oct 2010 17:58:06 GMT
Suppose I have a large-ish table (over a billion rows) and want to
grab the first 5 million or so.  When I run the query "create table
foo_subset as select col1, col2, col3 from foo limit 5000000", the job
launches one reducer, which runs for a while.  Looking at
HDFS_BYTES_READ/WRITTEN, I see that the reducer read the entire table
(250 gigs) to write 700 megs.  This seems wasteful.  Is there a reason
that a reducer can't somehow just count the records and terminate
after it writes 5000000 or whatever the limit is?  Are there
workarounds to make the process faster?
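
(For readers hitting the same issue: two possible workarounds, sketched below. The `hive.limit.optimize.*` properties exist only in some Hive releases, and the bucket count `32` is just an illustration; verify both against your version and table layout before relying on them.)

```sql
-- Workaround 1 (version-dependent): ask Hive to short-circuit the scan
-- for LIMIT queries instead of reading the whole table. These property
-- names are an assumption here -- check your release's config docs.
SET hive.limit.optimize.enable=true;
SET hive.limit.row.max.size=100000;

CREATE TABLE foo_subset AS
SELECT col1, col2, col3
FROM foo
LIMIT 5000000;

-- Workaround 2: sample the input up front so the mappers touch less
-- data, then trim with LIMIT. Reads only a fraction of the table if
-- foo is bucketed; on an unbucketed table this still scans everything.
CREATE TABLE foo_subset AS
SELECT col1, col2, col3
FROM foo TABLESAMPLE(BUCKET 1 OUT OF 32 ON rand())
LIMIT 5000000;
```

Neither gives exactly "the first 5 million rows" in any guaranteed order, but since a plain `SELECT ... LIMIT` over an unordered table doesn't either, that may be acceptable.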

Thanks,

--Leo
