hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Mon, 16 Aug 2010 21:50:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=19&rev2=20

--------------------------------------------------

  limit 11;
  }}}
  
- This works by ordering all of the rows in a sample of the table (using a single reducer),
and then selecting every nth row (here n=910000).  The value of n is chosen by dividing the
total number of rows in the table by the desired number of ranges, e.g. 12 in this case (one
more than the number of partitioning keys produced by the LIMIT clause).  The assumption here
is that the distribution in the sample matches the overall distribution in the table; if this
is not the case, the resulting partition keys will lead to skew in the parallel sort.
+ This works by ordering all of the rows in a .01% sample of the table (using a single reducer),
and then selecting every nth row (here n=910000).  The value of n is chosen by dividing the
total number of rows in the sample by the desired number of ranges, e.g. 12 in this case (one
more than the number of partitioning keys produced by the LIMIT clause).  The assumption here
is that the distribution in the sample matches the overall distribution in the table; if this
is not the case, the resulting partition keys will lead to skew in the parallel sort.
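
As a rough illustration of the selection step described above (this is not the Hive query itself), the following Python sketch sorts a sample, computes n as the sample row count divided by the desired number of ranges, and keeps every nth row up to the limit. The function name `pick_partition_keys` and the use of an in-memory list are assumptions made for illustration only.

```python
def pick_partition_keys(sample_keys, num_ranges):
    """Return up to num_ranges - 1 split keys from a sample of row keys.

    Hypothetical helper: mirrors the idea of sorting the sample,
    keeping every nth row, and limiting the result to one key fewer
    than the number of ranges.
    """
    if num_ranges < 2:
        return []  # a single range needs no split keys

    ordered = sorted(sample_keys)  # stands in for the single-reducer ordering

    # n = rows in the sample / desired number of ranges
    n = len(ordered) // num_ranges
    if n == 0:
        raise ValueError("sample has fewer rows than the requested ranges")

    # Number the sorted rows from 1 and keep those whose sequence
    # number is a multiple of n.
    every_nth = [key for seq, key in enumerate(ordered, start=1) if seq % n == 0]

    # Equivalent of the LIMIT clause: num_ranges - 1 partitioning keys.
    return every_nth[: num_ranges - 1]


if __name__ == "__main__":
    sample = list(range(120, 0, -1))  # 120 unsorted sample keys
    print(pick_partition_keys(sample, 12))
```

With 120 sampled keys and 12 ranges, n is 10 and 11 keys are kept. If the sample is not representative of the full table, the keys chosen this way will split the data unevenly, which is the skew mentioned above.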
  
  Once you have your sampling query defined, the next step is to save its results to a properly
formatted file which will be used in a subsequent step.  To do this, run commands like the
following:
  
