hive-user mailing list archives

From Sandeep Khurana <skhurana...@gmail.com>
Subject Hive TableSample with number of rows.
Date Thu, 24 Mar 2016 10:07:28 GMT
Hello

Hive supports sampling a table by number of rows with TABLESAMPLE. The
documentation is at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling#LanguageManualSampling-BlockSampling

It states

"For example, the following query will take the first 10 rows from each
input split.
SELECT * FROM source TABLESAMPLE(10 ROWS);
"

But when I look at the code, FetchOperator.java at
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java

I see the method below; note the highlighted lines. It appears to stop
sampling as soon as the target size (targetSize) has been collected from
the splits. That is, if the first input split alone provides enough data,
the remaining splits are never visited and only records from the first
split are returned. This contradicts what the documentation states.

When I run a TABLESAMPLE query with a number of rows, I also get all the
rows from the same split. I validated this by selecting INPUT__FILE__NAME
as well (my data on HDFS spans thousands of files).

Am I missing something, or is this a bug?

private FetchInputFormatSplit[] splitSampling(SplitSample splitSample,
    FetchInputFormatSplit[] splits) {
  long totalSize = 0;
  for (FetchInputFormatSplit split : splits) {
    totalSize += split.getLength();
  }
  List<FetchInputFormatSplit> result =
      new ArrayList<FetchInputFormatSplit>(splits.length);
  long targetSize = splitSample.getTargetSize(totalSize);  // <-- highlighted
  int startIndex = splitSample.getSeedNum() % splits.length;
  long size = 0;
  for (int i = 0; i < splits.length; i++) {
    FetchInputFormatSplit split = splits[(startIndex + i) % splits.length];
    result.add(split);
    long splitgLength = split.getLength();
    if (size + splitgLength >= targetSize) {
      if (size + splitgLength > targetSize) {              // <-- highlighted
        split.shrinkedLength = targetSize - size;          // <-- highlighted
      }                                                    // <-- highlighted
      break;                                               // <-- highlighted
    }                                                      // <-- highlighted
    size += splitgLength;
  }
  return result.toArray(new FetchInputFormatSplit[result.size()]);
}
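
To make the early exit easier to see, here is a minimal standalone sketch of the same loop. It is not Hive code: the class SplitSamplingSketch and the method sample are hypothetical names, splits are reduced to plain byte lengths, and the target size and seed are passed in directly instead of coming from SplitSample.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the loop in splitSampling; not part of Hive.
public class SplitSamplingSketch {

    // Returns {splitIndex, bytesTaken} pairs in the order the splits are visited.
    static List<long[]> sample(long[] splitLengths, long targetSize, int seed) {
        List<long[]> result = new ArrayList<>();
        if (splitLengths.length == 0) {
            return result;
        }
        // Assumption: floorMod keeps the start index non-negative for any seed.
        int startIndex = Math.floorMod(seed, splitLengths.length);
        long size = 0;
        for (int i = 0; i < splitLengths.length; i++) {
            int index = (startIndex + i) % splitLengths.length;
            long length = splitLengths[index];
            if (size + length >= targetSize) {
                // Take only what is still needed from this split, then stop.
                result.add(new long[] {index, targetSize - size});
                break;
            }
            result.add(new long[] {index, length});
            size += length;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] lengths = {1000, 1000, 1000};
        // The first split already covers the target, so the others are never visited.
        for (long[] pick : sample(lengths, 300, 0)) {
            System.out.println("split " + pick[0] + ": " + pick[1] + " bytes");
        }
    }
}
```

With three splits of 1000 bytes each, a target of 300 bytes and seed 0, only split 0 is selected, which matches the behaviour I observe with INPUT__FILE__NAME.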

The Hive JIRA issue for this is https://issues.apache.org/jira/browse/HIVE-3401 .


-- 
Thanks and regards
Sandeep Khurana
