hive-user mailing list archives

From Sandeep Khurana <skhurana...@gmail.com>
Subject Re: Hive TableSample with number of rows.
Date Sat, 26 Mar 2016 10:18:11 GMT
Is it worth raising a bug in Hive?

On Thu, Mar 24, 2016 at 3:37 PM, Sandeep Khurana <skhurana333@gmail.com>
wrote:

> Hello
>
> Hive supports table sampling by number of rows. The
> documentation is at
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling#LanguageManualSampling-BlockSampling
>
> It states
>
> "For example, the following query will take the first 10 rows from each
> input split.
> SELECT * FROM source TABLESAMPLE(10 ROWS);
> "
>
> But when I look at the code in FetchOperator.java at
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
>
> I see the method below; check the highlighted lines. It looks like sampling
> stops as soon as the required number of records (size) has been collected
> from the splits, i.e. if the first input split provides enough data, the
> remaining splits are never visited and only records from the first split
> are returned. This contradicts what the documentation states.
>
> When I run a query using TABLESAMPLE with a number of rows, I also get all
> the rows from the same split. I validated this by selecting
> "INPUT__FILE__NAME" as well (my data on HDFS has thousands of files).
>
> Am I missing something, or is this a bug?
>
> private FetchInputFormatSplit[] splitSampling(SplitSample splitSample,
>     FetchInputFormatSplit[] splits) {
>   long totalSize = 0;
>   for (FetchInputFormatSplit split : splits) {
>     totalSize += split.getLength();
>   }
>   List<FetchInputFormatSplit> result =
>       new ArrayList<FetchInputFormatSplit>(splits.length);
>   long targetSize = splitSample.getTargetSize(totalSize);  // emphasized (bold) in the original mail
>   int startIndex = splitSample.getSeedNum() % splits.length;
>   long size = 0;
>   for (int i = 0; i < splits.length; i++) {
>     FetchInputFormatSplit split = splits[(startIndex + i) % splits.length];
>     result.add(split);
>     long splitgLength = split.getLength();
>     if (size + splitgLength >= targetSize) {
>       // emphasized (bold) in the original mail: begin
>       if (size + splitgLength > targetSize) {
>         split.shrinkedLength = targetSize - size;
>       }
>       break;
>       // emphasized (bold) in the original mail: end
>     }
>     size += splitgLength;
>   }
>   return result.toArray(new FetchInputFormatSplit[result.size()]);
> }
>
> The Hive JIRA issue for this is https://issues.apache.org/jira/browse/HIVE-3401
>
>
> --
> Thanks and regards
> Sandeep Khurana
>
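To make the early exit easier to see, here is a minimal standalone sketch of the selection loop from the quoted method. It is not Hive code: the class name SplitSamplingSketch, the method selectSplits, and the plain long[] of split lengths are hypothetical stand-ins for FetchInputFormatSplit and SplitSample, and the shrinkedLength adjustment is left out. It assumes a non-empty array and a non-negative seed.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSamplingSketch {

    /**
     * Returns the indices of the splits that the quoted loop would keep.
     * splitLengths, targetSize and seedNum are hypothetical inputs standing in
     * for the values Hive derives from its split and sample objects.
     */
    public static List<Integer> selectSplits(long[] splitLengths, long targetSize, int seedNum) {
        List<Integer> chosen = new ArrayList<>();
        int startIndex = seedNum % splitLengths.length;
        long size = 0;
        for (int i = 0; i < splitLengths.length; i++) {
            int index = (startIndex + i) % splitLengths.length;
            chosen.add(index);
            long splitLength = splitLengths[index];
            // Same condition as the quoted code: stop as soon as the target is reached.
            if (size + splitLength >= targetSize) {
                break;
            }
            size += splitLength;
        }
        return chosen;
    }

    public static void main(String[] args) {
        long[] lengths = {128, 128, 128, 128};
        // A target smaller than one split keeps only the starting split.
        System.out.println(selectSplits(lengths, 100, 0)); // [0]
        // A larger target walks forward (wrapping around) until it is covered.
        System.out.println(selectSplits(lengths, 300, 2)); // [2, 3, 0]
    }
}
```

With four equal splits and a target that fits inside the first one, only that split is kept, which matches what I see when selecting INPUT__FILE__NAME.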



-- 
Thanks and regards
Sandeep Khurana
