drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5152) Enhance the mock data source: better data, SQL access
Date Sat, 07 Jan 2017 04:32:58 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806751#comment-15806751 ]

ASF GitHub Bot commented on DRILL-5152:
---------------------------------------

Github user sohami commented on a diff in the pull request:

    https://github.com/apache/drill/pull/708#discussion_r95051723
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/mock/ExtendedMockRecordReader.java ---
    @@ -0,0 +1,149 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.store.mock;
    +
    +import java.util.ArrayList;
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Set;
    +
    +import org.apache.drill.common.exceptions.ExecutionSetupException;
    +import org.apache.drill.common.types.TypeProtos.MajorType;
    +import org.apache.drill.exec.exception.OutOfMemoryException;
    +import org.apache.drill.exec.exception.SchemaChangeException;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.ops.FragmentContext;
    +import org.apache.drill.exec.ops.OperatorContext;
    +import org.apache.drill.exec.physical.impl.OutputMutator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.store.AbstractRecordReader;
    +import org.apache.drill.exec.store.mock.MockGroupScanPOP.MockColumn;
    +import org.apache.drill.exec.store.mock.MockGroupScanPOP.MockScanEntry;
    +import org.apache.drill.exec.vector.AllocationHelper;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +public class ExtendedMockRecordReader extends AbstractRecordReader {
    +
    +  private ValueVector[] valueVectors;
    +  private int batchRecordCount;
    +  private int recordsRead;
    +
    +  private final MockScanEntry config;
    +  private final FragmentContext context;
    +  private final ColumnDef fields[];
    +
    +  public ExtendedMockRecordReader(FragmentContext context, MockScanEntry config) {
    +    this.context = context;
    +    this.config = config;
    +
    +    fields = buildColumnDefs( );
    +  }
    +
    +  private ColumnDef[] buildColumnDefs() {
    +    List<ColumnDef> defs = new ArrayList<>( );
    +
    +    // Look for duplicate names. Bad things happen when the same name
    +    // appears twice.
    +
    +    Set<String> names = new HashSet<>();
    +    MockColumn cols[] = config.getTypes();
    +    for ( int i = 0;  i < cols.length;  i++ ) {
    +      MockColumn col = cols[i];
    +      if (names.contains(col.name)) {
    +        throw new IllegalArgumentException("Duplicate column name: " + col.name);
    +      }
    +      names.add(col.name);
    +      int repeat = Math.max( 1, col.getRepeatCount( ) );
    +      if ( repeat == 1 ) {
    +        defs.add( new ColumnDef(col) );
    +      } else {
    +        for ( int j = 0;  j < repeat;  j++ ) {
    +          defs.add( new ColumnDef(col, j+1) );
    +        }
    +      }
    +    }
    +    ColumnDef[] defArray = new ColumnDef[defs.size()];
    +    defs.toArray(defArray);
    +    return defArray;
    +  }
    +
    +  private int getEstimatedRecordSize(MockColumn[] types) {
    +    int size = 0;
    +    for (int i = 0; i < fields.length; i++) {
    +      size += TypeHelper.getSize(fields[i].getConfig().getMajorType());
    +    }
    +    return size;
    +  }
    +
    +  private MaterializedField getVector(String name, MajorType type, int length) {
    --- End diff --
    
    The `length` parameter does not appear to be used anywhere. Is it needed?


> Enhance the mock data source: better data, SQL access
> -----------------------------------------------------
>
>                 Key: DRILL-5152
>                 URL: https://issues.apache.org/jira/browse/DRILL-5152
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Tools, Build & Test
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> Drill provides a mock data storage engine that generates random data. The mock engine
> is used in some older unit tests that need a volume of data, but that are not too particular
> about the details of the data.
> The mock data source continues to have use even for modern tests. For example, the work
> in the external storage batch requires tests with varying amounts of data, but the exact form
> of the data is not important, just the quantity. For example, if we want to ensure that spilling
> happens at various trigger points, we need to read the right amount of data for that trigger.
> The existing mock data source has two limitations:
> 1. It generates only "black/white" (alternating) values, which is awkward for use in
> sorting.
> 2. The mock generator is accessible only from a physical plan, but not from SQL queries.
> This enhancement proposes to fix both limitations:
> 1. Generate a uniform, randomly distributed set of values.
> 2. Provide an encoding that lets a SQL query specify the data to be generated.
> Example SQL query:
> {code}
> SELECT id_i, name_s50 FROM `mock`.employee_10K;
> {code}
> The above says to generate two fields: INTEGER (the "_i" suffix) and VARCHAR(50) (the
> "_s50" suffix); and to generate 10,000 rows (the "_10K" suffix on the table).

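To make the naming convention in the description concrete, here is a minimal sketch of how such names might be decoded. It is not the parser from the pull request: the class `MockNameSketch`, its `ColumnSpec` type, and both regular expressions are hypothetical, and it handles only the three suffixes the description mentions ("_i", "_s50", and "_10K").

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical decoder for the mock naming convention; not Drill's actual code. */
final class MockNameSketch {

  /** Decoded column type: SQL type name plus width (0 when no width applies). */
  static final class ColumnSpec {
    public final String sqlType;
    public final int width;

    ColumnSpec(String sqlType, int width) {
      this.sqlType = sqlType;
      this.width = width;
    }

    @Override
    public String toString() {
      return width > 0 ? sqlType + "(" + width + ")" : sqlType;
    }
  }

  // Assumed shape: trailing "_<letter><optional digits>", e.g. "_i" or "_s50".
  private static final Pattern COLUMN_SUFFIX = Pattern.compile("_([a-z])(\\d*)$");

  // Assumed shape: trailing "_<digits><optional K>", e.g. "_10K".
  private static final Pattern TABLE_SUFFIX = Pattern.compile("_(\\d+)(K?)$");

  /** Decodes the type suffix of a column name such as "id_i" or "name_s50". */
  static ColumnSpec parseColumn(String name) {
    Matcher m = COLUMN_SUFFIX.matcher(name);
    if (!m.find()) {
      throw new IllegalArgumentException("No type suffix: " + name);
    }
    int width = m.group(2).isEmpty() ? 0 : Integer.parseInt(m.group(2));
    switch (m.group(1)) {
      case "i":
        return new ColumnSpec("INTEGER", 0);
      case "s":
        return new ColumnSpec("VARCHAR", width);
      default:
        // Only the suffixes named in the issue description are handled here.
        throw new IllegalArgumentException("Unsupported type suffix: " + name);
    }
  }

  /** Decodes the row-count suffix of a table name such as "employee_10K". */
  static int parseRowCount(String table) {
    Matcher m = TABLE_SUFFIX.matcher(table);
    if (!m.find()) {
      throw new IllegalArgumentException("No row-count suffix: " + table);
    }
    int count = Integer.parseInt(m.group(1));
    // "K" multiplies by 1,000, matching "10K" = 10,000 rows in the description.
    return m.group(2).isEmpty() ? count : count * 1000;
  }

  public static void main(String[] args) {
    System.out.println(parseColumn("id_i"));
    System.out.println(parseColumn("name_s50"));
    System.out.println(parseRowCount("employee_10K"));
  }
}
```

Running `main` decodes the two columns and the table from the example query. Any other suffix is rejected rather than guessed at, since the description does not say what else the encoding supports.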


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
