Mailing-List: contact reviews-help@spark.apache.org; run by ezmlm
Precedence: bulk
From: tejasapatil <git@git.apache.org>
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
References: <git-pr-15959-spark@git.apache.org>
In-Reply-To: <git-pr-15959-spark@git.apache.org>
Subject: [GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...
Content-Type: text/plain
Message-Id: <20161122041212.B4D15F1593@git1-us-west.apache.org>
Date: Tue, 22 Nov 2016 04:12:12 +0000 (UTC)
archived-at: Tue, 22 Nov 2016 04:12:16 -0000

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15959#discussion_r89039895
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -58,60 +61,127 @@ case class Statistics(
       }
     }
     
    +
     /**
    - * Statistics for a column.
    + * Statistics collected for a column.
    + *
    + * 1. Supported data types are defined in `ColumnStat.supportsType`.
    + * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
    + * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
    + * TimestampType we store java.sql.Timestamp.
    + * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
    + *
    + * @param ndv number of distinct values
    + * @param min minimum value
    + * @param max maximum value
    + * @param numNulls number of nulls
    + * @param avgLen average length of the values. For fixed-length types, this should be a constant.
    + * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
      */
    -case class ColumnStat(statRow: InternalRow) {
    +// TODO: decide if we want to use bigint to represent ndv and numNulls.
    +case class ColumnStat(
    --- End diff --
    
    Can you add some basic sanity checks? For example:
    - `max >= min`
    - `maxLen >= avgLen`
    - `if (ndv == 1) then min == max`
    
    Floats / decimals might behave badly in these comparisons, but it's good to validate before anyone consumes these stats.
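
A minimal sketch of what these checks could look like, using `require` in the class body. The constructor of `ColumnStat` is cut off in the diff above, so `SimpleColumnStat` and its field types are hypothetical stand-ins, not the PR's actual definition. `min` and `max` are assumed to be `Long` here (integral types are upcast to longs per the doc comment), which sidesteps the float case where any comparison against NaN is false.

```scala
// Hypothetical, simplified stand-in for ColumnStat. The real field types are
// not visible in the truncated diff, so Long is an assumption for all fields.
final case class SimpleColumnStat(
    ndv: Long,      // number of distinct values
    min: Long,      // minimum value (integral types upcast to Long)
    max: Long,      // maximum value
    numNulls: Long, // number of nulls
    avgLen: Long,   // average length of the values
    maxLen: Long) { // maximum length of the values

  // Each require throws IllegalArgumentException when its condition is false,
  // so an inconsistent instance can never be constructed.
  require(max >= min, s"max ($max) must be >= min ($min)")
  require(maxLen >= avgLen, s"maxLen ($maxLen) must be >= avgLen ($avgLen)")
  require(ndv != 1 || min == max, s"ndv == 1 but min ($min) != max ($max)")
}
```

With this sketch, `SimpleColumnStat(ndv = 1, min = 5, max = 5, numNulls = 0, avgLen = 8, maxLen = 8)` constructs normally, while passing `min = 10, max = 1` fails at construction time rather than when a consumer reads the stats.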

