From: tejasapatil
To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...
Message-Id: <20161122041212.B4D15F1593@git1-us-west.apache.org>
Date: Tue, 22 Nov 2016 04:12:12 +0000 (UTC)

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15959#discussion_r89039895

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -58,60 +61,127 @@ case class Statistics(
         }
       }

    +
     /**
    - * Statistics for a column.
    + * Statistics collected for a column.
    + *
    + * 1. Supported data types are defined in `ColumnStat.supportsType`.
    + * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
    + * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
    + * TimestampType we store java.sql.Timestamp.
    + * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
    + *
    + * @param ndv number of distinct values
    + * @param min minimum value
    + * @param max maximum value
    + * @param numNulls number of nulls
    + * @param avgLen average length of the values. For fixed-length types, this should be a constant.
    + * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
      */
    -case class ColumnStat(statRow: InternalRow) {
    +// TODO: decide if we want to use bigint to represent ndv and numNulls.
    +case class ColumnStat(
    --- End diff --

    Can you add some basic sanity checks? For example:
    - `max >= min`
    - `maxLen >= avgLen`
    - `if (ndv == 1) then min == max`

    Floats / decimals might behave badly, but it's good to check before anyone consumes these stats.
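
To make the three suggested checks concrete, here is a minimal sketch of what they might look like. It assumes a simplified stand-in class, `SimpleColumnStat`, with optional `min`/`max` and `Long` lengths; that class, the `ColumnStatChecks` object, and the `violations` helper are hypothetical names for illustration only and are not the `ColumnStat` from this PR or any Spark API.

```scala
object ColumnStatChecks {

  // Hypothetical, simplified stand-in for the ColumnStat in the diff above.
  // Field names follow the @param list; the types are assumptions.
  final case class SimpleColumnStat[T](
      ndv: Long,
      min: Option[T],
      max: Option[T],
      numNulls: Long,
      avgLen: Long,
      maxLen: Long)

  /** Returns one message per violated check; an empty result means all checks passed. */
  def violations[T](s: SimpleColumnStat[T])(implicit ord: Ordering[T]): Seq[String] = {
    // Compare min and max only when both are present; a missing bound passes trivially.
    def bounds(p: (T, T) => Boolean): Boolean =
      (for (lo <- s.min; hi <- s.max) yield p(lo, hi)).getOrElse(true)

    val checks: Seq[(Boolean, String)] = Seq(
      bounds((lo, hi) => ord.gteq(hi, lo)) -> "max must be >= min",
      (s.maxLen >= s.avgLen) -> "maxLen must be >= avgLen",
      (s.ndv != 1L || bounds((lo, hi) => ord.equiv(lo, hi))) ->
        "min must equal max when ndv == 1"
    )
    // Keep only the messages whose condition evaluated to false.
    checks.collect { case (false, message) => message }
  }
}
```

Returning a list of messages rather than throwing lets the caller decide whether to fail, log, or discard the stats. The sketch does nothing special for floating-point values such as NaN, which is the case flagged above as likely to behave badly.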