Message-ID: <1701853549.1224793844309.JavaMail.jira@brutus>
Date: Thu, 23 Oct 2008 13:30:44 -0700 (PDT)
From: "Ashish Thusoo (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4488) [Hive]: Add ability to compute statistics on hive tables
In-Reply-To: <91653494.1224701024135.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642266#action_12642266 ]

Ashish Thusoo commented on HADOOP-4488:
---------------------------------------

All good points. Comments are as follows:

for 1. Yes, we can store this relatively easily - will add it.

for 2. The number of bins is optional, not mandatory. We can store the system default as we do for the other variables in hive-conf.xml.

for 3. I am just planning to store the distinct values - no rounding or not storing them at all. Don't want to overload the semantics of this. Not sure how useful rounding is given that

for 4. There are a number of other useful stats about strings; clearly prefixes are useful for LIKE 'xyz%' kinds of operations. We can perhaps add these later, considering that we do not even have the base level stats. We can discuss this more to see what makes sense for LIKE and regex kinds of predicates.

for 5. Possible, though if we have a sufficient number of bins the utility of this stat decreases. But will evaluate this nonetheless.

for 6. Implementable, though computationally prohibitive, and it is not very clear how much benefit this would give - clearly if most of the columns are weakly correlated (independent) then this is not of much use, and many times that is quite true. Again, this is more advanced stuff. Probably better in a follow-on after the base level stats are working.

Will also add to this list the avg size per column that you were mentioning yesterday.

So the new list is:

Table stats:
1. # rows
2. size of partition
3. avg size of a row
4. # blocks
5. # files

Column stats:
1. # distinct values
2. # null values
3. min 3 values
4. max 3 values
5. histogram: frequency and height balanced
6. avg size of column


> [Hive]: Add ability to compute statistics on hive tables
> --------------------------------------------------------
>
>                 Key: HADOOP-4488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4488
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add commands to collect partition and column level statistics in hive.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
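
The column stats proposed in the comment above can be illustrated with a small in-memory sketch. This is not the Hive implementation; the function name `column_stats`, the `num_bins` parameter, and the result keys are hypothetical. It assumes "frequency" means a per-value count and "height balanced" means bucket upper bounds that each cover roughly the same number of rows, following common database usage. It also assumes the min/max 3 values are taken over distinct values and that column size is measured as string length.

```python
from collections import Counter


def column_stats(values, num_bins=4):
    """Compute illustrative column statistics for one in-memory column.

    `values` may contain None for nulls. `num_bins` is a hypothetical
    stand-in for the optional bin count mentioned in the comment.
    """
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)      # frequency histogram: value -> row count
    distinct = sorted(freq)       # distinct values in ascending order

    # Height-balanced histogram: pick bucket upper bounds so that each
    # bucket covers roughly len(non_null) / num_bins rows.
    ordered = sorted(non_null)
    bounds = []
    if ordered:
        for i in range(1, num_bins + 1):
            idx = max(0, (i * len(ordered)) // num_bins - 1)
            bounds.append(ordered[idx])

    # Assumption: "size" is approximated by the length of the string form.
    avg_size = (
        sum(len(str(v)) for v in non_null) / len(non_null) if non_null else 0.0
    )

    return {
        "num_distinct": len(freq),
        "num_nulls": len(values) - len(non_null),
        "min3": distinct[:3],
        "max3": distinct[-3:][::-1],
        "frequency_histogram": dict(freq),
        "height_balanced_bounds": bounds,
        "avg_size": avg_size,
    }
```

Keeping the bin count as an argument with a default mirrors the point made for item 2: the caller may omit it and fall back to a system-wide setting.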