From: Jason Altekruse
Date: Fri, 27 Feb 2015 08:16:21 -0800
Subject: Re: understanding groupCount & valueCount in repeated vectors
To: dev@drill.apache.org
Cc: drill

Hanifi,

I think we should try to avoid using the word 'cell' to refer to elements
within a single value. We often explain the concept of complex data in Drill
by describing a list or map type being stored in a single database 'cell'.

Overall I totally agree about the lack of clarity. I would advocate for
something like getChildCount for the number of members below the lists.
Since current database language does not include hierarchies/nesting, I
think this is a safe naming convention.

In response to Jacques's comments, we might be at a loss trying to unify the
concepts of individual values in the case of scalar vectors and entire
lists/nested structures with a simple name change. It might just be clearest
to document the getValueCount method at the top-level value vector interface
to clearly state that it should match the number of records. Even beyond the
confusion around repeated types, this number also currently includes nulls,
which some devs might find confusing if we don't document it.

-Jason

On Fri, Feb 27, 2015 at 6:24 AM, Jacques Nadeau wrote:

> I think value is the problem word. I'm not sure it is better for groupings
> or cells in the case of repeated types. What do they use in Parquet?
>
> I'd also like to see this proposal in the context of a larger proposed
> design spec for that jira.
>
> On Feb 26, 2015 5:52 PM, "Hanifi Gunes" wrote:
>
> > Hey everyone,
> >
> > Scalar ValueVector(VV) types implement getValueCount method, which returns
> > the number of "value"s stored in the vector. I would expect the same be
> > true for RepeatedVVs as well. However, getValueCount on repeated types
> > report number of inner/sub-values stored and introduces another method
> > called groupCount to report actual number of "value"s stored.
> >
> > This becomes really confusing and somewhat inconsistent (especially for
> > RepeatedList) as one would expect #getValueCount should report the number
> > of values regardless if the stored value type is nested or flat.
> >
> > As part of DRILL-2150, I am refactoring VVs so that getValueCount
> > universally returns the number of values stored. Alongside, I plan to
> > introduce a new method getCellCount that reports total number of
> > sub-values/cells stored in a repeated vector.
> >
> > I'd like to probe if anyone has any concerns relating to this. Please let
> > me know.
> >
> >
> > Thanks.
> > -Hanifi
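
To make the two counts discussed in this thread concrete, here is a minimal sketch of an offset-based repeated vector. It is not Drill's actual code: the class RepeatedIntSketch, the addRecord helper, and the storage layout are all assumptions made for illustration, and getChildCount is only the name suggested above, not an existing method. The point is just that getValueCount tracks records (including empty or null ones) while the child count tracks the inner elements.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a repeated int vector. This is NOT Drill's
 * RepeatedValueVector; names and layout are assumptions for illustration.
 */
public class RepeatedIntSketch {
    // offsets.get(i) .. offsets.get(i + 1) delimits the children of record i.
    private final List<Integer> offsets = new ArrayList<>();
    // All inner elements of all records, stored back to back.
    private final List<Integer> children = new ArrayList<>();

    public RepeatedIntSketch() {
        offsets.add(0); // start offset of the first record
    }

    /**
     * Appends one record. An empty or null list still occupies a record slot
     * (treating null like an empty list is an assumption of this sketch).
     */
    public void addRecord(int... elements) {
        if (elements != null) {
            for (int e : elements) {
                children.add(e);
            }
        }
        offsets.add(children.size());
    }

    /** Number of records, matching the row count regardless of nesting. */
    public int getValueCount() {
        return offsets.size() - 1;
    }

    /** Total number of inner elements across all records. */
    public int getChildCount() {
        return children.size();
    }

    public static void main(String[] args) {
        RepeatedIntSketch v = new RepeatedIntSketch();
        v.addRecord(1, 2, 3);
        v.addRecord();        // empty list, still one record
        v.addRecord(4, 5);
        System.out.println("getValueCount = " + v.getValueCount()); // 3
        System.out.println("getChildCount = " + v.getChildCount()); // 5
    }
}
```

With three records holding [1, 2, 3], [], and [4, 5], the record-level count is 3 and the child-level count is 5, which is the distinction the naming proposals are trying to capture.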