From: Amir Youssefi
Subject: Re: Drill native format
Date: Fri, 14 Sep 2012 14:15:28 -0700
To: "drill-dev@incubator.apache.org"

"Nested data is not yet implemented" in BigQuery (if I recall the exact words correctly). Quoting the speaker at the BigQuery presentation at the Google Technology User Group last week at the Googleplex (intentionally not citing the speaker's name).

-ay

On Sep 14, 2012, at 1:28 PM, David Gruzman wrote:

> I assume that evolution of BigQuery reflects resolution of Dremel... If
> somebody has information on it, it would be great.
> The storage system should understand that all files comprising the horizontal
> partition of the table are one logical entity, and should store them
> together / in some proximity. I agree that PAX will be much more
> convenient. The question is: is there a performance penalty of PAX vs. file
> per column?
> David
>
> On Fri, Sep 14, 2012 at 11:21 PM, Tomer Shiran wrote:
>
>> Is there any public information suggesting that Google moved away from
>> supporting nested data? Clearly BigQuery doesn't yet allow nested data, but
>> not sure that applies to Dremel.
>>
>> There are challenges with one file per column. How do you ensure that a
>> single record is located on a single machine to avoid costly record
>> reconstruction?
>>
>> On Fri, Sep 14, 2012 at 1:05 PM, David Gruzman wrote:
>>
>>> Hi All,
>>> I would like to discuss the question of what will be the native format for
>>> Drill. The original Google Dremel paper defined their hierarchical columnar
>>> data format. Since then Google shifted from the hierarchical data format...
>>> So the question is whether it makes sense to stick with it.
>>> If we are also moving to a simple flat format, we need our own format that
>>> we have to support as "native". In the case of Drill I would define that
>>> native support as "high performance".
>>> I think we can go to some kind of PAX format with comprehensive metadata in
>>> the header, so each file is completely self-contained and can be understood
>>> and processed without any external data.
>>> The alternative is to have a single file per column. As far as I remember
>>> from our OpenDremel work, the main decision point is whether we can read one
>>> column from the file without loading unnecessary data from other columns
>>> into node memory.
>>> With best regards,
>>> David
>>>
>>
>> --
>> Tomer Shiran
>> Director of Product Management | MapR Technologies | 650-804-8657
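
For readers of the archive, the PAX option discussed in this thread (one self-contained file, metadata in the header, a single column readable without pulling in the others) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the "PAX0" magic bytes, the JSON header, the int64-only columns, and the names write_table and read_column are hypothetical and are not taken from Drill, Dremel, or OpenDremel.

```python
import io
import json
import struct

MAGIC = b"PAX0"  # hypothetical magic bytes, not a real Drill format


def write_table(buf, columns):
    """Write {name: [int, ...]} as one self-describing, column-grouped file."""
    # Each column is packed contiguously as little-endian int64 values.
    blobs = {
        name: struct.pack(f"<{len(vals)}q", *vals)
        for name, vals in columns.items()
    }
    meta, offset = [], 0
    for name, blob in blobs.items():
        # Offsets are relative to the start of the data area (after the header).
        meta.append({"name": name, "offset": offset, "length": len(blob)})
        offset += len(blob)
    header = json.dumps({"columns": meta}).encode("utf-8")
    buf.write(MAGIC)
    buf.write(struct.pack("<I", len(header)))
    buf.write(header)
    for blob in blobs.values():
        buf.write(blob)


def read_column(buf, wanted):
    """Read one column by seeking to it; returns (values, bytes_read)."""
    buf.seek(0)
    if buf.read(4) != MAGIC:
        raise ValueError("not a PAX0 file")
    (header_len,) = struct.unpack("<I", buf.read(4))
    header = json.loads(buf.read(header_len))
    data_start = 8 + header_len
    for col in header["columns"]:
        if col["name"] == wanted:
            # Jump straight to the column; other columns are never read.
            buf.seek(data_start + col["offset"])
            raw = buf.read(col["length"])
            values = list(struct.unpack(f"<{len(raw) // 8}q", raw))
            return values, data_start + len(raw)
    raise KeyError(wanted)


# Usage: build a two-column table in memory and read back only "clicks".
table = io.BytesIO()
write_table(table, {"user_id": [1, 2, 3], "clicks": [10, 20, 30]})
clicks, bytes_read = read_column(table, "clicks")
```

In this toy layout the reader touches only the header and the requested column's bytes, which is the property David names as the main decision point. It says nothing about the performance question raised above (PAX vs. file per column), since a real storage layer would also have to deal with block placement, compression, and nested data.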