Mailing-List: contact drill-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: drill-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of yangzhuoluo@gmail.com
 designates 209.85.216.54 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAFNsrOeN12MALsHQECsqoteiZUF9k5uq5QVzGgmZM4VeTi4YrA@mail.gmail.com>
References: 
 <CACzgUQiY6iMQW6MO9GnK1x5oNXxLxc7tN-d9aeCFpUe-U72b7A@mail.gmail.com>
 <CAMHgjMpP4udZj=EWRTukbJrkb8xJtwF1qm7m-iO_S6==HjgAmw@mail.gmail.com>
 <CACzgUQge_bFyRWUXypQK_VAY=BC1V=9qjVUMfH3KufSzgnm6VA@mail.gmail.com>
 <4DD0EF35-E83C-4520-BAF3-1A29E8F76D2A@gmail.com>
 <CAFNsrOeN12MALsHQECsqoteiZUF9k5uq5QVzGgmZM4VeTi4YrA@mail.gmail.com>
From: =?UTF-8?B?Q2xhcmsgWWFuZyAo5p2o5Y2T6I2mKQ==?= <yangzhuoluo@gmail.com>
Date: Sat, 15 Sep 2012 10:26:14 +0800
Message-ID: 
 <CABHfxz5Ka2B+k_TaTD1y=HnS=aBgpm80ZtYuD0WpTWfaFdFsVA@mail.gmail.com>
Subject: Re: Drill native format
To: drill-dev@incubator.apache.org
Content-Type: multipart/alternative; boundary=485b397dd68df476a004c9b4441f

--485b397dd68df476a004c9b4441f
Content-Type: text/plain; charset=ISO-8859-1

Hi

I have been working on column storage for a while.
I think the most important thing for distributed column storage is data
locality on MapReduce (see Section 4.1 of the paper).
That is, all the column files of one horizontal partition should be stored
on the same node, so that computation is local and data transfer is reduced.
To achieve this, big data is usually horizontally partitioned and
distributed first, and vertically partitioned second. Some strategy is
needed to do this; in HDFS it is the "block placement policy".
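
To make the ordering concrete, here is a minimal sketch in Python. It is
only an illustration of "partition horizontally and distribute first, then
partition vertically"; the names place_columns and read_record, the node
names, and the round-robin assignment are all hypothetical and are not the
HDFS block placement API.

```python
from collections import defaultdict


def place_columns(rows, columns, nodes, rows_per_split):
    """Return {node: {(split_id, column): [values]}}.

    Step 1: cut the table horizontally into splits of rows_per_split rows.
    Step 2: assign each split to one node (round-robin, for illustration
            only; a real placement policy would be smarter).
    Step 3: cut the split vertically into one "file" per column and keep
            all of those files on the node chosen in step 2.
    """
    placement = defaultdict(dict)
    for start in range(0, len(rows), rows_per_split):
        split_id = start // rows_per_split
        split = rows[start:start + rows_per_split]
        node = nodes[split_id % len(nodes)]
        for col in columns:
            placement[node][(split_id, col)] = [row[col] for row in split]
    return dict(placement)


def read_record(placement, node, split_id, offset, columns):
    """Rebuild one record using only column files stored on one node."""
    files = placement[node]
    return {col: files[(split_id, col)][offset] for col in columns}


# Hypothetical sample table: 6 rows, 3 columns, 2 nodes, 2 rows per split.
rows = [{"id": i, "name": "user%d" % i, "score": i * 10} for i in range(6)]
columns = ["id", "name", "score"]
placement = place_columns(rows, columns, ["node-a", "node-b"], rows_per_split=2)
record = read_record(placement, "node-b", 1, 1, columns)
```

Because every column file of a split lives on the same node, read_record
never has to fetch a column from another machine to rebuild a record.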


Cheers,
Zhuoluo (Clark) Yang


2012/9/15 karthik tunga <karthik.tunga@gmail.com>

> Hi,
>
> This paper (http://arxiv.org/pdf/1105.4252.pdf) has column oriented (one
> file per column) vs RCFile.
> They use skip list and lazy record construction.
>
> Cheers,
> Karthik
>
> On 14 September 2012 17:15, Amir Youssefi <amir.youssefi@gmail.com> wrote:
>
> > "Nested data is not yet implemented" in BigQuery (if I recall exact words
> > correctly). Quoting speaker at the BigQuery presentation at Google
> > Technology User Group last week in Googleplex (intentionally not citing
> > speaker's name).
> >
> > -ay
> >
> > On Sep 14, 2012, at 1:28 PM, David Gruzman <david@bigdatacraft.com> wrote:
> >
> > > I assume that evolution of BigQuery reflects resolution of Dremel... If
> > > somebody has information on it, it would be great.
> > > Storage system should understand that all files comprising the
> > > horizontal partition of the table are one logical entity, and should
> > > store them together / in some proximity. I agree that PAX will be much
> > > more convenient. The question is - is there a performance penalty of PAX
> > > vs file per column?
> > > David
> > >
> > > On Fri, Sep 14, 2012 at 11:21 PM, Tomer Shiran <tshiran@maprtech.com> wrote:
> > >
> > >> Is there any public information suggesting that Google moved away from
> > >> supporting nested data? Clearly BigQuery doesn't yet allow nested data,
> > >> but not sure that applies to Dremel.
> > >>
> > >> There are challenges with one file per column. How do you ensure that a
> > >> single record is located on a single machine to avoid costly record
> > >> reconstruction?
> > >>
> > >> On Fri, Sep 14, 2012 at 1:05 PM, David Gruzman
> > >> <david@bigdatacraft.com> wrote:
> > >>
> > >>> Hi All,
> > >>> I would like to discuss the question of what will be the native format
> > >>> for Drill. The original Google Dremel paper defined their hierarchical
> > >>> columnar data format. Since then Google shifted from hierarchical data
> > >>> format... So it is a question if it makes sense to stick with it?
> > >>> If we are also moving to simple flat format we need our own format we
> > >>> have to support "native". In case of Drill I would define that native
> > >>> support as "high performance".
> > >>> I think we can go to some kind of PAX format with comprehensive
> > >>> metadata in the header, so each file is completely self contained and
> > >>> can be understood and processed without any external data.
> > >>> Alternative is to have single file per column. As far as I remember
> > >>> from our OpenDremel work the main decision point is - if we can read
> > >>> one column from the file without loading into node memory unnecessary
> > >>> data from other columns.
> > >>> With best regards,
> > >>> David
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Tomer Shiran
> > >> Director of Product Management | MapR Technologies | 650-804-8657
> > >>
> >
>

--485b397dd68df476a004c9b4441f--