incubator-drill-dev mailing list archives

From Azuryy Yu <azury...@gmail.com>
Subject Re: Storage file format
Date Sun, 16 Sep 2012 00:07:28 GMT
There should be two separate topics here:
1) storage file format
2) DFS

because we should support loading MapReduce output data into Drill (maybe this
is the only way for Drill to load data).

For the second topic, as I mentioned in this thread, I prefer MapR DFS, which
is truly highly available (HA).

As for the first topic, we should try to find a mature open source project
and modify it to fit our needs.



On Sun, Sep 16, 2012 at 5:11 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> There is no project-wide roadmap in a real open source project.
>
> There are vision documents that various people use to try to motivate
> consensus.
>
> There are also individual roadmaps that describe what the individual
> contributors plan to do.
>
> PowerDrill-style in-memory data is definitely intriguing, and once Drill
> works and works fast on simpler structures, I would expect that somebody
> would be interested in implementing it.
>
> Perhaps that would be you?
>
> On Sat, Sep 15, 2012 at 10:16 AM, Tsuyoshi OZAWA
> <ozawa.tsuyoshi@gmail.com> wrote:
>
> > Hello,
> >
> > Is there a roadmap to support in-memory index and storage like
> > PowerDrill? It's one kind of storage, though its format is different
> > from the columnar storage format in the Dremel paper, as you mentioned.
> >
> > IMO, in-memory index and storage are very useful for analysis with
> > a small cluster.
> >
> > Thanks,
> > - Tsuyoshi
> >
> > On Sun, Sep 16, 2012 at 2:02 AM, Dharm Raj <dharmrajbaliyan@gmail.com>
> > wrote:
> > > You are right, Camuel. While thinking about the storage format I was
> > > thinking about append. "Update" was misplaced.
> > >
> > > On Sat, Sep 15, 2012 at 9:49 PM, Camuel Gilyadov <camuel@gmail.com>
> > > wrote:
> > >
> > >> Drill doesn't support updates. It is an append-only data store, and an
> > >> append is usually expected to be a nice data chunk, not a single row.
> > >>
> > >> On Sat, Sep 15, 2012 at 8:09 AM, Dharm Raj <dharmrajbaliyan@gmail.com>
> > >> wrote:
> > >>
> > >> > For columnar storage, IMO each column can be managed in a separate
> > >> > file. Dremel also seems to have each column in a separate file. This
> > >> > should be easy to manage, and updates are possible. Please see
> > >> > https://issues.apache.org/jira/browse/AVRO-806
> > >> >
> > >> > The Drill architecture slides show AVRO-806 and Trevni in the column
> > >> > storage box. Are we looking at them as candidates for Drill's storage
> > >> > format?
> > >> >
> > >> > If we have a lot of highly sparse data and the major use case is
> > >> > read-only access once the data is written, another way could be to
> > >> > store it in a column-major sparse matrix format. It looks easy to
> > >> > implement, but updates may be problematic. Just a thought.
> > >> >
> > >> > Regards,
> > >> > Dharm
> > >> >
> > >> > On Sat, Sep 15, 2012 at 7:24 PM, NAVEEN MAANJU <
> > >> > naveen.maanju.apache@gmail.com> wrote:
> > >> >
> > >> > > Makes sense.
> > >> > >
> > >> > > On Sat, Sep 15, 2012 at 6:44 AM, Ted Dunning <ted.dunning@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > The key goal here is to get something simple working quickly in
> > >> > > > a way that allows additional, more advanced implementations.
> > >> > > >
> > >> > > > On Sat, Sep 15, 2012 at 5:47 AM, moon soo Lee <leemoonsoo@gmail.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > For column storage, how about leveraging HBase or Accumulo?
> > >> > > > >
> > >> > > > > They would also give us a chance to support data updates
> > >> > > > > (future work?)
> > >> > > > >
> > >> > > > >
> > >> > > > > On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <azuryyyu@gmail.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi All,
> > >> > > > > >
> > >> > > > > > I am interested in working on the storage format. (sign up?)
> > >> > > > > >
> > >> > > > > > I wrote an HDFS file format, which is similar to SequenceFile
> > >> > > > > > (row storage, block management, compression). I provide an
> > >> > > > > > InputFormat and an OutputFormat.
> > >> > > > > >
> > >> > > > > > Sometimes it gets great performance, sometimes not, depending
> > >> > > > > > on the data.
> > >> > > > > >
> > >> > > > > > For Drill, we should implement column storage, which can skip
> > >> > > > > > some columns during a query and skip some rows within one
> > >> > > > > > column file. But this column storage should be based on a
> > >> > > > > > distributed file system, such as HDFS or MapR DFS. I like
> > >> > > > > > MapR DFS because of its HA.
> > >> > > > > >
> > >> > > > > > We can implement the following column storage file format; I
> > >> > > > > > think it's enough for us.
> > >> > > > > >
> > >> > > > > > http://arxiv.org/pdf/1105.4252.pdf
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>
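
The column-storage idea discussed in this thread (one file per column, so a
query can skip the columns it does not need) can be sketched roughly as
follows. This is only an illustration under stated assumptions: the function
names, the ".col" file extension, and the one-JSON-value-per-line encoding are
hypothetical, a local temporary directory stands in for a distributed file
system, and none of this is the format of Drill, Dremel, Trevni, or the
linked paper.

```python
import json
import os
import tempfile


def write_columns(directory, rows):
    """Write each column of `rows` (a list of dicts) to its own file.

    Hypothetical layout: <directory>/<column>.col, one JSON value per line.
    """
    columns = sorted({key for row in rows for key in row})
    for name in columns:
        path = os.path.join(directory, name + ".col")
        with open(path, "w") as f:
            # A field missing from a row is stored as null to keep rows aligned.
            for row in rows:
                f.write(json.dumps(row.get(name)) + "\n")
    return columns


def read_columns(directory, wanted):
    """Read only the requested column files; other columns are never opened."""
    result = {}
    for name in wanted:
        path = os.path.join(directory, name + ".col")
        with open(path) as f:
            result[name] = [json.loads(line) for line in f]
    return result


def demo():
    rows = [
        {"user": "a", "clicks": 3, "country": "JP"},
        {"user": "b", "clicks": 5},
        {"user": "c", "clicks": 1, "country": "US"},
    ]
    with tempfile.TemporaryDirectory() as d:
        write_columns(d, rows)
        # A query that sums clicks touches only the "clicks" column file.
        data = read_columns(d, ["clicks"])
        return sum(data["clicks"])
```

Calling demo() writes three column files and then reads just one of them to
compute the sum. Skipping rows within a column file, as also suggested in the
thread, would need extra per-block metadata that this sketch leaves out.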
