hbase-user mailing list archives

From Jerry He <jerry...@gmail.com>
Subject Re: HFile vs Parquet for very wide table
Date Fri, 22 Jan 2016 22:52:57 GMT
Parquet may be more efficient in your use case, coupled with an upper-layer
query engine.
But Parquet has a schema. The schema can evolve, though, e.g. by adding
columns in new Parquet files.
HBase would be able to do the job too, and it is schema-less -- you can add
columns freely.
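
As an illustration of the schema-evolution point, here is a minimal Spark
sketch (the paths and column names are invented): older Parquet files carry
only two columns, a newer file adds a third, and reading with mergeSchema
reconciles them into one wider schema.

    import org.apache.spark.sql.SparkSession

    object ParquetSchemaEvolution {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-schema-merge").getOrCreate()
        import spark.implicits._

        // Older files carry only (id, c1); a newer file adds c2.
        Seq((1L, 0.1)).toDF("id", "c1").write.parquet("/tmp/matrix/batch=1")
        Seq((2L, 0.2, 0.3)).toDF("id", "c1", "c2").write.parquet("/tmp/matrix/batch=2")

        // mergeSchema reconciles the per-file schemas into one wider schema;
        // rows from the older files show null for the new column c2.
        val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/matrix")
        merged.printSchema()

        spark.stop()
      }
    }

The HBase side of the same point needs no schema step at all: a Put that
names a never-before-seen qualifier simply creates that column for that row.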

Jerry

On Fri, Jan 22, 2016 at 10:04 AM, Krishna <research800@gmail.com> wrote:

> Thanks Ted, Jerry.
>
> Computing pairwise similarity is the primary purpose of the matrix. This is
> done by extracting all rows for a set of columns at each iteration.
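
A rough sketch of that access pattern with Spark MLlib, assuming the matrix
is stored as Parquet with double-typed columns (the path and column names
below are made up): a batch of columns is projected out and fed to
RowMatrix.columnSimilarities(), which returns cosine similarities between
the selected columns.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.sql.SparkSession

    object PairwiseColumnSimilarity {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pairwise-similarity").getOrCreate()

        // One iteration's batch of columns; only these are read from the
        // Parquet files thanks to column pruning.
        val batch = Seq("c00017", "c00542", "c31337")

        val rows = spark.read.parquet("/data/matrix")
          .select(batch.head, batch.tail: _*)
          .rdd
          .map(r => Vectors.dense(batch.indices.map(r.getDouble).toArray))

        // Cosine similarity between every pair of the selected columns.
        val sims = new RowMatrix(rows).columnSimilarities()
        sims.entries.take(10).foreach(println)

        spark.stop()
      }
    }

Each iteration would swap in a different batch of column names and repeat
the projection and similarity step.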
>
> On Thursday, January 21, 2016, Jerry He <jerryjch@gmail.com> wrote:
>
> > What do you want to do with your matrix data?  How do you want to use it?
> > Do you need random read/write or point queries?  Do you need to get a
> > whole row/record or many, many columns at a time?
> > If yes, HBase is a good choice for you.
> > Parquet is good as a storage format for large scans and aggregations on a
> > limited number of specific columns -- analytical types of work.
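
For comparison, a point query in HBase looks roughly like the sketch below
(the table name, column family, and qualifier are hypothetical): one row key,
one or a few qualifiers, no scan involved.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    object PointRead {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("matrix"))

        // Fetch one row by key, restricted to a single cell -- no scan.
        val get = new Get(Bytes.toBytes("row-42"))
          .addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_000017"))
        val result = table.get(get)
        val cell = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("col_000017"))
        if (cell != null) println(Bytes.toDouble(cell))

        table.close()
        conn.close()
      }
    }

The Parquet/Spark analogue of the analytical side would be a column-pruned
scan, e.g. spark.read.parquet("/data/matrix").select("c1", "c2") followed by
an aggregation over all rows.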
> >
> > Jerry
> >
> >
> >
> >
> > On Thu, Jan 21, 2016 at 3:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > I have very limited knowledge of Parquet, so I can only answer from the
> > > HBase point of view.
> > >
> > > Please see the recent thread on the number of columns in a row in HBase:
> > >
> > > http://search-hadoop.com/m/YGbb3NN3v1jeL1f
> > >
> > > There are a few Spark HBase connectors.
> > > See this thread:
> > >
> > > http://search-hadoop.com/m/q3RTt4cp9Z4p37s
> > >
> > > Sorry, I cannot answer the performance comparison question.
> > >
> > > Cheers
> > >
> > > On Thu, Jan 21, 2016 at 2:43 PM, Krishna <research800@gmail.com> wrote:
> > >
> > > > We are evaluating Parquet and HBase for storing a dense & very, very
> > > > wide matrix (it can have more than 600K columns).
> > > >
> > > > I have the following questions:
> > > >
> > > >    - Is there a limit on the # of columns in Parquet or HFile? We
> > > >    expect to query [10-100] columns at a time using Spark - what are
> > > >    the performance implications in this scenario?
> > > >    - HBase can support millions of columns - anyone with prior
> > > >    experience that compares Parquet vs HFile performance for wide
> > > >    structured tables?
> > > >    - We want a schema-less solution since the matrix can get wider
> > > >    over a period of time.
> > > >    - Is there a way to generate wide, structured, schema-less Parquet
> > > >    files using map-reduce (input files are in a custom binary format)?
> > > >
> > > > What solutions other than Parquet & HBase are useful for this
> > > > use-case?
> > > >
> > >
> >
>
