hawq-user mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: hawq performance on 10 billion rows table
Date Fri, 29 Jan 2016 16:56:51 GMT
billions with 'B': looks like MongoDB is web-scale after all!

On Fri, Jan 29, 2016 at 11:59 AM, Alexey Grishchenko wrote:
> The main thing to consider is that HAWQ does not have indexes, so the
> only way to limit the amount of data it scans is to use partitioning plus
> columnar tables (Parquet).
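> As a rough sketch of what that can look like in HAWQ DDL (assuming a
> table t with columns a, b, c and partitioning on a; all names here are
> illustrative):
> 
>   -- illustrative schema: columnar Parquet storage + list partitioning on a
>   CREATE TABLE t (a int, b varchar, c varchar)
>   WITH (appendonly=true, orientation=parquet)
>   DISTRIBUTED RANDOMLY
>   PARTITION BY LIST (a)
>   (PARTITION p1 VALUES (1),
>    PARTITION p2 VALUES (2),
>    DEFAULT PARTITION other);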
> In contrast, Greenplum has indexes, and if your query returns hundreds of
> records from a 10,000,000,000-row table, that might be a good thing for
> you. But you should be careful here: if you have "where" conditions on
> different columns, you might end up building many indexes, which can lead
> to a situation where the index size for the table is greater than the size
> of its data.
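> On the Greenplum side that would be plain CREATE INDEX statements, one
> per filtered column (a sketch with hypothetical index names):
> 
>   -- each index serves one "where" column, at extra storage and write cost
>   CREATE INDEX t_a_idx ON t (a);
>   CREATE INDEX t_b_idx ON t (b);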
> 
> On Fri, Jan 29, 2016 at 10:26 AM, 陶进 <tonytao0505@outlook.com> wrote:
> 
> > hi Martin,
> >
> > Many thanks for your kind help.
> >
> > I could find few performance cases for Greenplum/HAWQ on Google,
> > especially on 10-billion-row data. Your reply inspires confidence in
> > me. :-)
> >
> > Our real-time query only returns hundreds of rows from a huge table. I'll
> > test and tune HAWQ after our machines are available to verify the
> > performance.
> >
> > Thank you again for your prompt reply.
> >
> >
> > Best regards!
> >
> > Tony.
> >
> >
> >
> > On 2016/1/29 17:29, Martin Visser wrote:
> >
> > Hi,
> >
> > for queries like that there are a couple of HAWQ features that will help
> > you. One is columnar storage such as Parquet. This helps when you are
> > only selecting columns a, b, c and the table has columns a, b, ... z.
> > The other feature that will help you is partitioning, which reduces the
> > initial set without having to read the data. How to choose the
> > partitioning will depend on your query patterns and the selectivity of
> > the column values. For example, in your query you could partition on
> > column a. But as mentioned, if a only had the values 1 and 2, that would
> > only halve the number of rows being scanned.
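> > One way to check that partition elimination actually kicks in is to
> > look at the query plan (a sketch, reusing the query from this thread):
> >
> >   EXPLAIN SELECT a, b, c FROM t
> >   WHERE a = 1 AND b = 'hello'
> >   ORDER BY 1, 2;
> >   -- the plan should scan only the partition(s) matching a = 1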
> >
> > Another observation is that you are selecting individual rows in your
> > example rather than grouped results. Potentially this could result in a lot
> > of data having to be returned by the query.  Is that the case?  How many
> > rows would you expect queries to return?
> >
> > As for your 10 seconds: it is certainly possible thanks to HAWQ's linear
> > scalability, but it depends on a number of factors.
> >
> > hth
> > Martin
> >
> > On Fri, Jan 29, 2016 at 5:34 AM, 陶进 <tonytao0505@outlook.com> wrote:
> >
> >> hi guys,
> >>
> >> We have several huge tables, and some of them have more than 10 billion
> >> rows. Each table has the same columns, and each row is about 100 bytes.
> >>
> >> Our queries run against a single table to filter and sort some records,
> >> such as: select a,b,c from t where a=1 and b='hello' order by 1,2.
> >>
> >> Now we use MongoDB, and the biggest table has 4 billion rows; it can
> >> return in 10 seconds. Now we want to use HAWQ as our query engine. Could
> >> it run the above query in 10 seconds? What hardware would the servers
> >> need? How many nodes would be required?
> >>
> >>
> >> Thanks.
> >>
> >
> 
> 
> 
> -- 
> Best regards,
> Alexey Grishchenko
