community-dev mailing list archives

From 子落 <yannian...@taobao.com>
Subject Introducing the mdrill project (open source; may be helpful for Apache Drill development)
Date Mon, 12 Aug 2013 09:25:46 GMT
The project is hosted at https://github.com/alibaba/mdrill. I think some of
its information or design may be helpful for Apache Drill development.

 

mdrill is similar to Apache Drill or Google PowerDrill. It is based on
Hadoop, Lucene, Solr, and JStorm.

 

My project currently has 10 tables, 47,760,506,482 rows, and 80 to 400
columns. It runs on 10 machines, each with 48 GB of RAM and 12 x 2 TB disks.
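
To put the cluster size in perspective, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above; the even split of rows across machines is an assumption for illustration, not something stated about how mdrill shards data.

```python
# Figures copied from the message above.
TOTAL_ROWS = 47_760_506_482
MACHINES = 10
RAM_GB_PER_MACHINE = 48
DISKS_PER_MACHINE = 12
DISK_TB = 2


def cluster_summary():
    """Derive aggregate cluster figures (assumes rows are spread evenly)."""
    total_ram_gb = MACHINES * RAM_GB_PER_MACHINE
    return {
        "rows_per_machine": TOTAL_ROWS / MACHINES,
        "total_ram_gb": total_ram_gb,
        "raw_disk_tb_per_machine": DISKS_PER_MACHINE * DISK_TB,
        "total_raw_disk_tb": MACHINES * DISKS_PER_MACHINE * DISK_TB,
        "rows_per_gb_ram": TOTAL_ROWS / total_ram_gb,
    }


if __name__ == "__main__":
    for name, value in cluster_summary().items():
        print(f"{name}: {value:,.1f}")
```

The rows-per-GB figure suggests that the full data set is far larger than the cluster's memory, so most of it has to be served from disk.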

 

Some example queries are shown below:

 

select count(*) from r_rpt_cps_luna_item where thedate >='20130416' and
thedate <'20130811' limit 0,100

totalRecords:1

count(*)
11108914892

times taken 4.031 seconds

 

 

select sum(landing_uv) from r_rpt_cps_luna_item where thedate >='20130416'
and thedate <'20130811' limit 0,100

totalRecords:1

sum(landing_uv)
2.07678497E8

times taken 56.081 seconds

 

select dist(user_id) from r_rpt_cps_luna_item where thedate >='20130416' and
thedate <'20130811' limit 0,100

totalRecords:1

dist(user_id)
1483008.0

times taken 246.147 seconds

 

select thedate,count(*) as cnt from r_rpt_cps_luna_item where thedate
>='20130416' and thedate <'20130811' group by thedate order by cnt desc
limit 0,3

totalRecords:118

thedate     cnt
20130803    158301304
20130802    157748487
20130725    157047045

times taken 34.727 seconds

 

select thedate,user_id,count(*) as cnt from r_rpt_cps_luna_item where
thedate >='20130416' and thedate <'20130811' group by thedate,user_id order
by cnt desc limit 0,3

totalRecords:10010

thedate     user_id      cnt
20130725    725677994    194397
20130725    101450072    192650
20130701    101450072    189107

times taken 149.316 seconds

 

select thedate,category_level1,count(*) as cnt from r_rpt_cps_luna_item
where thedate >='20130416' and thedate <'20130811' group by
thedate,category_level1 order by cnt desc limit 0,3

totalRecords:10010

thedate     category_level1    cnt
20130803    16                 26487658
20130802    16                 26306163
20130725    16                 26128576

times taken 94.989 seconds

 

select thedate,category_level1,category_level2,count(*) as cnt from
r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811'
group by thedate,category_level1,category_level2 order by cnt desc limit 0,3

totalRecords:10010

thedate     category_level1    category_level2    cnt
20130725    16                 50010850           7315606
20130803    16                 50010850           7006255
20130802    16                 50010850           6936059

times taken 288.885 seconds
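
One rough way to compare these timings is to convert them into rows per second. The Python sketch below assumes that every query covers the same 11,108,914,892 rows that count(*) reports for this date range. That is an assumption for illustration: the message does not say how many rows each query actually reads, and indexes or caches may let some queries skip work.

```python
# Rows matched by the date filter, taken from the count(*) result above.
ROWS_IN_RANGE = 11_108_914_892

# Wall-clock times in seconds, copied from the "times taken" lines above.
TIMINGS = {
    "count(*)": 4.031,
    "sum(landing_uv)": 56.081,
    "dist(user_id)": 246.147,
    "group by thedate": 34.727,
    "group by thedate,user_id": 149.316,
    "group by thedate,category_level1": 94.989,
    "group by thedate,category_level1,category_level2": 288.885,
}


def rows_per_second(rows, seconds):
    """Average number of rows covered per second of wall-clock time."""
    return rows / seconds


def throughput_table(rows=ROWS_IN_RANGE, timings=TIMINGS):
    """Map each query label to its estimated rows-per-second rate."""
    return {query: rows_per_second(rows, t) for query, t in timings.items()}


if __name__ == "__main__":
    for query, rate in throughput_table().items():
        print(f"{query:<50} {rate / 1e6:>10.1f} M rows/s")
```

Under that assumption, count(*) is by far the cheapest operation, while the distinct count and the three-column group-by are the most expensive.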

 

 

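Introduction
1: mdrill aims to help users analyze data on the order of 10 billion rows,
across arbitrary combinations of dimensions, within a few seconds to a few
tens of seconds.
2: mdrill is a distributed online analytical query system. It is built on
open-source systems such as Hadoop, Lucene, Solr, and JStorm, and uses an
SQL-based query syntax. mdrill is a software framework for distributed
processing of large volumes of data. mdrill is fast: its lower layers use
indexes, columnar storage, and in-memory caching, which greatly increase
data scan speed. mdrill is distributed: it works in parallel and speeds up
processing through parallel execution.
3: The adhoc project, an application built on mdrill, uses 10 machines and
stores 40 billion records.
  ==> Each query scans 3 billion rows, with response times of roughly 20 to
120 seconds (depending on the query conditions and the number of columns
scanned).
  ==> count(*) over 10 billion rows takes 2 seconds; a single-column sum
takes 25 seconds; grouping by date with count and sum takes 47 seconds;
grouping by user id, sorting by number of transactions, and taking the top N
takes 243 seconds.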
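
The parallel processing described in point 2 can be illustrated with a generic scatter-gather sketch in Python. This is not mdrill's actual code, which this message does not describe. The function names, the in-memory "shards", and the sample rows are all hypothetical. Each shard computes a partial count per group, and the partial results are merged before the top N is taken.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_by_key(shard, key):
    """Partial aggregation: count rows per key value within one shard."""
    return Counter(row[key] for row in shard)


def distributed_group_count(shards, key, top_n):
    """Count per shard in parallel, merge the partials, return the top N."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda shard: count_by_key(shard, key), shards))

    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts for keys seen on several shards

    # Order by count descending, then by key so that ties are deterministic.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical sample data standing in for two shards of a table.
    shards = [
        [{"thedate": "20130801"}, {"thedate": "20130802"}],
        [{"thedate": "20130802"}, {"thedate": "20130803"}, {"thedate": "20130802"}],
    ]
    print(distributed_group_count(shards, "thedate", top_n=2))
```

The sketch merges the complete partial counts before truncating. Taking the top N on each shard first would be cheaper, but it can give wrong results when a group is spread across many shards.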

