kylin-dev mailing list archives

From "蒋旭" <jiangxu.ch...@qq.com>
Subject Re: a few slides for Strata + Hadoop World London 2015
Date Sun, 03 May 2015 09:14:01 GMT
Since the grid table splits data blocks by time range, the number of records in a block is closely
correlated with the block size. So, we have to keep blocks small to bound the records scanned per
block. I suggest using a coarse-granularity timestamp (e.g. 1 day) to split blocks and a
fine-granularity timestamp (e.g. 1 minute) to split records.


The HBase rowkey can be "coarse-granularity timestamp" + "dimension:value" + "fine-granularity
timestamp". For a given query, we can first decide how many blocks (i.e. "coarse-granularity
timestamp") need to be accessed randomly; then we can scan each block sequentially (i.e. by
"dimension:value" + "fine-granularity timestamp").



We can enhance the grid table: use a timestamp prefix to split blocks and a timestamp postfix
to split records.
 
------------------ Original Message ------------------
From: "蒋旭" <jiangxu.china@qq.com>
Sent: Sunday, May 3, 2015, 10:41 AM
To: "Li Yang" <liyang@apache.org>
Cc: "dev" <dev@kylin.incubator.apache.org>
Subject: Re: a few slides for Strata + Hadoop World London 2015



Hi Yang,
I have some questions about this deck.


1. Basically, the grid table is "split data blocks by time" + "secondary block index", which is
more suitable for an inverted index than for a data cube. Since a data cube is a multi-dimensional
array and the timestamp is just one dimension, it is difficult to split the cube into blocks by timestamp.

2. The grid table is more suitable for small data in memory and unsuitable for large data
on disk. When the data size is very large, we have to keep data blocks large. For a frequent
term that appears in most data blocks, we end up scanning almost all blocks.


3. For HBase, the key optimization is to reduce the scan range or skip parts of it.
The grid table uses "full timestamp + term" as the rowkey plus a secondary block index, which has
to scan a large range for a query over a big time range. I suggest adopting "coarse timestamp + term
+ fine timestamp" as the rowkey design, which is more effective at reducing and skipping scan ranges.


4. What about the design of TopN queries on an ultra-high-cardinality dimension? Will we support
them in both the inverted index and the data cube? Do we keep the "ultra-high-cardinality dimension"
as a dimension or as a metric?
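For context on the "as metric" option, here is a hedged sketch of a bounded, approximate top-N counter in the Space-Saving style. It is only meant to illustrate how memory can stay bounded on an ultra-high-cardinality dimension; it is not taken from any Kylin design:

import java.util.HashMap;
import java.util.Map;

public class TopNCounter {
    private final int capacity;                       // max tracked items (e.g., 100 * N)
    private final Map<String, Long> counts = new HashMap<>();

    TopNCounter(int capacity) { this.capacity = capacity; }

    void offer(String item) {
        if (counts.containsKey(item) || counts.size() < capacity) {
            counts.merge(item, 1L, Long::sum);        // tracked, or room left: just count
        } else {
            // Evict the current minimum and inherit its count, per Space-Saving.
            String minKey = null;
            long min = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < min) { min = e.getValue(); minKey = e.getKey(); }
            }
            counts.remove(minKey);
            counts.put(item, min + 1);
        }
    }
}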


Thanks
JiangXu




------------------ Original Message ------------------
From: "Li Yang" <liyang@apache.org>
Sent: Saturday, May 2, 2015, 8:20 AM
To: "dev" <dev@kylin.incubator.apache.org>

Subject: a few slides for Strata + Hadoop World London 2015



Hi Luke


I created a few slides for Strata + Hadoop World London 2015 next week; see attached. Let's
see how they merge with the previous deck.


Some of them should be attached to the related JIRAs as design docs. I'll do that later.


Cheers

Yang