hbase-user mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: Best practice for writing to HFileOutputFormat(2) with multiple Column Families
Date Fri, 01 Aug 2014 16:51:34 GMT
You're asking whether it's more time-efficient to do a single "universal
sort" of all the data versus first grouping by cf and sorting each group
individually? That's a question more appropriate for the Spark user
list.
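
For reference, the two options can be sketched in plain Python (no Spark),
under some loud assumptions: the sample cells, the family names, and the
helper names `sort_key`, `universal_sort`, and `per_family_sort` are all
hypothetical, and Python string comparison stands in for byte-wise key
ordering. The sketch only shows that both approaches yield the same
per-family ordering; it says nothing about which is faster at 20TB.

```python
# Hypothetical sample cells: (row key, column family, qualifier, value).
cells = [
    ("row2", "cf_b", "q1", "v5"),
    ("row1", "cf_a", "q2", "v2"),
    ("row1", "cf_b", "q1", "v3"),
    ("row2", "cf_a", "q1", "v4"),
    ("row1", "cf_a", "q1", "v1"),
]


def sort_key(cell):
    """Order cells by (rk, cf, cq), ignoring the value."""
    rk, cf, cq, _ = cell
    return (rk, cf, cq)


def universal_sort(cells):
    """Option 1: sort everything once, then split the result by family."""
    by_family = {}
    for cell in sorted(cells, key=sort_key):
        by_family.setdefault(cell[1], []).append(cell)
    return by_family


def per_family_sort(cells):
    """Option 2: group by family first, then sort each group on its own."""
    groups = {}
    for cell in cells:
        groups.setdefault(cell[1], []).append(cell)
    return {cf: sorted(group, key=sort_key) for cf, group in groups.items()}


if __name__ == "__main__":
    for cf, rows in sorted(per_family_sort(cells).items()):
        print(cf, [(rk, cq) for rk, _, cq, _ in rows])
```

Since each family's cells end up in the same order either way, the choice
comes down to the cost of one large sort versus several smaller ones.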

-n


On Wed, Jul 30, 2014 at 8:01 PM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> I need to generate HFiles from a 2TB dataset and explode it into 4 Column
> Families.
>
> The result dataset is likely to be 20TB or more. I'm currently using Spark
> so I sorted the (rk, cf, cq) myself. It's huge and I'm considering how to
> optimize it.
>
> My question is:
> Should I sort and write each column family one by one, or should I put them
> all together and then sort and write?
>
> Does my question make sense?
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
