accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: scan command hung
Date Tue, 06 Oct 2015 19:22:45 GMT


z11373 wrote:
> Thanks Billie/Josh! That's indeed fixing the issue, the scan now returns
> instantly!!
>
> So when we scan the whole table and filtering by column family, Accumulo
> still has to go through all rows (ordered by the key), and check if the
> particular item has specific column family, and in my case since they are
> intermingled, the data I am looking for could be somewhere in the middle or
> in the end of the rfile, am I right?
>
> I did another experiment, if I specify -b and -e, then it also returned
> instantly (this before I moved them to different group and compact), which
> does make sense, because Accumulo could narrow down to specific ranges, and
> then filter them by column family.
>
> I have another follow up question, does it mean I have to create new
> locality group for each column family since I wouldn't know how big/small
> the data belong to that cf in advance?
>
> Btw, we shard the customers by putting their id as column family, so we'll
> add new column family whenever new customer onboard. I think the case which
> we have to scan the table with cf without specifying ranges may be rare (or
> perhaps never, except if I run it from shell), but I am worried if this can
> become perf bottleneck if I don't set them to separate locality group.

This strikes me as very odd. Sharding is the process of distributing
some data set across multiple nodes. The only way this is done in
Accumulo is by the row, not the column family. If you want fast
point lookups by customer, you'd want the customer ID in the row. If
that's a non-starter for some reason, this is a case where you'd want to
implement a secondary index (usually as a separate table) that does have
the customer ID in the row, which then points to the row+colfam in your
"data" table.

e.g. say your data is sharded/hashed/whatever by date.

20151006_1 cust_id_1:attr1 => value
20151006_1 cust_id_1:attr2 => value

You would make a second table which has something like

cust_id_1 : => 20151006_1

Where you have an empty colfam/colqual. There are ways you could also
use these extra fields to perform additional filtering.
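
A rough sketch of that two-table lookup, written as a pure-Python model
rather than Accumulo client code. The sorted lists only stand in for
tables; the names data_table, index_table, scan_row and lookup_customer
are made up for illustration, and the extra customers and rows are
placeholder data.

```python
from bisect import bisect_left

# Each "table" is a list of (row, colfam, colqual, value) tuples kept in
# sorted order, which loosely mimics how Accumulo orders keys.
data_table = sorted([
    ("20151006_1", "cust_id_1", "attr1", "value"),
    ("20151006_1", "cust_id_1", "attr2", "value"),
    ("20151006_1", "cust_id_2", "attr1", "value"),
    ("20151006_2", "cust_id_3", "attr1", "value"),
])

# Index table: customer ID in the row, empty colfam/colqual, and the
# data-table row as the value.
index_table = sorted([
    ("cust_id_1", "", "", "20151006_1"),
    ("cust_id_2", "", "", "20151006_1"),
    ("cust_id_3", "", "", "20151006_2"),
])


def scan_row(table, row):
    """Seek to the first entry of `row`, then read until the row changes."""
    start = bisect_left(table, (row,))
    out = []
    for entry in table[start:]:
        if entry[0] != row:
            break
        out.append(entry)
    return out


def lookup_customer(cust_id):
    """Resolve the customer through the index, then read the data table."""
    results = []
    for _, _, _, data_row in scan_row(index_table, cust_id):
        for entry in scan_row(data_table, data_row):
            if entry[1] == cust_id:  # keep only this customer's colfam
                results.append(entry)
    return results
```

Both steps are seeks on the row, so neither one has to walk the whole
table to find a single customer.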

Ultimately, locality groups are meant to have coarse grouping of "types 
of data" together rather than quick random access over an entire 
dataset. Does that make sense?
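
To make the coarse-grouping point concrete, here is a simplified
pure-Python model (not Accumulo code). It assumes a table with one
large column family and one rare one, and only counts how many entries
a colfam-filtered scan has to read with and without a separate group.
The family names "bulk" and "meta", the group name "small", and the
helper functions are all hypothetical, and real RFiles are more
involved than a list per group.

```python
def scan_by_colfam(entries, colfam):
    """Read every entry in order, keeping those in `colfam`.

    Returns (matches, number of entries examined).
    """
    examined = 0
    matches = []
    for entry in entries:
        examined += 1
        if entry[1] == colfam:
            matches.append(entry)
    return matches, examined


def partition(entries, groups):
    """Split entries into one store per locality group plus a default store."""
    stores = {name: [] for name in groups}
    stores["default"] = []
    for entry in entries:
        for name, families in groups.items():
            if entry[1] in families:
                stores[name].append(entry)
                break
        else:
            stores["default"].append(entry)
    return stores


# 1000 "bulk" entries intermingled (in key order) with a single "meta" entry.
entries = sorted(
    [(f"row{i:04d}", "bulk", "q", "v") for i in range(1000)]
    + [("row0999", "meta", "q", "v")]
)

# No locality groups: the filter has to look at every entry.
matches_without, examined_without = scan_by_colfam(entries, "meta")

# With "meta" in its own group, only that group's store is read.
stores = partition(entries, {"small": {"meta"}})
matches_with, examined_with = scan_by_colfam(stores["small"], "meta")
```

A group per "type of data" keeps the rare family cheap to scan, but it
does nothing to help find one customer among many inside a group.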

> Another question, when running setgroups command, it looks like I have to
> set for all of them, even I just added new cf. For example, let say I did:
> setgroups mygroup=cf1,cf2 -t mytable
> compact -t mytable -w
>
> Then later I need to add cf3 to the same group, I have to do "setgroups
> mygroup=cf1,cf2,cf3 -t mytable", instead of just "setgroups mygroup=cf3 -t
> mytable"
>
> It'd be nice if I can do the latter :-) What happens with cf1 and cf2 if I
> did the latter, does it mean they are coming back to default group again
> after compaction?
>
>
> Thanks,
> Z
>
>
>
>
> --
> View this message in context: http://apache-accumulo.1065345.n5.nabble.com/scan-command-hung-tp15286p15324.html
> Sent from the Developers mailing list archive at Nabble.com.
