accumulo-user mailing list archives

From Mario Pastorelli <mario.pastore...@teralytics.ch>
Subject Feedback about techniques for tuning batch scanning for my problem
Date Thu, 19 May 2016 15:08:37 GMT
Hey people,
I'm trying to tune query performance a bit to see how fast it can go,
and I thought it would be great to get comments from the community. The
problem that I'm trying to solve in Accumulo is the following: we want to
store the entities that have been in a certain location on a certain day.
The location is a Long and the entity id is a Long. I want to be able to
scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
doing the following things:

   1. I'm using a sharding byte at the start of the rowId to keep the data
   in the same range distributed across the cluster
   2. all the records are encoded; a single record is composed of
      1. rowId: 1 shard byte + 3 bytes for the day
      2. column family: 8 bytes for the long corresponding to the hash of
      the location
      3. column qualifier: 8 bytes corresponding to the identifier of the
      entity
      4. value: 2 bytes for some additional information
   3. I use a batch scanner because I don't need sorting and it's faster
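
A minimal sketch of the record layout above in plain Java, using only the standard library. The class and method names are hypothetical, and two details are assumptions rather than something stated above: the day is taken to be a count of days since the epoch stored big-endian in 3 bytes, and the shard is derived from the entity id with a simple modulo.

```java
import java.nio.ByteBuffer;

// Hypothetical helper illustrating the byte layout of one record.
final class RecordEncoding {

    // Row ID: 1 shard byte followed by the day as a 3-byte big-endian integer.
    // Assumption: "day" is a non-negative day count that fits in 24 bits.
    static byte[] rowId(int shard, int day) {
        if (shard < 0 || shard > 0xFF) {
            throw new IllegalArgumentException("shard must fit in one byte");
        }
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day must fit in three bytes");
        }
        return new byte[] {
            (byte) shard,
            (byte) (day >>> 16),
            (byte) (day >>> 8),
            (byte) day
        };
    }

    // Column family (location hash) and column qualifier (entity id): 8 bytes each.
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Value: 2 bytes of additional information.
    static byte[] value(short info) {
        return ByteBuffer.allocate(Short.BYTES).putShort(info).array();
    }

    // Assumption: the shard is chosen from the entity id; any function that
    // spreads entries evenly over the shards would serve the same purpose.
    static int shardFor(long entityId, int numShards) {
        return Math.floorMod(Long.hashCode(entityId), numShards);
    }

    public static void main(String[] args) {
        long locationHash = 123456789L;
        long entityId = 42L;
        int shard = shardFor(entityId, 8);
        int total = rowId(shard, 16940).length
                + longBytes(locationHash).length
                + longBytes(entityId).length
                + value((short) 7).length;
        System.out.println("encoded entry size in bytes: " + total); // prints 22
    }
}
```

The four pieces add up to 4 + 8 + 8 + 2 = 22 bytes per entry, not counting whatever Accumulo itself stores alongside each key (timestamp, visibility).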

As expected, it takes a few seconds to scan 1M rows, but now I'm wondering
if I can improve on that. My ideas are the following:

   1. set table.compaction.major.ratio to 1 because I don't care about the
   ingestion performance and this should improve the query performance
   2. pre-split tables to match the number of servers and then use a shard
   byte as the first byte of the rowId. This should improve both writing and
   reading the data because both should work in parallel, from what I
   understand
   3. enable bloom filters on the table
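
Idea 2 could look roughly like the sketch below, written in plain Java so it runs without a cluster. It assumes one tablet per shard, shards numbered from 0, and the 1 + 3 byte rowId described earlier; the class and method names are hypothetical. The split points it produces are the single-byte rows 1 to numShards - 1, which in Accumulo would be wrapped in `Text` and handed to `TableOperations.addSplits`. Because a given day is spread over every shard, a one-day query also needs one row range per shard in the BatchScanner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for pre-splitting a table on the shard byte.
final class ShardSplits {

    // Split points that put each shard in its own tablet: with shards
    // 0..numShards-1, splitting at the one-byte rows 1..numShards-1
    // yields numShards tablets.
    static List<byte[]> splitPoints(int numShards) {
        if (numShards < 1 || numShards > 256) {
            throw new IllegalArgumentException("numShards must be in 1..256");
        }
        List<byte[]> splits = new ArrayList<>();
        for (int shard = 1; shard < numShards; shard++) {
            splits.add(new byte[] {(byte) shard});
        }
        return splits;
    }

    // Row IDs to query for a single day: one per shard, since the entries
    // of that day are distributed over all shards.
    // Assumption: the day is a 3-byte big-endian integer after the shard byte.
    static List<byte[]> rowsForDay(int numShards, int day) {
        List<byte[]> rows = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            rows.add(new byte[] {
                (byte) shard,
                (byte) (day >>> 16),
                (byte) (day >>> 8),
                (byte) day
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("split points: " + splitPoints(4).size());      // prints 3
        System.out.println("rows per day: " + rowsForDay(4, 16940).size()); // prints 4
    }
}
```

With this layout the number of ranges per query grows with the number of shards, so matching the shard count to the number of tablet servers keeps each server busy without multiplying the lookups.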

Do you think those ideas make sense? Furthermore, I have two questions:

   1. considering that a single entry is only 22 bytes but I'm going to
   scan ~1M records per query, do you think I should change the BatchScanner
   buffers somehow?
   2. is there anything else I can do to improve the scan speed? Again, I
   don't care about the ingestion time

Thanks for the help!

-- 
Mario Pastorelli | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: mario.pastorelli@teralytics.ch
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
de Vries

This e-mail message contains confidential information which is for the sole
attention and use of the intended recipient. Please notify us at once if
you think that it may not be intended for you and delete it immediately.
