accumulo-user mailing list archives

From Sven Hodapp <sven.hod...@scai.fraunhofer.de>
Subject Re: Accumulo Seek performance
Date Wed, 31 Aug 2016 07:06:12 GMT
Hi Keith,

I've tried it with 1, 2, and 10 threads. Unfortunately, there were no significant differences.
Maybe it's a problem with the table structure? For example, a single row id (e.g.
a sentence) may have several thousand column families. Can this affect seek performance?
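
For reference, this is roughly how I varied the thread count (instance, auths,
and ranges are the same objects as in the snippet quoted below; the timing is
just a simple wall-clock measurement):

    import scala.collection.JavaConverters._

    for (numThreads <- Seq(1, 2, 10)) {
      val bscan = instance.createBatchScanner(ARTIFACTS, auths, numThreads)
      bscan.setRanges(ranges)
      val start = System.nanoTime()
      val count = bscan.asScala.size  // drain the scanner to force all seeks
      bscan.close()
      println(s"$numThreads threads: $count entries in ${(System.nanoTime() - start) / 1e9} s")
    }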

For my initial example there are about 3000 row ids to seek, which return about 500k
entries. If I filter for specific column families (e.g. a document without annotations),
only about 5k entries are returned, but the seek time is merely halved.
Are there too many column families to seek quickly?
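
The column family filter looks roughly like this (the family name here is just
a placeholder):

    import org.apache.hadoop.io.Text

    val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
    bscan.setRanges(ranges)
    // restrict the scan to a single column family; "text" is a placeholder
    bscan.fetchColumnFamily(new Text("text"))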

Thanks!

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hodapp@scai.fraunhofer.de
www.scai.fraunhofer.de

----- Original Message -----
> From: "Keith Turner" <keith@deenlo.com>
> To: "user" <user@accumulo.apache.org>
> Sent: Monday, 29 August 2016 22:37:32
> Subject: Re: Accumulo Seek performance

> On Wed, Aug 24, 2016 at 9:22 AM, Sven Hodapp
> <sven.hodapp@scai.fraunhofer.de> wrote:
>> Hi there,
>>
>> currently we're experimenting with a two-node Accumulo cluster (two tablet
>> servers) set up for document storage.
>> These documents are decomposed down to the sentence level.
>>
>> Now I'm using a BatchScanner to assemble the full document like this:
>>
>>     // the ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>>     import scala.collection.JavaConverters._  // for the .asScala conversion below
>>
>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
>>     bscan.setRanges(ranges)  // there are ~3000 Range.exact's in the ranges list
>>     for (entry <- bscan.asScala) yield {
>>       val key = entry.getKey()
>>       val value = entry.getValue()
>>       // etc.
>>     }
>>
>> For larger full documents (e.g. 3000 exact ranges), this operation takes
>> about 12 seconds.
>> But shorter documents are assembled blazingly fast...
>>
>> Is that too much for a BatchScanner, or am I misusing the BatchScanner?
>> Is that a normal time for such a (seek) operation?
>> Can I do something to get better seek performance?
> 
> How many threads did you configure the batch scanner with and did you
> try varying this?
> 
>>
>> Note: I have already enabled bloom filtering on that table.
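>>
>> For completeness, bloom filtering was enabled roughly like this (assuming
>> a Connector named conn; ARTIFACTS is the table name string):
>>
>>     conn.tableOperations().setProperty(ARTIFACTS, "table.bloom.enabled", "true")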
>>
>> Thank you for any advice!
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hodapp@scai.fraunhofer.de
>> www.scai.fraunhofer.de
