accumulo-dev mailing list archives

From Adam Fuchs
Subject bulk load architecture
Date Mon, 15 Aug 2016 17:49:06 GMT
I've been looking through the bulk load code lately because of some
performance issues a customer of ours is experiencing, and I'm perplexed by
a couple of things. Between o.a.a.master.tableOps.LoadFiles and
o.a.a.server.client.BulkImporter we have 4 thread pools that are used in
bulk load. It seems like only the master thread pool gets any parallelism,
because we always send one file at a time to the tservers (LoadFiles:154).
Are the three thread pools in the tserver vestigial? Did we use to send
bigger batches to the tservers and find that one at a time was better?
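
To make the pattern described above concrete, here is a minimal sketch of a
master-side pool that submits one file per request. It is not the actual
LoadFiles or BulkImporter code: the class BulkLoadSketch, the methods
loadFiles and sendToTserver, and the file names are all hypothetical, and
the "tserver" is simulated by recording how many files each request carries.
Under those assumptions, it shows that any pool on the receiving side only
ever sees a batch of size one, so it would have nothing to parallelize.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BulkLoadSketch {

    // Hypothetical stand-in for the RPC to a tserver: records how many
    // files arrived in each request instead of loading anything.
    static void sendToTserver(List<String> files, Queue<Integer> batchSizes) {
        batchSizes.add(files.size());
    }

    // Submits one task per file to a master-side pool and returns the
    // batch size seen by each simulated tserver request.
    static List<Integer> loadFiles(List<String> files, int masterThreads) {
        Queue<Integer> batchSizes = new ConcurrentLinkedQueue<>();
        ExecutorService masterPool = Executors.newFixedThreadPool(masterThreads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String file : files) {
                // One file per request, as the message describes.
                List<String> singleFile = Collections.singletonList(file);
                pending.add(masterPool.submit(() -> sendToTserver(singleFile, batchSizes)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every request to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            masterPool.shutdown();
        }
        return new ArrayList<>(batchSizes);
    }

    public static void main(String[] args) {
        List<Integer> sizes =
            loadFiles(java.util.Arrays.asList("a.rf", "b.rf", "c.rf"), 2);
        System.out.println("requests=" + sizes.size() + " sizes=" + sizes);
    }
}
```

In this sketch the only tunable parallelism is masterThreads, which is the
observation the questions above rest on.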

Seems like we could greatly simplify the tserver portion of the bulk load.
Can anybody think of why that might not be a good idea?

Also, has anybody optimized the pool sizes for multiple concurrent large
bulk loads, and do you have suggestions on what settings to use (i.e.
master.fate.threadpool.size and master.bulk.threadpool.size)?

