accumulo-user mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: Bulk Ingest
Date Fri, 17 Jun 2016 03:40:35 GMT
> 1) Avoid lots of small files. Target files as large as you can, relative
> to your ingest latency requirements and your max file size (set on your
> instance or table)

If you're using Spark to produce the RFiles, one trick for this is to call
coalesce() on your RDD to reduce the number of RFiles that are written to
HDFS.
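
Roughly, assuming the job produces a pair RDD of Accumulo Key/Value
entries, that looks something like this (the method name, output path and
file count are placeholders):

    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.rdd.RDD

    // Each output partition becomes one RFile, so collapsing many small
    // partitions into a few large ones yields fewer, larger files.
    def writeRFiles(kvs: RDD[(Key, Value)], outputDir: String, numFiles: Int): Unit = {
      kvs
        .coalesce(numFiles) // cap the number of files written
        .sortByKey()        // RFile entries must be written in sorted Key order
        .saveAsNewAPIHadoopFile(
          outputDir,
          classOf[Key],
          classOf[Value],
          classOf[AccumuloFileOutputFormat],
          new Configuration())
    }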

> 2) Avoid having to import one file to multiple tablets.

This is huge. Again, if you're using Spark, you must not use the
HashPartitioner to partition your RDDs, or you'll wind up in a situation
where every tablet owns a piece of every RFile. Ideally you would use
something like the GroupedKeyPartitioner[1] to align the RDD partitions
with the tablet splits, but even the built-in RangePartitioner will be much
better than the HashPartitioner.
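
For illustration, a split-aligned partitioner might look something like
this rough sketch (the class name is made up; listSplits and
repartitionAndSortWithinPartitions are the real calls):

    import java.util.{Arrays, Comparator}
    import org.apache.accumulo.core.data.Key
    import org.apache.hadoop.io.Text
    import org.apache.spark.Partitioner

    // One Spark partition per tablet: splits s0 < s1 < ... define the
    // tablet ranges (-inf, s0], (s0, s1], ..., (s_last, +inf).
    class TabletAlignedPartitioner(splits: Array[Text]) extends Partitioner {
      private val cmp = new Comparator[Text] {
        override def compare(a: Text, b: Text): Int = a.compareTo(b)
      }
      override def numPartitions: Int = splits.length + 1
      override def getPartition(key: Any): Int = {
        val row = key.asInstanceOf[Key].getRow
        val i = Arrays.binarySearch(splits, row, cmp)
        if (i >= 0) i else -i - 1 // not found: index of first split > row
      }
    }

    // Fetch the table's current splits (a live Connector "conn" is assumed)
    // and shuffle/sort so each partition maps to exactly one tablet:
    //   import scala.collection.JavaConverters._
    //   val splits = conn.tableOperations().listSplits("mytable").asScala.toArray
    //   val aligned = kvs.repartitionAndSortWithinPartitions(
    //     new TabletAlignedPartitioner(splits))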

-Russ

On Thu, Jun 16, 2016 at 7:24 PM Josh Elser <josh.elser@gmail.com> wrote:

> There are two big things that are required to really scale up bulk
> loading. Sadly (I guess) they are both things you would need to
> implement on your own:
>
> 1) Avoid lots of small files. Target files as large as you can,
> relative to your ingest latency requirements and your max file size (set
> on your instance or table)
>
> 2) Avoid having to import one file to multiple tablets. Remember that
> the majority of the metadata update for Accumulo is updating the tablet
> row with the new file. When you have one file which spans many tablets,
> you are now creating N metadata updates instead of just one. When you
> create the files, take into account the split points of your table, and
> use that to try to target one file per tablet.
>
> Roshan Punnoose wrote:
> > We are trying to perform bulk ingest at scale and wanted to get some
> > quick thoughts on how to increase performance and stability. One of the
> > problems we have is that we sometimes import thousands of small files,
> > and I don't believe there is a good way around this in the architecture
> > as of yet. Already I have run into an RPC timeout issue because the
> > import process is taking longer than 5m, and hit another issue where we
> > have so many files after a bulk import that we have had to bump
> > tserver.scan.files.open.max to 1K.
> >
> > Here are some other configs that we have been toying with:
> > - master.fate.threadpool.size: 20
> > - master.bulk.threadpool.size: 20
> > - master.bulk.timeout: 20m
> > - tserver.bulk.process.threads: 20
> > - tserver.bulk.assign.threads: 20
> > - tserver.bulk.timeout: 20m
> > - tserver.compaction.major.concurrent.max: 20
> > - tserver.scan.files.open.max: 1200
> > - tserver.server.threads.minimum: 64
> > - table.file.max: 64
> > - table.compaction.major.ratio: 20
> >
> > (HDFS)
> > - dfs.namenode.handler.count: 100
> > - dfs.datanode.handler.count: 50
> >
> > Just want to get any quick ideas for performing bulk ingest at scale.
> > Thanks guys
> >
> > p.s. This is on Accumulo 1.6.5
>
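
For anyone following along, here's a rough sketch of what those settings
and the import call look like through the 1.6 client API (the instance
details, table name and paths are placeholders):

    import org.apache.accumulo.core.client.ZooKeeperInstance
    import org.apache.accumulo.core.client.security.tokens.PasswordToken

    // Placeholder instance name, ZooKeepers, credentials, table and paths.
    val inst = new ZooKeeperInstance("myInstance", "zkhost:2181")
    val conn = inst.getConnector("user", new PasswordToken("secret"))

    // System-wide properties (like the bulk timeouts above) go through
    // instanceOperations; per-table properties through tableOperations.
    conn.instanceOperations().setProperty("master.bulk.timeout", "20m")
    conn.tableOperations().setProperty("mytable", "table.file.max", "64")

    // The bulk-import call under discussion. The failures directory must
    // exist and be empty; RFiles that cannot be assigned are moved there.
    conn.tableOperations().importDirectory(
      "mytable",
      "/bulk/files",    // directory of RFiles produced by the ingest job
      "/bulk/failures",
      false)            // setTime = false: keep the timestamps in the RFiles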
