accumulo-user mailing list archives

From "Sukant Hajra" <>
Subject Cost of scanner usage in a MapReduce mapper?
Date Thu, 01 Nov 2012 03:35:26 GMT
Today I had a really nice conversation with billie and vines on #accumulo.
This email is a follow-up to that conversation and gives a little more
context on my problem.

We have an application that we've developed independently from MapReduce.  To
get away from the low-level keys and values of Accumulo, we quickly made a
series of DAOs that each take in an Accumulo Instance as a constructor
argument.  These DAOs internally create the necessary scanners and return
domain-specific objects.  I imagine this is a common practice.

Now, we've got a feature that needs to operate on all the data, so we're doing
some MapReduce.  I think I understand now the architecture of
AccumuloInputFormat from discussions on #accumulo.  What I didn't discuss was
whether it was reasonable (or not reasonable because of the performance cost)
to try to use one of our DAOs within a mapper.

The mappers need to operate per row, and our system has potentially billions of
rows.  With my DAOs, I can reuse the same Accumulo instance, but each call
will create a new scanner from my instance, so a MapReduce job using a DAO in
the mappers will potentially create billions of scanners over the course of
operation.   However, the way we've designed these DAOs, it's easy to make sure
all accesses are tied to the row the mapper is tasked with (in an attempt to
maintain data locality).

By comparison, I believe AccumuloInputFormat will create about as many
Accumulo scanners as there are tablet servers, so dramatically fewer.

Our current thinking is that creating billions of scanners through these DAO
accesses might cost too much in performance, but we're not sure this is the
case, given whatever caching Accumulo does in its clients.

If the performance cost is indeed too high, then we're going to have to deal
with the abstraction challenge of trying to avoid code duplication between our
DAOs and our MapReduce jobs.
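
A minimal sketch of one way that duplication might be limited, assuming the
row-to-domain-object decoding can be pulled out of the DAO into a function
that only sees the entries of a row. Everything here is hypothetical and
uses no Accumulo classes: Entry stands in for a key/value pair, Person for a
domain object, and the "name" and "email" qualifiers for a schema. A DAO
would feed decode() from a scanner restricted to one row, while a MapReduce
job would feed it the entries the input format already delivers, so no
scanner is created per row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowDecodingSketch {

    // Hypothetical stand-in for one Accumulo key/value pair, reduced to
    // the fields this sketch needs.
    static final class Entry {
        final String row;
        final String qualifier;
        final String value;

        Entry(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Hypothetical domain object that a DAO would return.
    static final class Person {
        final String id;
        final String name;
        final String email;

        Person(String id, String name, String email) {
            this.id = id;
            this.name = name;
            this.email = email;
        }
    }

    // Shared decoding logic: builds a Person from the entries of one row.
    // It does not know whether the entries came from a scanner or from
    // MapReduce input, so both code paths can call it.
    static Person decode(String row, Iterable<Entry> entries) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Entry e : entries) {
            if (!row.equals(e.row)) {
                throw new IllegalArgumentException(
                    "entry for row " + e.row + " passed while decoding " + row);
            }
            fields.put(e.qualifier, e.value);
        }
        return new Person(row, fields.get("name"), fields.get("email"));
    }

    // MapReduce-style path: walks entries that arrive sorted by row,
    // groups consecutive entries with the same row, and decodes each group.
    static List<Person> decodeAll(Iterable<Entry> sortedEntries) {
        List<Person> people = new ArrayList<>();
        List<Entry> currentRow = new ArrayList<>();
        for (Entry e : sortedEntries) {
            if (!currentRow.isEmpty() && !currentRow.get(0).row.equals(e.row)) {
                people.add(decode(currentRow.get(0).row, currentRow));
                currentRow = new ArrayList<>();
            }
            currentRow.add(e);
        }
        if (!currentRow.isEmpty()) {
            people.add(decode(currentRow.get(0).row, currentRow));
        }
        return people;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("u1", "email", "ann@example.com"));
        entries.add(new Entry("u1", "name", "Ann"));
        entries.add(new Entry("u2", "name", "Bob"));
        for (Person p : decodeAll(entries)) {
            System.out.println(p.id + " " + p.name + " " + p.email);
        }
    }
}
```

With this split, the DAO keeps its convenient per-row API for the
application, and the mapper reuses the same decoding without going through
the DAO's scanner creation.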

Thanks for your feedback,
