accumulo-user mailing list archives

Subject Re: Cost of scanner usage in a MapReduce mapper?
Date Thu, 01 Nov 2012 08:20:20 GMT

For clarification: are you trying to create DAOs from the Key/Value pairs fed
to a Mapper by AccumuloInputFormat, or are you trying to process a different
data set while simultaneously querying your DAOs?
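If it's the first case, a mapper driven by AccumuloInputFormat already receives Key/Value pairs directly, so a DAO layer could translate them into domain objects without opening any scanner of its own. A minimal sketch of that shape (class and output types here are illustrative, not from your code):

```java
import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: AccumuloInputFormat feeds each Accumulo entry
// to the mapper as a (Key, Value) pair; no extra scanner is needed.
public class KeyValueMapper extends Mapper<Key, Value, Text, Text> {
  @Override
  protected void map(Key key, Value value, Context context)
      throws IOException, InterruptedException {
    // A DAO could build its domain object from (key, value) here
    // instead of issuing its own query back to Accumulo.
    context.write(new Text(key.getRow()), new Text(value.get()));
  }
}
```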

> Today I had a really nice conversation with billie and vines on #accumulo.
> This email is a follow-up to that conversation, with a little more context
> on my problem.
>
> We have an application that we developed independently of MapReduce. To get
> away from Accumulo's low-level keys and values, we quickly wrote a series of
> DAOs that each take an Accumulo Instance as a constructor argument. These
> DAOs internally create the necessary scanners and return domain-specific
> objects. I imagine this is a common practice.
>
> Now we have a feature that needs to operate on all the data, so we're doing
> some MapReduce. From the discussions on #accumulo I think I now understand
> the architecture of AccumuloInputFormat. What I didn't discuss was whether
> it is reasonable (or unreasonable, because of the performance cost) to use
> one of our DAOs within a mapper.
>
> The mappers need to operate per row, and our system has potentially billions
> of rows. With my DAOs I can reuse the same Accumulo Instance, but each call
> creates a new scanner from that instance, so a MapReduce job using a DAO in
> its mappers could create billions of scanners over the course of a run.
> However, the way we designed these DAOs, it's easy to ensure all accesses
> are tied to the row the mapper is tasked with (in an attempt to maintain
> data locality).
>
> By comparison, I believe AccumuloInputFormat creates roughly as many
> scanners as there are tablet servers, which is dramatically fewer.
>
> Our current thinking is that creating billions of scanners through these DAO
> accesses might cost too much in performance, but we're not completely sure
> that's the case, given the kind of caching Accumulo does in its clients. If
> the performance cost is indeed too high, we'll have to deal with the
> abstraction challenge of avoiding code duplication between our DAOs and our
> MapReduce jobs.
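To make sure we're talking about the same pattern: a per-row DAO call along the lines below would indeed create a fresh Scanner for every row a mapper sees. The names are hypothetical (my reconstruction of the described design, not your code), and I've taken a Connector in the constructor to keep the sketch short. Note that a Scanner is a fairly lightweight client-side object; the dominant cost is the RPC round trip each lookup makes to a tablet server.

```java
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// Hypothetical sketch of the per-row DAO pattern described above.
public class RowDao {
  private final Connector connector;  // reused across calls

  public RowDao(Connector connector) {
    this.connector = connector;
  }

  // One new Scanner per lookup: cheap to construct client-side, but
  // every call still pays a round trip to the tablet server holding
  // the row, billions of times over a full MapReduce run.
  public void loadRow(String table, String row) throws TableNotFoundException {
    Scanner scanner = connector.createScanner(table, Authorizations.EMPTY);
    scanner.setRange(new Range(new Text(row)));
    for (Map.Entry<Key, Value> entry : scanner) {
      // ... build the domain object from entry ...
    }
  }
}
```

If the mapper's input row already contains everything the DAO needs, refactoring the DAO so its object-building logic accepts Key/Value pairs directly (and is fed either by a Scanner outside MapReduce or by AccumuloInputFormat inside it) would avoid both the extra round trips and the code duplication.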
