phoenix-dev mailing list archives

From Tulasi Paradarami <tulasi.krishn...@gmail.com>
Subject Re: Bulk-loader performance
Date Fri, 06 Mar 2015 22:21:09 GMT
Hi James,

Here is a breakdown of the percentage of execution time spent in some of the
steps in the mapper (a sketch of the corresponding code follows the list):

csvParser: 18%
csvUpsertExecutor.execute(ImmutableList.of(csvRecord)): 39%
PhoenixRuntime.getUncommittedDataIterator(conn, true): 9%
while (uncommittedDataIterator.hasNext()) {...}: 15%
Read IO & custom processing: 19%
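
For reference, here is roughly the stretch of the (modified) mapper those
numbers cover. This is a simplified sketch with the measurements inlined as
comments, not the exact code; the emit step at the end is only schematic:

//...
csvRecord = csvLineParser.parse(value.toString());             // ~18%
csvUpsertExecutor.execute(ImmutableList.of(csvRecord));        // ~39%
Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
    PhoenixRuntime.getUncommittedDataIterator(conn, true);     // ~9%
while (uncommittedDataIterator.hasNext()) {                    // ~15%
    Pair<byte[], List<KeyValue>> kvPair = uncommittedDataIterator.next();
    // ... hand the KeyValues off for output / custom processing ...
}
// remaining ~19%: read IO and custom processing
//...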


I couldn't find where UpsertCompiler.compile() is called - could you point
me to it?
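
Separately, if the point is to avoid compiling the UPSERT again and again,
I'm assuming it would boil down to preparing the statement once per mapper
and only re-binding values per record. A rough sketch of that pattern is
below (table and column names are made up to match our row shape; whether a
reused PreparedStatement actually skips recompilation in Phoenix is exactly
what I'd like to confirm):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class PreparedUpsertSketch {
    // Sketch only: prepare the UPSERT once, re-bind per record.
    // MY_TABLE and its columns are hypothetical (3-int PK, 2 shorts, 1 varchar).
    static void upsertBatch(Connection conn, List<String[]> records) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(
                "UPSERT INTO MY_TABLE (PK1, PK2, PK3, Q1, Q2, Q3) VALUES (?, ?, ?, ?, ?, ?)")) {
            for (String[] r : records) {
                stmt.setInt(1, Integer.parseInt(r[0]));
                stmt.setInt(2, Integer.parseInt(r[1]));
                stmt.setInt(3, Integer.parseInt(r[2]));
                stmt.setShort(4, Short.parseShort(r[3]));
                stmt.setShort(5, Short.parseShort(r[4]));
                stmt.setString(6, r[5]);
                stmt.execute(); // with auto-commit off, rows stay uncommitted
            }
            // The KeyValues can then still be pulled with
            // PhoenixRuntime.getUncommittedDataIterator(conn, true) as before.
        }
    }
}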

Thanks

- Tulasi

On Thu, Mar 5, 2015 at 3:26 PM, James Taylor <jamestaylor@apache.org> wrote:

> Thanks for pursuing this, Tulasi. I'm sure there's room for
> improvement, but I think we need to get to the next level of detail to
> know where. Of the 40% execution time you mentioned, how much is spent
> in the CSV parse? For that, we rely on Apache Commons CSV, so probably
> not much we can do about that (short of using a different CSV parser
> or pinging that project for ideas).
>
> How about in UpsertCompiler.compile() - how much time is spent there?
> The conversion from the csvRecord to Phoenix encoding is very direct,
> but the compilation again and again could potentially be avoided.
>
> Thanks,
>
>     James
>
> On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> > I don't know of any benchmarks vs. the HBase bulk loader. It would be
> > interesting if you could come up with an apples-to-apples test.
> >
> > The 100TB binary file cannot be partitioned at all? You're always bound to
> > a single process. Bummer. I guess plan B could be pre-processing the binary
> > file into something splittable. You'll cover the data twice, but if Phoenix
> > encoding really is the current bottleneck, as your mail indicates, then
> > separating the decoding of the binary file from the encoding of the Phoenix
> > output should allow the second step to be parallelized and improve the
> > state of things.
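> >
> > Just to illustrate (nothing Phoenix-specific here, the decoding method is a
> > placeholder for the custom binary format, and none of this is tested), that
> > pre-processing pass could be as simple as:
> >
> > import java.io.*;
> >
> > public class BinaryToCsv {
> >     // Placeholder: decode one record from the proprietary binary format;
> >     // returns null at end of stream.
> >     static String[] decodeNextRecord(DataInputStream in) throws IOException {
> >         return null;
> >     }
> >
> >     public static void main(String[] args) throws IOException {
> >         // args[0]: unsplittable binary input, args[1]: splittable CSV output
> >         try (DataInputStream in = new DataInputStream(
> >                  new BufferedInputStream(new FileInputStream(args[0])));
> >              PrintWriter out = new PrintWriter(
> >                  new BufferedWriter(new FileWriter(args[1])))) {
> >             String[] fields;
> >             while ((fields = decodeNextRecord(in)) != null) {
> >                 // naive join; real code would need proper CSV escaping
> >                 out.println(String.join(",", fields));
> >             }
> >         }
> >     }
> > }
> >
> > Once the data is in a splittable format, the CSV bulk loader can fan the
> > second pass out over as many mappers as the cluster allows.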
> >
> > In the meantime, it would be good to look at perf improvements to the
> > Phoenix encoding step. Any volunteers lurking about?
> >
> > -n
> >
> > On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami <
> > tulasi.krishna.p@gmail.com> wrote:
> >
> >> Gabriel, Nick, thanks for your inputs. My comments below.
> >>
> >> > Although it may look as though data is being written over the wire to
> >> > Phoenix, the execution of an upsert executor and retrieval of the
> >> > uncommitted KeyValues is all local (in memory). The code is implemented
> >> > in this way because JDBC is the general API used within Phoenix -- there
> >> > isn't a direct "convert fields to Phoenix encoding" API, although this
> >> > is doing the equivalent operation.
> >>
> >> I understand that the data processing is in memory, but performance could
> >> be improved if there were a direct conversion to Phoenix encoding.
> >> Are there any performance comparison results between the Phoenix and HBase
> >> bulk-loaders?
> >>
> >> > Could you give some more information on your performance numbers? For
> >> > example, is this the throughput that you're getting in a single process,
> >> > or over a number of processes? If so, how many processes?
> >>
> >> It's currently running as a single mapper processing a binary file
> >> (un-splittable). Disk throughput doesn't look to be an issue here.
> >> Production has machines of the same processing capability, but obviously
> >> more nodes and input files.
> >>
> >>
> >> Also, how many columns are in the records that you're loading?
> >>
> >> The row size is small: 3 integers for the PK, 2 short qualifiers, and 1
> >> varchar qualifier.
> >>
> >> What is the current (projected) time required to load the data?
> >>
> >> About 20-25 days
> >>
> >>
> >> What is the minimum allowable ingest speed to be considered satisfactory?
> >>
> >> We would like to finish the load in less than 10-12 days.
> >>
> >>
> >> You can make things go faster by increasing the number of mappers.
> >>
> >> The input file (binary) is not splittable; a mapper is tied to the
> >> specific file.
> >>
> >> What changes did you make to the map() method? Increased logging,
> >> > performance enhancements, plugging in custom logic, something else?
> >>
> >> I added custom logic to the map() method.
> >>
> >>
> >>
> >> On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >>
> >> > Also: how large is your cluster? You can make things go faster by
> >> > increasing the number of mappers. What changes did you make to the map()
> >> > method? Increased logging, performance enhancements, plugging in custom
> >> > logic, something else?
> >> >
> >> > On Thursday, March 5, 2015, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> >> >
> >> > > Hi Tulasi,
> >> > >
> >> > > Answers (and questions) inlined below:
> >> > >
> >> > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami <
> >> > > tulasi.krishna.p@gmail.com> wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > Here are the details of our environment:
> >> > > > Phoenix 4.3
> >> > > > HBase 0.98.6
> >> > > >
> >> > > > I'm loading data to a Phoenix table using the csv bulk-loader (after
> >> > > > making some changes to the map(...) method) and it is processing about
> >> > > > 16,000 - 20,000 rows/sec. I noticed that the bulk-loader spends up to
> >> > > > 40% of the execution time in the following steps.
> >> > >
> >> > >
> >> > > > //...
> >> > > > csvRecord = csvLineParser.parse(value.toString());
> >> > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
> >> > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
> >> > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
> >> > > > //...
> >> > > >
> >> > >
> >> > > The non-code translation of those steps is:
> >> > > 1. Parse the CSV record
> >> > > 2. Convert the contents of the CSV record into KeyValues
> >> > >
> >> > > Although it may look as though data is being written over the wire to
> >> > > Phoenix, the execution of an upsert executor and retrieval of the
> >> > > uncommitted KeyValues is all local (in memory). The code is implemented
> >> > > in this way because JDBC is the general API used within Phoenix -- there
> >> > > isn't a direct "convert fields to Phoenix encoding" API, although this
> >> > > is doing the equivalent operation.
> >> > >
> >> > > Could you give some more information on your performance numbers? For
> >> > > example, is this the throughput that you're getting in a single process,
> >> > > or over a number of processes? If so, how many processes? Also, how many
> >> > > columns are in the records that you're loading?
> >> > >
> >> > >
> >> > > >
> >> > > > We plan to load up to 100TB of data, and the overall performance of
> >> > > > the bulk-loader is not satisfactory.
> >> > > >
> >> > >
> >> > > How many records are in that 100TB? What is the current (projected)
> >> > > time required to load the data? What is the minimum allowable ingest
> >> > > speed to be considered satisfactory?
> >> > >
> >> > > - Gabriel
> >> > >
> >> >
> >>
>
