phoenix-dev mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Bulk-loader performance
Date Mon, 09 Mar 2015 06:02:00 GMT
Thanks, Tulasi. That's helpful. I filed PHOENIX-1711 and attached a
patch that should reduce the csvUpsertExecutor.execute() time (that's
where the UpsertCompiler.compile() is called). It's still under
testing, but if you want to give it a shot and let us know if it
improves things, that'd be much appreciated.
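For reference, the idea is to avoid compiling the UPSERT for every record and
only re-bind values per row. A rough, untested sketch of that general pattern
over plain JDBC (not the actual PHOENIX-1711 patch; the connection URL, table,
and column names below are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class UpsertReuseSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
            conn.setAutoCommit(false);
            // Prepare the UPSERT once, outside the per-record loop.
            PreparedStatement stmt = conn.prepareStatement(
                    "UPSERT INTO MY_TABLE (PK1, PK2, PK3, COL1) VALUES (?, ?, ?, ?)");
            for (int i = 0; i < 1000; i++) {
                // Per record: only bind values and execute; no re-parsing of the SQL.
                stmt.setInt(1, i);
                stmt.setInt(2, i * 2);
                stmt.setInt(3, i * 3);
                stmt.setString(4, "value-" + i);
                stmt.executeUpdate();
            }
            conn.commit();
            conn.close();
        }
    }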

    James



On Fri, Mar 6, 2015 at 2:21 PM, Tulasi Paradarami
<tulasi.krishna.p@gmail.com> wrote:
> Hi James,
>
> Here is a breakdown of the percentage of execution time for some of the steps
> in the mapper:
>
> csvParser: 18%
> csvUpsertExecutor.execute(ImmutableList.of(csvRecord)): 39%
> PhoenixRuntime.getUncommittedDataIterator(conn, true): 9%
> while (uncommittedDataIterator.hasNext()) {...}: 15%
> Read IO & custom processing: 19%
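A per-phase breakdown like the one above can be collected with simple
System.nanoTime() accumulators around each step of the map() call; a rough,
self-contained sketch with placeholder phase bodies (not the real mapper code):

    import java.util.Arrays;
    import java.util.List;

    public class PhaseTimingSketch {
        // Placeholder phases; in the real mapper these would be the CSV parse,
        // csvUpsertExecutor.execute(...), and draining the uncommitted-KeyValue iterator.
        static void parsePhase(String line)   { line.split(","); }
        static void upsertPhase(String line)  { /* stand-in for the upsert executor */ }
        static void iteratePhase(String line) { /* stand-in for the KeyValue loop */ }

        public static void main(String[] args) {
            List<String> lines = Arrays.asList("1,2,3,a", "4,5,6,b");
            long parseNs = 0, upsertNs = 0, iterateNs = 0;
            for (String line : lines) {
                long t0 = System.nanoTime();
                parsePhase(line);
                long t1 = System.nanoTime();
                upsertPhase(line);
                long t2 = System.nanoTime();
                iteratePhase(line);
                long t3 = System.nanoTime();
                parseNs += t1 - t0;
                upsertNs += t2 - t1;
                iterateNs += t3 - t2;
            }
            long total = Math.max(1, parseNs + upsertNs + iterateNs);
            System.out.printf("parse %d%%, upsert %d%%, iterate %d%%%n",
                    100 * parseNs / total, 100 * upsertNs / total, 100 * iterateNs / total);
        }
    }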
>
>
> I couldn't find where UpsertCompiler.compile() is called - could you point
> me to it?
>
> Thanks
>
> - Tulasi
>
> On Thu, Mar 5, 2015 at 3:26 PM, James Taylor <jamestaylor@apache.org> wrote:
>
>> Thanks for pursuing this, Tulasi. I'm sure there's room for
>> improvement, but I think we need to get to the next level of detail to
>> know where. Of the 40% execution time you mentioned, how much is spent
>> in the CSV parse? For that, we rely on Apache Commons CSV, so probably
>> not much we can do about that (short of using a different CSV parser
>> or pinging that project for ideas).
>>
>> How about in UpsertCompiler.compile() - how much time is spent there?
>> The conversion from the csvRecord to Phoenix encoding is very direct,
>> but the repeated compilation could potentially be avoided.
>>
>> Thanks,
>>
>>     James
>>
>> On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>> > I don't know of any benchmarks vs. the HBase bulk loader. It would be
>> > interesting if you could come up with an apples-to-apples test.
>> >
>> > The 100TB binary file cannot be partitioned at all? You're always bound
>> > to a single process. Bummer. I guess plan B could be pre-processing the
>> > binary file into something splittable. You'll cover the data twice, but if
>> > Phoenix encoding really is the current bottleneck, as your mail indicates,
>> > then separating the decoding of the binary file from the encoding of the
>> > Phoenix output should allow for parallelizing the second step and improve
>> > the state of things.
>> >
>> > In the meantime, it would be good to look at perf improvements of the
>> > Phoenix encoding step. Any volunteers lurking about?
>> >
>> > -n
>> >
>> > On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami <
>> > tulasi.krishna.p@gmail.com> wrote:
>> >
>> >> Gabriel, Nick, thanks for your inputs. My comments below.
>> >>
>> >> > Although it may look as though data is being written over the wire to
>> >> > Phoenix, the execution of an upsert executor and retrieval of the
>> >> > uncommitted KeyValues is all local (in memory). The code is implemented
>> >> > in this way because JDBC is the general API used within Phoenix --
>> >> > there isn't a direct "convert fields to Phoenix encoding" API, although
>> >> > this is doing the equivalent operation.
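Concretely, that local, in-memory pattern looks roughly like the following; a
minimal sketch only, with an invented connection URL and table, not the
bulk-loader's actual mapper code:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Pair;
    import org.apache.phoenix.util.PhoenixRuntime;

    public class LocalEncodeSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
            conn.setAutoCommit(false); // keep mutations local; nothing goes over the wire yet

            PreparedStatement stmt = conn.prepareStatement(
                    "UPSERT INTO MY_TABLE (PK1, PK2, PK3, COL1) VALUES (?, ?, ?, ?)");
            stmt.setInt(1, 1);
            stmt.setInt(2, 2);
            stmt.setInt(3, 3);
            stmt.setString(4, "abc");
            stmt.executeUpdate(); // encodes the row into uncommitted KeyValues in memory

            // Pull the encoded KeyValues out of the connection instead of committing them.
            Iterator<Pair<byte[], List<KeyValue>>> it =
                    PhoenixRuntime.getUncommittedDataIterator(conn, true);
            while (it.hasNext()) {
                Pair<byte[], List<KeyValue>> tableKvs = it.next();
                // In the bulk loader these KeyValues would be written out for HFile creation.
                System.out.println(tableKvs.getSecond().size() + " KeyValues produced");
            }
            conn.rollback(); // discard local state before the next batch
            conn.close();
        }
    }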
>> >>
>> >> I understand that the data processing is in memory, but performance can be
>> >> improved if there is a direct conversion to Phoenix encoding.
>> >> Are there any performance comparison results between the Phoenix and HBase
>> >> bulk-loaders?
>> >>
>> >> > Could you give some more information on your performance numbers? For
>> >> > example, is this the throughput that you're getting in a single process,
>> >> > or over a number of processes? If so, how many processes?
>> >>
>> >> It's currently running as a single mapper processing a binary file
>> >> (unsplittable). Disk throughput doesn't appear to be an issue here.
>> >> Production has machines of the same processing capability, but obviously
>> >> with more nodes and more input files.
>> >>
>> >>
>> >> > Also, how many columns are in the records that you're loading?
>> >>
>> >> The row size is small: 3 integers for the PK, 2 short qualifiers, and
>> >> 1 varchar qualifier.
>> >>
>> >> > What is the current (projected) time required to load the data?
>> >>
>> >> About 20-25 days
>> >>
>> >>
>> >> > What is the minimum allowable ingest speed to be considered satisfactory?
>> >>
>> >> We would like to finish the load in less than 10-12 days.
>> >>
>> >>
>> >> > You can make things go faster by increasing the number of mappers.
>> >>
>> >> The input file (binary) is not splittable; a mapper is tied to the
>> >> specific file.
>> >>
>> >> > What changes did you make to the map() method? Increased logging,
>> >> > performance enhancements, plugging in custom logic, something else?
>> >>
>> >> I added custom logic to the map() method.
>> >>
>> >>
>> >>
>> >> On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>> >>
>> >> > Also: how large is your cluster? You can make things go faster by
>> >> > increasing the number of mappers. What changes did you make to the map()
>> >> > method? Increased logging, performance enhancements, plugging in custom
>> >> > logic, something else?
>> >> >
>> >> > On Thursday, March 5, 2015, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>> >> >
>> >> > > Hi Tulasi,
>> >> > >
>> >> > > Answers (and questions) inlined below:
>> >> > >
>> >> > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami <
>> >> > > tulasi.krishna.p@gmail.com> wrote:
>> >> > >
>> >> > > > Hi,
>> >> > > >
>> >> > > > Here are the details of our environment:
>> >> > > > Phoenix 4.3
>> >> > > > HBase 0.98.6
>> >> > > >
>> >> > > > I'm loading data to a Phoenix table using the CSV bulk-loader (after
>> >> > > > making some changes to the map(...) method) and it is processing
>> >> > > > about 16,000 - 20,000 rows/sec. I noticed that the bulk-loader spends
>> >> > > > up to 40% of the execution time in the following steps:
>> >> > >
>> >> > >
>> >> > > > //...
>> >> > > > csvRecord = csvLineParser.parse(value.toString());
>> >> > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
>> >> > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
>> >> > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
>> >> > > > //...
>> >> > > >
>> >> > >
>> >> > > The non-code translation of those steps is:
>> >> > > 1. Parse the CSV record
>> >> > > 2. Convert the contents of the CSV record into KeyValues
>> >> > >
>> >> > > Although it may look as though data is being written over the wire to
>> >> > > Phoenix, the execution of an upsert executor and retrieval of the
>> >> > > uncommitted KeyValues is all local (in memory). The code is implemented
>> >> > > in this way because JDBC is the general API used within Phoenix --
>> >> > > there isn't a direct "convert fields to Phoenix encoding" API, although
>> >> > > this is doing the equivalent operation.
>> >> > >
>> >> > > Could you give some more information on your performance numbers? For
>> >> > > example, is this the throughput that you're getting in a single
>> >> > > process, or over a number of processes? If so, how many processes?
>> >> > > Also, how many columns are in the records that you're loading?
>> >> > >
>> >> > >
>> >> > > >
>> >> > > > We plan to load up to 100TB of data and the overall performance of
>> >> > > > the bulk-loader is not satisfactory.
>> >> > > >
>> >> > >
>> >> > > How many records are in that 100TB? What is the current (projected)
>> >> > > time required to load the data? What is the minimum allowable ingest
>> >> > > speed to be considered satisfactory?
>> >> > >
>> >> > > - Gabriel
>> >> > >
>> >> >
>> >>
>>
