phoenix-dev mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Bulk-loader performance
Date Sat, 14 Mar 2015 17:32:24 GMT
Please file a JIRA, Tulasi. This is a fair point. I'm surprised it's
4x faster. Can you share your code for the direct encoding path there
too? Are you still doing the CSV parsing in your code? Also, are you
sorting the KeyValues or do you know that they'll be in row key order
in the CSV file?
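
For context on the ordering question: HFile bulk loads need the KeyValues in
sorted order, which the standard MR path gets from TotalOrderPartitioner plus
KeyValueSortReducer. A minimal, illustrative sketch of an explicit in-mapper
sort (assuming the direct encoding path collects its KeyValues into a list;
this is not Tulasi's actual code):

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.hbase.KeyValue;

    // Illustrative helper: order the KeyValues produced for one input record
    // before handing them to the HFile writer. In the normal bulk-load
    // pipeline this ordering is enforced by TotalOrderPartitioner +
    // KeyValueSortReducer, so an explicit sort is only needed if that
    // reduce-side sort is bypassed.
    public final class KeyValueSorter {
      private KeyValueSorter() {}

      public static void sortForBulkLoad(List<KeyValue> keyValues) {
        // KeyValue.COMPARATOR orders by row key, then family, qualifier and
        // timestamp -- the order HFiles require.
        Collections.sort(keyValues, KeyValue.COMPARATOR);
      }
    }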

In the meantime, I'll clean up the original patch. I have one more
improvement I can make that's pretty straightforward too.

Thanks,
James

On Fri, Mar 13, 2015 at 2:54 PM, Tulasi Paradarami
<tulasi.krishna.p@gmail.com> wrote:
>>
>> I don't know of any benchmarks vs. HBase bulk loader. Would be interesting,
>> if you could come up with an apples-to-apples test.
>
>
> I did some testing to get an apples-to-apples comparison between the two
> options.
>
> For 10 million rows (a 3-column composite primary key plus 3 column
> qualifiers):
> JDBC bulk-loading: 430 sec (after applying the PHOENIX-1711 patch)
> Direct Phoenix encoding: 112 sec
>
> Using the direct encoding path takes about a quarter of the JDBC time, and I
> think the difference is significant enough to justify providing APIs for
> direct Phoenix encoding in the bulk-loader.
>
> Thanks
>
>
> On Thu, Mar 5, 2015 at 2:13 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>
>> I don't know of any benchmarks vs. HBase bulk loader. Would be interesting,
>> if you could come up with an apples-to-apples test.
>>
>> 100TB binary file cannot be partitioned at all? You're always bound to a
>> single process. Bummer. I guess plan B could be pre-processing the binary
>> file into something splittable. You'll cover the data twice, but if Phoenix
>> encoding really is the current bottleneck, as your mail indicates, then
>> separating the decoding of the binary file from encoding of the Phoenix
>> output should allow for parallelizing the second step and improve the state
>> of things.
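
One way to picture that pre-processing pass (a rough sketch under assumptions,
not a worked-out proposal): a single reader streams the proprietary binary file
once, decodes each record with whatever format-specific logic exists today, and
writes the result as a block-compressed, splittable SequenceFile of CSV lines
that a parallel Phoenix-encoding job (or the existing CSV bulk-loader) can then
consume with many mappers. The decoder and CSV layout are placeholders here.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: stage decoded records into a splittable SequenceFile so the
    // Phoenix-encoding step is no longer tied to the unsplittable binary file.
    public final class SplittableStaging {
      private SplittableStaging() {}

      // decodedCsvLines comes from the (hypothetical) single-threaded decoder
      // that understands the proprietary binary format.
      public static void writeSplittable(Configuration conf, Path output,
          Iterator<String> decodedCsvLines) throws IOException {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(output),
            SequenceFile.Writer.keyClass(NullWritable.class),
            SequenceFile.Writer.valueClass(Text.class),
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
          while (decodedCsvLines.hasNext()) {
            // Key is unused; each value is one CSV-encoded record.
            writer.append(NullWritable.get(), new Text(decodedCsvLines.next()));
          }
        }
      }
    }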
>>
>> In the meantime, it would be good to look at perf improvements in the Phoenix
>> encoding step. Any volunteers lurking about?
>>
>> -n
>>
>> On Thu, Mar 5, 2015 at 1:08 PM, Tulasi Paradarami <
>> tulasi.krishna.p@gmail.com> wrote:
>>
>> > Gabriel, Nick, thanks for your inputs. My comments below.
>> >
>> > > Although it may look as though data is being written over the wire to
>> > > Phoenix, the execution of an upsert executor and retrieval of the
>> > > uncommitted KeyValues is all local (in memory). The code is implemented
>> > > in this way because JDBC is the general API used within Phoenix -- there
>> > > isn't a direct "convert fields to Phoenix encoding" API, although this is
>> > > doing the equivalent operation.
>> >
>> > I understand that the data processing is in memory, but performance can be
>> > improved if there is a direct conversion to Phoenix encoding.
>> > Are there any performance comparison results between the Phoenix and HBase
>> > bulk-loaders?
>> >
>> > > Could you give some more information on your performance numbers? For
>> > > example, is this the throughput that you're getting in a single process,
>> > > or over a number of processes? If so, how many processes?
>> >
>> > It's currently running as a single mapper processing a binary file
>> > (unsplittable). Disk throughput doesn't look to be an issue here.
>> > Production has machines with the same processing capability, but obviously
>> > more nodes and input files.
>> >
>> >
>> > > Also, how many columns are in the records that you're loading?
>> >
>> > The row size is small: 3 integers for the PK, 2 short qualifiers, and 1
>> > varchar qualifier.
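
For concreteness, a hypothetical Phoenix DDL matching that description (the
table and column names are invented; only the types follow the thread), issued
over the standard Phoenix JDBC driver:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Illustrative only: a table shaped like the one described above
    // (3-integer composite primary key, two SMALLINT and one VARCHAR
    // qualifier). The table name and ZooKeeper quorum are placeholders.
    public class CreateSampleTable {
      public static void main(String[] args) throws SQLException {
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
          stmt.execute(
              "CREATE TABLE IF NOT EXISTS SAMPLE_LOAD (\n"
            + "  K1 INTEGER NOT NULL,\n"
            + "  K2 INTEGER NOT NULL,\n"
            + "  K3 INTEGER NOT NULL,\n"
            + "  C1 SMALLINT,\n"
            + "  C2 SMALLINT,\n"
            + "  C3 VARCHAR,\n"
            + "  CONSTRAINT PK PRIMARY KEY (K1, K2, K3))");
        }
      }
    }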
>> >
>> > > What is the current (projected) time required to load the data?
>> >
>> > About 20-25 days
>> >
>> >
>> > > What is the minimum allowable ingest speed to be considered satisfactory?
>> >
>> > We would like to finish the load in less than 10-12 days.
>> >
>> >
>> > > You can make things go faster by increasing the number of mappers.
>> >
>> > The input file (binary) is not splittable, so each mapper is tied to a
>> > specific file.
>> >
>> > > What changes did you make to the map() method? Increased logging,
>> > > performance enhancements, plugging in custom logic, something else?
>> >
>> > I added custom logic to the map() method.
>> >
>> >
>> >
>> > On Thu, Mar 5, 2015 at 7:53 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>> >
>> > > Also: how large is your cluster? You can make things go faster by
>> > > increasing the number of mappers. What changes did you make to the map()
>> > > method? Increased logging, performance enhancements, plugging in custom
>> > > logic, something else?
>> > >
>> > > On Thursday, March 5, 2015, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>> > >
>> > > > Hi Tulasi,
>> > > >
>> > > > Answers (and questions) inlined below:
>> > > >
>> > > > On Thu, Mar 5, 2015 at 2:41 AM Tulasi Paradarami <
>> > > > tulasi.krishna.p@gmail.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > Here are the details of our environment:
>> > > > > Phoenix 4.3
>> > > > > HBase 0.98.6
>> > > > >
>> > > > > I'm loading data to a Phoenix table using the CSV bulk-loader (after
>> > > > > making some changes to the map(...) method) and it is processing about
>> > > > > 16,000 - 20,000 rows/sec. I noticed that the bulk-loader spends up to
>> > > > > 40% of the execution time in the following steps.
>> > > >
>> > > >
>> > > > > //...
>> > > > > csvRecord = csvLineParser.parse(value.toString());
>> > > > > csvUpsertExecutor.execute(ImmutableList.of(csvRecord));
>> > > > > Iterator<Pair<byte[], List<KeyValue>>> uncommittedDataIterator =
>> > > > >     PhoenixRuntime.getUncommittedDataIterator(conn, true);
>> > > > > //...
>> > > > >
>> > > >
>> > > > The non-code translation of those steps is:
>> > > > 1. Parse the CSV record
>> > > > 2. Convert the contents of the CSV record into KeyValues
>> > > >
>> > > > Although it may look as though data is being written over the wire to
>> > > > Phoenix, the execution of an upsert executor and retrieval of the
>> > > > uncommitted KeyValues is all local (in memory). The code is implemented
>> > > > in this way because JDBC is the general API used within Phoenix -- there
>> > > > isn't a direct "convert fields to Phoenix encoding" API, although this is
>> > > > doing the equivalent operation.
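
A minimal, self-contained sketch of that local-encode-then-extract pattern
(placeholder table and column names; this shows the general JDBC usage, not the
bulk-loader's exact code): with auto-commit off, the UPSERT never leaves the
client, the encoded KeyValues are pulled from the uncommitted mutation state,
and the connection is rolled back.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Pair;
    import org.apache.phoenix.util.PhoenixRuntime;

    public class LocalEncodeExample {
      public static void main(String[] args) throws SQLException {
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:localhost")) {
          conn.setAutoCommit(false);
          try (PreparedStatement ps = conn.prepareStatement(
              "UPSERT INTO SAMPLE_LOAD (K1, K2, K3, C1, C2, C3)"
                  + " VALUES (?, ?, ?, ?, ?, ?)")) {
            ps.setInt(1, 1);
            ps.setInt(2, 2);
            ps.setInt(3, 3);
            ps.setShort(4, (short) 10);
            ps.setShort(5, (short) 20);
            ps.setString(6, "value");
            // Executes entirely client-side: nothing is sent to HBase yet.
            ps.execute();
          }
          // Pull the KeyValues that the UPSERT encoded but did not commit.
          Iterator<Pair<byte[], List<KeyValue>>> dataIterator =
              PhoenixRuntime.getUncommittedDataIterator(conn, true);
          while (dataIterator.hasNext()) {
            Pair<byte[], List<KeyValue>> tableKvs = dataIterator.next();
            // tableKvs.getFirst(): physical table name as bytes;
            // tableKvs.getSecond(): the encoded KeyValues for that table.
          }
          conn.rollback(); // discard the uncommitted state after extraction
        }
      }
    }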
>> > > >
>> > > > Could you give some more information on your performance numbers? For
>> > > > example, is this the throughput that you're getting in a single process,
>> > > > or over a number of processes? If so, how many processes? Also, how many
>> > > > columns are in the records that you're loading?
>> > > >
>> > > >
>> > > > >
>> > > > > We plan to load up to 100TB of data, and the overall performance of the
>> > > > > bulk-loader is not satisfactory.
>> > > > >
>> > > >
>> > > > How many records are in that 100TB? What is the current (projected)
>> > > > time required to load the data? What is the minimum allowable ingest
>> > > > speed to be considered satisfactory?
>> > > >
>> > > > - Gabriel
>> > > >
>> > >
>> >
>>
