phoenix-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-1711) Improve performance of CSV loader
Date Mon, 09 Mar 2015 08:27:38 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352669#comment-14352669 ]

Gabriel Reid commented on PHOENIX-1711:
---------------------------------------

FWIW, my take on this topic in general is that the numbers are pretty much in line with what
I would expect as far as where the work is being done (i.e. 18% of the time spent in parsing
the input, and 39% of the time spent converting into Phoenix encoding). Seeing as those two
tasks are the only real functionality performed by this tool, I think it's to be expected
that they're taking up ~60% of the execution time. That being said, obviously making things
faster is a good thing (as long as it doesn't come at the cost of breaking things).

Looking at the patch, I saw the following in {{org.apache.phoenix.mapreduce.CsvToKeyValueMapper#setup}}
{code}
        try {
            csvUpsertExecutor = buildUpsertExecutor(conf);
        } catch (SQLException e) {
            e.printStackTrace();
        }
{code}

We definitely want to throw that exception up the stack there rather than just printing the stack
trace, as otherwise this is just going to lead to an NPE later.
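To make the failure mode concrete, here's a minimal, self-contained sketch of what I mean. (This is not the actual Hadoop mapper; {{buildUpsertExecutor}} here is a hypothetical stand-in for the real method.) Rethrowing the checked {{SQLException}} as a {{RuntimeException}} makes the task fail fast with the real cause instead of leaving {{csvUpsertExecutor}} null:

```java
import java.sql.SQLException;

public class SetupExample {

    // Hypothetical stand-in for buildUpsertExecutor(conf).
    static Object buildUpsertExecutor(boolean fail) throws SQLException {
        if (fail) {
            throw new SQLException("cannot build upsert executor");
        }
        return new Object();
    }

    // Instead of swallowing the exception, rethrow it so the task fails fast.
    static Object setup(boolean fail) {
        try {
            return buildUpsertExecutor(fail);
        } catch (SQLException e) {
            // Mapper#setup can't throw SQLException directly, so wrap it.
            throw new RuntimeException("Error building CSV upsert executor", e);
        }
    }

    public static void main(String[] args) {
        // Success path returns a non-null executor.
        System.out.println(setup(false) != null);
        // Failure path surfaces the original SQLException as the cause.
        try {
            setup(true);
        } catch (RuntimeException e) {
            System.out.println(e.getCause() instanceof SQLException);
        }
    }
}
```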

I almost had the feeling that this patch is the combination of a couple of patches; could
that be? Or are all the changes in there necessary? For example, is the change in PArrayDataType
intended to be in this patch?

Also, considering that the optimization in this change is about speeding up the following
(pseudo-code) calling pattern:
{code}
for listOfValues in input:
    for value in listOfValues:
        preparedStatement.setObject(value)
    preparedStatement.execute()
{code}

would it be possible to apply this fix so that users of the public APIs will also take advantage
of it? I can imagine that there are a lot of realtime ingest use cases where the same prepared
statement is just being used over and over to ingest data, so I think it would be good if we can
minimize the work being done in (re-)compiling the statement every time there as well.
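As a toy illustration of why that calling pattern benefits from statement reuse ({{FakePreparedStatement}} below is an invented stand-in, not a Phoenix or JDBC class), the idea is that the "compile" cost is paid once at construction while {{execute}} runs once per record:

```java
import java.util.List;

public class StatementReuse {

    // Toy stand-in for a prepared statement: compiled once, executed many times.
    static class FakePreparedStatement {
        int compileCount = 0;
        int executeCount = 0;
        Object[] params;

        FakePreparedStatement(int paramCount) {
            compileCount++;            // "compilation" happens once, at construction
            params = new Object[paramCount];
        }

        void setObject(int i, Object v) {
            params[i - 1] = v;         // bind a parameter (1-indexed, like JDBC)
        }

        void execute() {
            executeCount++;            // per-record work; no recompilation
        }
    }

    public static void main(String[] args) {
        List<List<String>> input = List.of(
                List.of("a", "1"), List.of("b", "2"), List.of("c", "3"));

        // One statement reused for every record.
        FakePreparedStatement ps = new FakePreparedStatement(2);
        for (List<String> row : input) {
            for (int i = 0; i < row.size(); i++) {
                ps.setObject(i + 1, row.get(i));
            }
            ps.execute();
        }
        System.out.println(ps.compileCount);   // stays at 1
        System.out.println(ps.executeCount);   // 3, one per record
    }
}
```

If public-API callers followed the same reuse pattern, the compile-once cost would be amortized across the whole ingest stream rather than paid per upsert.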

> Improve performance of CSV loader
> ---------------------------------
>
>                 Key: PHOENIX-1711
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1711
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>         Attachments: PHOENIX-1711.patch
>
>
> Here is a break-up of percentage execution time for some of the steps in the mapper:
> csvParser: 18%
> csvUpsertExecutor.execute(ImmutableList.of(csvRecord)): 39%
> PhoenixRuntime.getUncommittedDataIterator(conn, true): 9%
> while (uncommittedDataIterator.hasNext()): 15%
> Read IO & custom processing: 19%
> See details here: http://s.apache.org/6rl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
