pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "pat chan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3341) Improving performance of loading datetime values
Date Mon, 03 Jun 2013 00:01:20 GMT

    [ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672729#comment-13672729
] 

pat chan commented on PIG-3341:
-------------------------------

Correction on option 3.
Implement a Joda date formatter using the spec in the w3c profile so only that format is supported.
                
> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by moving a single
line in ToDate.java:
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>       Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");;
> should become:
>     static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not sure if
this function is ever called concurrently, but Pattern objects are thread-safe anyways.
> As a test, I created a file of 10M timestamps:
>   for i in 0..10000000
>     puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> ----------------
> Another performance improvement can be made for invalid datetime values. If a datetime
value is invalid, an exception is created and thrown, which is a costly way to fail a validity
check. To test the performance impact, I created 10M invalid datetime values:
>   for i in 0..10000000
>     puts '2000-99-01T00:00:00+23'
>   end
> In this test, the regex pattern was always recompiled. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth changing. However,
if there are use cases where invalid dates are part of normal processing, then you might consider
fixing this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message