pig-dev mailing list archives

From "Prashant Kommireddi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions
Date Sat, 23 Mar 2013 23:59:14 GMT

    [ https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611905#comment-13611905 ]

Prashant Kommireddi commented on PIG-3259:
------------------------------------------

I have some results from preliminary testing. The idea is to first check whether the
bytearray is numeric (an integer or a decimal) and make the valueOf calls accordingly.

*Test code*
{code}
    static void testBytesToLong() {
        String input = "114121.1321";
        int n = 100000000;

        // Time the optimized conversion
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            Long l = bytesToLongOptimized(input);
        }
        System.out.println("Elapsed (optimized): " + (System.currentTimeMillis() - start));

        // Time the current conversion
        start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            Long l = bytesToLong(input);
        }
        System.out.println("Elapsed (current): " + (System.currentTimeMillis() - start));
    }
{code}

*Current implementation (same logic, minus logging)*
{code}
    static Long bytesToLong(String number) {
        if (sanityCheckIntegerLong(number)) {
            return Long.valueOf(number);
        }

        try {
            return Long.valueOf(Double.valueOf(number).longValue());
        } catch (NumberFormatException e) {
            return null;
        }
    }
{code}

*Optimized code*
{code}
    static Long bytesToLongOptimized(String number) {
        if (SanityChecker.sanityCheckIntegerLongDecimal(number)) {
            if(!SanityChecker.isDecimal()) {
                return Long.valueOf(number);
            } 
            return Long.valueOf(Double.valueOf(number).longValue());
        }
       
        return null;
    }

    private static class SanityChecker {
        // Counts the dots (periods) seen in the string. Note that this is
        // shared static state, so the checker is not thread-safe.
        static int numDots = 0;
        
        private static boolean sanityCheckIntegerLongDecimal(String number) {
            // Reset counter on each call
            reset();
            for (int i=0; i < number.length(); i++){
                char c = number.charAt(i);
                if ((c >= '0' && c <= '9') || (i == 0 && c == '-')
                        || (c == '.' && ++numDots < 2)) {
                    // valid one
                }
                else{
                    // contains invalid characters, so it cannot be an integer, long or decimal.
                    return false;
                }
            }
            return true;
        }

        private static void reset() {
            numDots = 0;
        }
        
        private static boolean isDecimal() {
            return numDots == 1;
        }
    }
{code}

For valid Long inputs there is not much difference in runtime between the current and
optimized versions. However, the difference is significant for invalid Longs (e.g. "123foo",
"10.2.3.10"). I will attach my findings soon.

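To illustrate why invalid inputs are where the gain shows up, here is a minimal, self-contained sketch of the same pre-check idea. It is not the patch itself: the class name LongParseSketch, the Kind enum and the method names are hypothetical. It also assumes two things beyond the code above: strings with no digits at all ("", "-", ".") are treated as invalid, and an integer that overflows a long returns null. Invalid strings such as "123foo" are rejected by the character scan, so neither Long.valueOf nor Double.valueOf is called and no NumberFormatException is raised for them.

```java
public class LongParseSketch {
    // Hypothetical classification of a candidate numeric string.
    enum Kind { INTEGER, DECIMAL, INVALID }

    // Single pass over the characters, using local counters instead of static state.
    static Kind classify(String s) {
        int dots = 0;
        int digits = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            } else if (c == '.' && ++dots < 2) {
                // first dot is allowed
            } else if (c == '-' && i == 0) {
                // leading minus sign is allowed
            } else {
                return Kind.INVALID;
            }
        }
        // Assumption: "", "-" and "." carry no digits and are treated as invalid.
        if (digits == 0) {
            return Kind.INVALID;
        }
        return dots == 1 ? Kind.DECIMAL : Kind.INTEGER;
    }

    static Long parseLongOrNull(String s) {
        switch (classify(s)) {
            case INTEGER:
                try {
                    return Long.valueOf(s);
                } catch (NumberFormatException e) {
                    // Assumption: digits-only input that overflows a long yields null.
                    return null;
                }
            case DECIMAL:
                // Truncates the fractional part, as the existing conversion does.
                return Long.valueOf(Double.valueOf(s).longValue());
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLongOrNull("114121.1321")); // prints 114121
        System.out.println(parseLongOrNull("-42"));         // prints -42
        System.out.println(parseLongOrNull("123foo"));      // prints null
        System.out.println(parseLongOrNull("10.2.3.10"));   // prints null
    }
}
```

Keeping the dot and digit counters local avoids the shared numDots field, at the cost of returning a small enum instead of a boolean.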
                
> Optimize byte to Long/Integer conversions
> -----------------------------------------
>
>                 Key: PIG-3259
>                 URL: https://issues.apache.org/jira/browse/PIG-3259
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.11.1
>            Reporter: Prashant Kommireddi
>             Fix For: 0.12
>
>
> These conversions could perform better. If the input is not numeric (1234abcd), the
> code calls Double.valueOf(String) regardless before finally returning null. Any script that
> inadvertently (user's mistake or not) tries to cast an alphanumeric column to int or long would
> result in many wasteful calls.
> We can avoid this by handling only the cases where the input is found to be a decimal number
> (1234.56) and returning null otherwise, before even trying Double.valueOf(String).

