hadoop-pig-dev mailing list archives

From "Andre Savien" <andre.sav...@gmail.com>
Subject Single quote in function specification parameter
Date Wed, 19 Nov 2008 16:05:43 GMT
Greetings.

We created a new Pig function that calculates the difference between two dates.
Its constructor takes parameters (date format strings), so we have to
define an alias for it first.
Our dates are given in ISO format, so the format string should be "yyyyMMdd'T'HHmmssZ".
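
For illustration, here is a minimal sketch of why the single quotes matter. It is not our actual UDF: the class and method names are hypothetical, and it only assumes the standard java.text.SimpleDateFormat, where text in single quotes is a literal and not a pattern letter.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for the date-difference function, for illustration only.
class DateDiffSketch {
    // Parses both timestamps with the same pattern and returns the
    // difference in whole minutes.
    static long minutesBetween(String pattern, String from, String to) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        try {
            Date start = format.parse(from);
            Date end = format.parse(to);
            return (end.getTime() - start.getTime()) / 60000L;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse date", e);
        }
    }

    public static void main(String[] args) {
        // 'T' is quoted so that it is matched literally.
        String pattern = "yyyyMMdd'T'HHmmssZ";
        System.out.println(minutesBetween(pattern,
                "20081119T160543+0000", "20081119T170543+0000"));
    }
}
```

An unquoted T is not a valid pattern letter for SimpleDateFormat, so the quotes inside the format string cannot simply be dropped.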

Sadly, there is no way to pass this string as is.
I know that the Pig grammar supports escaping, but a function definition is
a special case.
Everything after "define" is passed to org.apache.pig.impl.PigContext and
parsed there.
That parsing is meant to support escaping, but the code is wrong:

/////////////////////////////
private static List<String> parseArguments(String argString) {
    List<String> args = new ArrayList<String>();

    int startIndex = 0;
    int endIndex;
    while (startIndex < argString.length()) {
        while (startIndex < argString.length() && argString.charAt(startIndex++) != '\'')
            ;
        endIndex = startIndex;
        while (endIndex < argString.length() && argString.charAt(endIndex) != '\'') {
            if (argString.charAt(endIndex) == '\\')
                endIndex++;
            endIndex++;
        }
        if (endIndex < argString.length()) {
            args.add(argString.substring(startIndex, endIndex));
        }
        startIndex = endIndex + 1;
    }
    return args;
}
/////////////////////////////

If you try to parse something like <'yyyyMMdd'T'HHmmssZ', 'yyyyMMdd'T'HHmmssZ'>,
you get four elements as the result: "yyyyMMdd", "HHmmssZ",
"yyyyMMdd", "HHmmssZ".

If you try to parse something like <'yyyyMMdd\'T\'HHmmssZ',
'yyyyMMdd\'T\'HHmmssZ'>,
you get two elements as the result, but the "\" is not consumed:
"yyyyMMdd\'T\'HHmmssZ", "yyyyMMdd\'T\'HHmmssZ".
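
Both cases can be reproduced outside of Pig with a small standalone harness. This is only a sketch: the class name OriginalParseDemo is made up, and the method body is a copy of the parseArguments logic quoted above, not a reference to the real PigContext class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical harness; parseArguments is copied from the code quoted above.
class OriginalParseDemo {
    static List<String> parseArguments(String argString) {
        List<String> args = new ArrayList<String>();
        int startIndex = 0;
        int endIndex;
        while (startIndex < argString.length()) {
            // Skip to just past the next quote.
            while (startIndex < argString.length()
                    && argString.charAt(startIndex++) != '\'')
                ;
            endIndex = startIndex;
            // Scan to the next unescaped quote; a backslash skips one character.
            while (endIndex < argString.length()
                    && argString.charAt(endIndex) != '\'') {
                if (argString.charAt(endIndex) == '\\')
                    endIndex++;
                endIndex++;
            }
            if (endIndex < argString.length()) {
                // The raw substring is used, so backslashes stay in the result.
                args.add(argString.substring(startIndex, endIndex));
            }
            startIndex = endIndex + 1;
        }
        return args;
    }

    public static void main(String[] args) {
        // Unescaped inner quotes: each argument is split in two.
        System.out.println(parseArguments(
                "'yyyyMMdd'T'HHmmssZ', 'yyyyMMdd'T'HHmmssZ'"));
        // Escaped inner quotes: two arguments, but the backslashes remain.
        System.out.println(parseArguments(
                "'yyyyMMdd\\'T\\'HHmmssZ', 'yyyyMMdd\\'T\\'HHmmssZ'"));
    }
}
```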



I've slightly modified the parseArguments method so that it works for me, but
usage is still tricky because of
"double escaping" (the string is unescaped once by the Pig grammar and once more by parseArguments):
define MinuteDiff DateDiff('yyyyMMdd\\\'T\\\'HHmmssZ',
'yyyyMMdd\\\'T\\\'HHmmssZ', 'm');

/////////////////////////////
private static List<String> parseArguments(String argString) {
    List<String> args = new ArrayList<String>();

    int index = 0;
    while (index < argString.length()) {
        // Skip to just past the opening quote of the next argument.
        while (index < argString.length() && argString.charAt(index++) != '\'')
            ;
        boolean escape = false;
        StringBuilder arg = new StringBuilder();
        // Copy characters up to the first unescaped quote.
        while (index < argString.length() && (escape || argString.charAt(index) != '\'')) {
            if (escape || argString.charAt(index) != '\\') {
                // Ordinary or escaped character: keep it.
                escape = false;
                arg.append(argString.charAt(index));
            } else {
                // Backslash: drop it and take the next character literally.
                escape = true;
            }
            index++;
        }
        // Add the argument only if a closing quote was found.
        if (index < argString.length()) {
            args.add(arg.toString());
        }
        index++;
    }
    return args;
}
/////////////////////////////
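
To make the three backslashes in the define line above easier to follow, here is a rough two-pass sketch. It rests on an assumption: firstPassUnescape is a hypothetical model of what the Pig grammar does to the text before PigContext sees it (each backslash escapes the next character), not Pig's real code. The second pass is a copy of the modified parseArguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two unescaping passes; not actual Pig code.
class DoubleEscapeSketch {
    // ASSUMPTION: models the grammar-level pass. A backslash is dropped and
    // the character after it is kept as is.
    static String firstPassUnescape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                out.append(text.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Second pass: copy of the modified parseArguments shown above.
    static List<String> parseArguments(String argString) {
        List<String> args = new ArrayList<String>();
        int index = 0;
        while (index < argString.length()) {
            while (index < argString.length()
                    && argString.charAt(index++) != '\'')
                ;
            boolean escape = false;
            StringBuilder arg = new StringBuilder();
            while (index < argString.length()
                    && (escape || argString.charAt(index) != '\'')) {
                if (escape || argString.charAt(index) != '\\') {
                    escape = false;
                    arg.append(argString.charAt(index));
                } else {
                    escape = true;
                }
                index++;
            }
            if (index < argString.length()) {
                args.add(arg.toString());
            }
            index++;
        }
        return args;
    }

    public static void main(String[] args) {
        // Script text as typed: three backslashes before each inner quote.
        String script = "'yyyyMMdd\\\\\\'T\\\\\\'HHmmssZ', 'm'";
        String afterGrammar = firstPassUnescape(script);
        System.out.println(afterGrammar);                 // one backslash left
        System.out.println(parseArguments(afterGrammar)); // none left
    }
}
```

Under this model, the first pass turns \\ into \ and \' into ', so parseArguments receives \' and finally produces a plain single quote.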

I think the main problem is not the code of parseArguments itself but
the double parsing, so I do not propose
my code as a patch.
It would be much better to parse the function specification with the PigLatin parser and
use the already parsed results in PigContext.

--
Mefi
