hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1758) processing escapes in a jute record is quadratic
Date Thu, 30 Aug 2007 05:02:30 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vivek Ratan updated HADOOP-1758:
--------------------------------

    Attachment: 1758_01.patch

Attached patch (1758_01.patch). while this fix is much more efficient than the current code,
it is not the most efficient. We read data from a stream into a temporary string, then append
to our string. A more efficient solution would be as EricW provided in an earlier comment.
However, the solution in the patch is only slightly more inefficient, but is a lot simpler
to read and understand, and it also mirrors the Java code. These are both important, IMO.
If ever an extra string copy does indeed cause performance problems, we can change the code.


> processing escapes in a jute record is quadratic
> ------------------------------------------------
>
>                 Key: HADOOP-1758
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1758
>             Project: Hadoop
>          Issue Type: Bug
>          Components: record
>    Affects Versions: 0.13.0
>            Reporter: Dick King
>            Assignee: Vivek Ratan
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: 1758_01.patch
>
>
> The following code appears in hadoop/src/c++/librecordio/csvarchive.cc :
> static void replaceAll(std::string s, const char *src, char c)
> {
>   std::string::size_type pos = 0;
>   while (pos != std::string::npos) {
>     pos = s.find(src);
>     if (pos != std::string::npos) {
>       s.replace(pos, strlen(src), 1, c);
>     }
>   }
> }
> This is used in the context of replacing jute escapes in the code:
> void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
> {
>   t = readUptoTerminator(stream);
>   if (t[0] != '\'') {
>     throw new IOException("Errror deserializing string.");
>   }
>   t.erase(0, 1); /// erase first character
>   replaceAll(t, "%0D", 0x0D);
>   replaceAll(t, "%0A", 0x0A);
>   replaceAll(t, "%7D", 0x7D);
>   replaceAll(t, "%00", 0x00);
>   replaceAll(t, "%2C", 0x2C);
>   replaceAll(t, "%25", 0x25);
> }
> Since this replaces the entire string for each instance of the escape sequence, practically
anything would be better.  I would propose that within deserialize we allocate a char * [since
each replacement is smaller than the original], scan for each %, and either do a general hex
conversion in place or look for one of the six patterns, and after each replacement move down
the unmodified text and scan for the % fom that starting point.
> -dk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message