hadoop-common-dev mailing list archives

From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1758) processing escapes in a jute record is quadratic
Date Wed, 22 Aug 2007 22:42:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521950 ]

Milind Bhandarkar commented on HADOOP-1758:
-------------------------------------------

Dick,

EricW provided this replacement for hadoop::ICsvArchive::deserialize, which he said worked
great for him. Can you try it out?

{code}
void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
{
   char c;
   if (1 != stream.read(&c, 1)) {
     throw new IOException("Error in deserialization.");
   }
   if (c != '\'') {
     throw new IOException("Error deserializing string.");
   }
   while (1) {
     if (1 != stream.read(&c, 1)) {
       throw new IOException("Error in deserialization.");
     }
     if (c == ',' || c == '\n' || c == '}') {
       if (c != ',') {
         stream.pushBack(c);
       }
       break;
     }
     else if (c == '%') {
       char d[2];
       if (2 != stream.read(&d, 2)) {
         throw new IOException("Error in deserialization.");
       }
       if (strncmp(d, "0D", 2) == 0) {
         t.push_back(0x0D);
       }
       else if (strncmp(d, "0A", 2) == 0) {
         t.push_back(0x0A);
       }
       else if (strncmp(d, "7D", 2) == 0) {
         t.push_back(0x7D);
       }
       else if (strncmp(d, "00", 2) == 0) {
         t.push_back(0x00);
       }
       else if (strncmp(d, "2C", 2) == 0) {
         t.push_back(0x2C);
       }
       else if (strncmp(d, "25", 2) == 0) {
         t.push_back(0x25);
       }
       else {
         // pass an unrecognized escape through unchanged
         t.push_back(c);
         t.push_back(d[0]);
         t.push_back(d[1]);
       }
     }
     else {
       t.push_back(c);
     }
   }
}
{code}

> processing escapes in a jute record is quadratic
> ------------------------------------------------
>
>                 Key: HADOOP-1758
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1758
>             Project: Hadoop
>          Issue Type: Bug
>          Components: record
>    Affects Versions: 0.13.0
>            Reporter: Dick King
>            Priority: Blocker
>
> The following code appears in hadoop/src/c++/librecordio/csvarchive.cc :
> static void replaceAll(std::string s, const char *src, char c)
> {
>   std::string::size_type pos = 0;
>   while (pos != std::string::npos) {
>     pos = s.find(src);
>     if (pos != std::string::npos) {
>       s.replace(pos, strlen(src), 1, c);
>     }
>   }
> }
> This is used in the context of replacing jute escapes in the code:
> void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
> {
>   t = readUptoTerminator(stream);
>   if (t[0] != '\'') {
>     throw new IOException("Errror deserializing string.");
>   }
>   t.erase(0, 1); /// erase first character
>   replaceAll(t, "%0D", 0x0D);
>   replaceAll(t, "%0A", 0x0A);
>   replaceAll(t, "%7D", 0x7D);
>   replaceAll(t, "%00", 0x00);
>   replaceAll(t, "%2C", 0x2C);
>   replaceAll(t, "%25", 0x25);
> }
> Since this replaces the entire string for each instance of the escape sequence, practically
> anything would be better.  I would propose that within deserialize we allocate a char * [since
> each replacement is smaller than the original], scan for each %, and either do a general hex
> conversion in place or look for one of the six patterns, and after each replacement move down
> the unmodified text and scan for the % from that starting point.
> -dk
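For what it's worth, the single-pass scan dk proposes could look roughly like the sketch below. The name unescapeJute is made up for illustration and this is not from any patch; it uses the general hex conversion dk mentions rather than matching the six patterns one by one.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical sketch of a single-pass unescape: scan once for '%',
// decode the two hex digits in place, and append everything else
// verbatim -- O(n) instead of the quadratic replaceAll loop.
static std::string unescapeJute(const std::string& s) {
  std::string out;
  out.reserve(s.size());  // the result is never longer than the input
  for (std::string::size_type i = 0; i < s.size(); ) {
    if (s[i] == '%' && i + 2 < s.size()
        && isxdigit((unsigned char) s[i + 1])
        && isxdigit((unsigned char) s[i + 2])) {
      // general hex conversion covers all six jute escapes
      char hex[3] = { s[i + 1], s[i + 2], '\0' };
      out.push_back((char) strtol(hex, NULL, 16));
      i += 3;
    } else {
      // plain characters and malformed escapes pass through unchanged
      out.push_back(s[i]);
      ++i;
    }
  }
  return out;
}
```

A general hex conversion like this handles %0D, %0A, %7D, %00, %2C, and %25 without pattern-by-pattern matching.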

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

