Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Fri, 5 Oct 2012 11:53:47 +1100 (NCT)
From: "Andy Isaacson (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <1679886178.3471.1349398427785.JavaMail.jiratomcat@arcas>
In-Reply-To: <1888238835.65467.1347397568225.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (HDFS-3920) libwebdhfs code cleanup: string
 processing and using strerror consistently to handle all errors
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469928#comment-13469928 ] 

Andy Isaacson commented on HDFS-3920:
-------------------------------------

Overall, looking good.  A few more improvements:
{code}
-        rbuffer->content = realloc(rbuffer->content, rbuffer->offset + size * nmemb + 1);
+        rbuffer->content = realloc(rbuffer->content,
+                                   rbuffer->offset + size * nmemb + 1);
         if (rbuffer->content == NULL) {
{code}
At this point we've leaked the previous {{rbuffer->content}} because realloc does not free on error.

Better to do something like
{code}
void *tmp = realloc(buf->content, ...);
if (tmp == NULL) {
   ...
   return NULL;
}
buf->content = tmp;
{code}

{code}
+    curl = curl_easy_init();
+    if (!curl) {
+        errno = ENOMEM;     // curl_easy_init does return error code,
+                            // and most of its errors are caused by malloc()
{code}
Maybe the comment means to say "does *not* return error code"?  It doesn't make very much sense to me currently.

{code}
+    if(followloc == YES) {
{code}
Style nit: put a space between {{if}} and the (.  (Several cases of this problem.)

{code}
+    offset = snprintf(url, length + 1, "%s%s:%d%s%s?op=%s",
+                      protocol, host, nnPort, prefix, path, op);
+    for (i = 0; i < paraNum; i++) {
+        if (!paraNames[i] || !paraValues[i]) {
+            continue;
+        }
+        offset += snprintf(url + offset, length - offset + 1,
+                           "&%s=%s", paraNames[i], paraValues[i]);
{code}
After calling snprintf, you must verify that there is still room left in the buffer and error out if not.  If {{length - offset + 1}} becomes negative, you might underflow and write past the end of the buffer.

{code}
+#define NUM_OF_PERMISSION_BITS 3
+#define NUM_OF_SHORT_VALUE_BITS 5
+#define NUM_OF_LONG_VALUE_BITS 15
{code}

Based on reading the code, I think these are intended to be the number of characters to allocate to hold a "%d" formatting of a relevant type. The name should say so -- NUM_PERM_CHAR, NUM_SHORT_CHAR, NUM_LONG_CHAR perhaps?  And since they are *always* used with +1 for the NUL, you may as well just include that here, in which case calling it PERM_STR_LEN, SHORT_STR_LEN, LONG_STR_LEN might be better.  Also, add a comment explaining where the values come from.

If my understanding is correct, then the right value for LONG is 21, not 16, since 2^64 is 20 decimal digits.

Combining all of that, this becomes 
{code}
#define PERM_STR_LEN 4 // "644" + one byte for NUL
#define SHORT_STR_LEN 6 // 65535 + NUL
#define LONG_STR_LEN 21 // 2^64-1 = 18446744073709551615 + NUL
{code}


{code}
+    if(!header || header[0] == '\0' ||
+                  strncmp(header, "HTTP/", strlen("HTTP/"))) {
{code}
Simplify this to {{if (!header || strncmp(header, ...}} since strncmp will handle a zero-length-string just fine.

{code}
-    if(strcmp(result,"0")) {                 //Content-Length should be equal to 0
+    if(strncmp(result, "0", 1)) {
{code}
This will do the wrong thing if result is "01".  I think the strcmp version is more correct.

{code}
+    const char *responseCode = "307 TEMPORARY_REDIRECT";
+    if(!header || header[0] == '\0' || strncmp(header,"HTTP/", 5)) {
...
+    }
+    if(!(tempHeader = strstr(header,responseCode))) {
{code}
# no need to special case the zero-length-string case
# strstr is not right here, it would succeed if given {{"HTTP/1.0 foobar baz 307 TEMPORARY_REDIRECT"}} which should fail. Better to {{strspn(" \t"}} then strncmp.
# space after comma.

{code}
+    if(header == '\0' || strncmp(header,"HTTP/", strlen("HTTP/"))) {
{code}
should be {{if ((header == NULL || strncmp...}}.  Do not confuse the pointer {{NULL}} with the char NUL {{'\0'}}.

{code}
+    if(!(strstr(header,responseCode1) && strstr(header, responseCode2))) {
{code}
More inappropriate use of strstr.

{code}
+    if(arraylen > 0) {
+        fileStat = realloc(fileStat, sizeof(hdfsFileInfo) * arraylen);
+    }
{code}
# need to check for realloc failing
# similar to above realloc comment, shouldn't reassign to fileStat without a temporary.
{code}
+static hdfsFileInfo *json_parse_array(json_t *jobj, hdfsFileInfo *fileStat,
+                                      int *numEntries, const char *operation,
+                                      int printError)
{code}
The name {{json_parse_array}} is too generic for something that takes a {{hdfsFileInfo}} argument.  Perhaps {{parse_json_fileinfo}} or similar.

The function has a very tricky definition -- it reallocates {{fileStat}} and returns the new buffer.  That deserves a doc comment at least, or a redesign where the argument is {{hdfsFileInfo **fileStat}}.
{code}
+    void *iter = json_object_iter(jobj);
+    while(iter)  {
+        key = json_object_iter_key(iter);
+        value = json_object_iter_value(iter);
+        switch (json_typeof(value)) {
+            case JSON_INTEGER:
+                if(!strcmp(key, "accessTime")) {
+                    fileStat->mLastAccess = json_integer_value(value) / 1000;
{code}
It's very confusing to have the {{strcmp(key}} inside the {{switch(type)}} code block.  This makes the decision flow very hard to follow.

Unless the units are obvious, magic numbers like {{/ 1000}} need a comment explaining why the value is being converted (from milliseconds to seconds?  or microseconds?).  Something like {{// json field contains time in milliseconds, hdfsFileInfo is counted in microseconds}} or similar.
{code}
+            default:
+                free(fileStat);
+                fileStat = NULL;
+        }
{code}
If we hit this default, the next loop through will segfault.  Better to exit the loop explicitly with a return or a test variable or a goto. If this is handling some quirk of {{json_object_iter_value}} then a comment is needed.

{code}
+                "file creation/appending, <%d>:%s.\n",
{code}
Some of these are missing a space after the {{:}}. I don't really care what format you use, but please be consistent.
{code}
+        newWorkingDir = normalizePath(newWorkingDir);
{code}
This leaks the old {{newWorkingDir}}.  But see below for an alternate design.
{code}
+/** Replace "//" with "/" in path */
+static char *normalizePath(char *path)
+{
+    const char *delimiter = "/";
...
+    token = strtok_r(path, delimiter, &savePtr);
{code}
This function would be much simpler if it did not generate the leading {{/}} (which it isn't documented to do) and modified the path in-place using an explicit bytewise loop rather than using {{strtok}}.  Something like
{code}
for (i=j=sawslash=0; i<pathlen; i++) {
   if (path[i] == '/') {
       if (sawslash) {
           continue;
       } else {
          sawslash = 1;
       }
    } else {
       sawslash = 0;
       path[j++] = path[i];
    }
}
path[j] = '\0';
{code}

Overall, very good cleanup.  Take care of the above and let's get this checked in and continue to make progress.

There's still quite a bit of hairy code in libwebhdfs that needs careful review and testing.  I'm still concerned about how much explicit string manipulation in C is being done; there are many places with fragile {{strncat}} and {{snprintf}} loops with unchecked potentially underflowable computations.  Keep in mind that this code will potentially be exposed to hostile network traffic and a single byte buffer overwrite gives the attacker code execution with high likelihood.  This is native C, not Java, and nearly every buffer offset bug is a critical code execution bug.  It would be much easier to see a path to confidence in this code if there were less ad-hoc string manipulation.
                
> libwebdhfs code cleanup: string processing and using strerror consistently to handle all errors
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3920
>                 URL: https://issues.apache.org/jira/browse/HDFS-3920
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-3920-001.patch, HDFS-3920-001.patch, HDFS-3920-002.patch, HDFS-3920-003.patch, HDFS-3920-004.patch, HDFS-3920-005.patch
>
>
> 1. Clean up code for string processing;
> 2. Using strerror consistently for error handling;
> 3. Use sprintf to replace decToOctal
> 4. other issues requiring fixing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira