hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5109) timestamps are stored unencoded causing parse errors
Date Thu, 19 May 2016 00:30:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290166#comment-15290166
] 

Sangjin Lee commented on YARN-5109:
-----------------------------------

Here is a proposal. Instead of blindly splitting along separator boundaries, we should be able
to tell the Separator to tokenize by size as well. The following is a quick prototype I put
together to show the concept, as a {{splitRanges()}} method:

{code:title=TimelineStorageUtils.java|borderStyle=solid}
  public static final int NO_LIMIT = 0;

  public static List<Range> splitRanges(byte[] source, byte[] separator,
      int[] sizes) {
    List<Range> segments = new ArrayList<Range>();
    if (source == null || separator == null || sizes == null) {
      return segments;
    }
    int start = 0;
    int i = 0;
    int k = 0;
    itersource: while (i < source.length && segments.size() < sizes.length) {
      int currentTokenSize = sizes[k];
      if (currentTokenSize > NO_LIMIT) {
        // we explicitly grab a fixed number of bytes
        if (start + currentTokenSize > source.length) {
          // it's seeking beyond the source boundary
          throw new IllegalArgumentException("source is " + source.length +
              " bytes long and we're asking for " + (start + currentTokenSize));
        }
        segments.add(new Range(start, start + currentTokenSize));
        start += currentTokenSize;
        i += currentTokenSize;
        k++;
        // if there is more to parse, there must be a separator; strip it
        if (k <= sizes.length - 1) {
          if (i + separator.length > source.length) {
            throw new IllegalArgumentException("separator is expected");
          }
          for (int j = 0; j < separator.length; j++) {
            if (source[i + j] != separator[j]) {
              throw new IllegalArgumentException("separator is expected");
            }
          }
          // matched the separator
          start = i + separator.length;
          i += separator.length;
        }
      } else if (currentTokenSize == NO_LIMIT) { // use the separator
        // scan forward until the entire separator matches
        for (int j = 0; j < separator.length; j++) {
          if (i + j >= source.length || source[i + j] != separator[j]) {
            i++;
            continue itersource;
          }
        }
        // we just matched all separator elements
        segments.add(new Range(start, i));
        start = i + separator.length;
        i += separator.length;
        k++;
      } else {
        throw new IllegalArgumentException("negative size provided");
      }
    }
    // add the final segment
    if (start < source.length && segments.size() < sizes.length) {
      // by deduction this can happen only if the token size = NO_LIMIT
      segments.add(new Range(start, source.length));
    }
    return segments;
  }
{code}

You can instruct the utility on what token sizes to expect when it parses the bytes. A value
of 0 ({{NO_LIMIT}}) keeps the existing parsing behavior: parse until you hit the separator. A
positive value means that many bytes are grabbed verbatim, whether or not a separator byte
appears in the middle. For example, if we expect the structure to be a string, a long (as
bytes), and a string, we can invoke it with \{0, Long.BYTES, 0\}.

That way, we can dictate how the row keys and column name qualifiers should be parsed.
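To make the behavior concrete, here is a self-contained sketch. The {{Range}} class and {{concat()}} helper below are minimal stand-ins of my own for illustration (not the real storage classes), and the parsing loop is a condensed version of the prototype above with the error checks omitted:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRangesDemo {
  // minimal stand-in for the timeline storage Range class
  static class Range {
    final int start, end;
    Range(int start, int end) { this.start = start; this.end = end; }
  }

  static final int NO_LIMIT = 0;

  // condensed version of the splitRanges() prototype (error checks omitted)
  static List<Range> splitRanges(byte[] source, byte[] sep, int[] sizes) {
    List<Range> out = new ArrayList<>();
    int start = 0, i = 0, k = 0;
    scan: while (i < source.length && out.size() < sizes.length) {
      if (sizes[k] > NO_LIMIT) {
        // fixed-size token: take the bytes verbatim, then skip the separator
        out.add(new Range(start, start + sizes[k]));
        start += sizes[k];
        i += sizes[k];
        k++;
        if (k < sizes.length) {
          start = i + sep.length;
          i += sep.length;
        }
      } else {
        // variable-size token: scan until the full separator matches
        for (int j = 0; j < sep.length; j++) {
          if (i + j >= source.length || source[i + j] != sep[j]) {
            i++;
            continue scan;
          }
        }
        out.add(new Range(start, i));
        start = i + sep.length;
        i += sep.length;
        k++;
      }
    }
    if (start < source.length && out.size() < sizes.length) {
      out.add(new Range(start, source.length)); // trailing variable token
    }
    return out;
  }

  // hypothetical helper to build the test input
  static byte[] concat(byte[]... parts) {
    int len = 0;
    for (byte[] p : parts) len += p.length;
    byte[] out = new byte[len];
    int pos = 0;
    for (byte[] p : parts) {
      System.arraycopy(p, 0, out, pos, p.length);
      pos += p.length;
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] sep = { '=' };
    // an 8-byte inverted timestamp that happens to contain '=' (0x3D),
    // modeled on the failing column name from this issue
    byte[] ts = { 0x7F, (byte) 0xFF, (byte) 0xFE, (byte) 0xAB, 'D', 'Y', '=', (byte) 0x99 };
    byte[] source = concat("CREATED".getBytes(), sep, ts, sep, "HOST".getBytes());

    for (Range r : splitRanges(source, sep, new int[] { 0, 8, 0 })) {
      System.out.println(r.start + ".." + r.end);
    }
    // prints 0..7, 8..16, 17..21
  }
}
```

The middle token comes back as the full 8 bytes (range 8..16) even though it contains the separator byte 0x3D ('='), which is exactly the case that currently breaks.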

> timestamps are stored unencoded causing parse errors
> ----------------------------------------------------
>
>                 Key: YARN-5109
>                 URL: https://issues.apache.org/jira/browse/YARN-5109
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Varun Saxena
>            Priority: Blocker
>              Labels: yarn-2928-1st-milestone
>
> When we store timestamps (for example as part of the row key or part of the column name
for an event), the bytes are used as is without any encoding. If the byte value happens to
contain a separator character we use (e.g. "!" or "="), it causes a parse failure when we
read it.
> I came across this while looking into this error in the timeline reader:
> {noformat}
> 2016-05-17 21:28:38,643 WARN org.apache.hadoop.yarn.server.timelineservice.storage.common.TimelineStorageUtils:
incorrectly formatted column name: it will be discarded
> {noformat}
> I traced the data that was causing this, and the column name (for the event) was the
following:
> {noformat}
> i:e!YARN_RM_CONTAINER_CREATED=\x7F\xFF\xFE\xABDY=\x99=YARN_CONTAINER_ALLOCATED_HOST
> {noformat}
> Note that the column name is supposed to be of the format (event id)=(timestamp)=(event
info key). However, observe the timestamp portion:
> {noformat}
> \x7F\xFF\xFE\xABDY=\x99
> {noformat}
> The presence of the separator ("=") causes the parse error.
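For reference, a standalone sketch (my own illustration, not part of any patch) showing that the raw bytes of the inverted timestamp above really do contain the separator byte:

```java
import java.nio.ByteBuffer;

public class SeparatorCollision {
  public static void main(String[] args) {
    // the long whose big-endian bytes are \x7F\xFF\xFE\xABDY=\x99,
    // taken from the failing column name above
    long inverted = 0x7FFFFEAB44593D99L;
    byte[] raw = ByteBuffer.allocate(Long.BYTES).putLong(inverted).array();
    for (byte b : raw) {
      System.out.printf("%02X ", b);
    }
    System.out.println();
    // raw[6] is 0x3D, i.e. '=', so a separator-based split cuts the
    // timestamp in half when the column name is read back
    System.out.println(raw[6] == (byte) '=');  // true
  }
}
```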



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


