Mailing-List: contact dev-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
From: paul-rogers <git@git.apache.org>
To: dev@drill.apache.org
Reply-To: dev@drill.apache.org
References: <git-pr-518-drill@git.apache.org>
In-Reply-To: <git-pr-518-drill@git.apache.org>
Subject: [GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Content-Type: text/plain
Message-Id: <20160913223258.A4B61E01BD@git1-us-west.apache.org>
Date: Tue, 13 Sep 2016 22:32:58 +0000 (UTC)
archived-at: Tue, 13 Sep 2016 22:33:00 -0000

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518
  
    Poking around in the code a bit more, it looks like we rely on the Jackson JsonParser which delivers a stream of tokens. Here's the code in JsonReader:
    
        public ReadState write(ComplexWriter writer) throws IOException {
          JsonToken t = parser.nextToken();
    
    As you pointed out, white space is consumed internally to the tokenizer. But we do get JSON tokens ({, }, identifier, number, etc.) Given this, you can scan ahead to find the close-open bracket pattern.
    
    The trick seems to solve two problems. First, how do we handle tokens: The answer seems to be a few lines down:
    
        ReadState readState = writeToVector(writer, t);
    
    Given a start token, we write to vector. writeToVector is a recursive-descent parser that consumes tokens and does the right thing. Here, it accepts only a top-level object, rejecting all other tokens. For an object, we call writeDataSwitch (funny name, that) which recursively writes fields, objects, lists and the rest.
    
    Since this is a recursive-descent parser, JSON structure is represented by the call stack. To bail out of an invalid parse, we have to unwind the stack, which is what an exception does. So, that part may be good.
    
    The next step is how to get the tokenizer (JsonParser) pointed at the right point so we can try to read the next object. That is where we need to consume tokens until we find the end-start bracket pair.
    
    But, note that we have now consumed the start bracket, so we can't read it again when we gain call writeToVector. Checking the code, the JsonParser has no unget( ) method, unfortunately. To work around that, we need to "push" the token back onto the input. (Either actually doing so, or by having internal state that says that we've already read the open bracket.) 
    
    We also have to ask if the JSON parser keeps track of parse state. Looking at the code, it seems that JsonParser is really just a tokenizer: it has not state about the JSON structure. (Would be good to run a small test to verify this observation.)
    
    By the way, the JSON parser class has lots of good stuff already there. For example, the parser itself will keep track of line numbers and file locations. Perhaps we can use that when reporting error positions.
    
    The last bit is that we've been building up a record as we parser JSON. If we fail part way thorugh, we've got a half-build record. Again, here I'm a bit hazy on what Drill can do. Can we "unwind" the current record? Can we mark the record as one to ignore (with a select vector?) Or, do we live with the half-build record? Throwing away the half-built record would be best, if we can do it.
    
    All that said, how much of the above does your proposed code change handle? What other parts might still need to be added?
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---