drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From paul-rogers <...@git.apache.org>
Subject [GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Date Tue, 13 Sep 2016 22:32:58 GMT
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518
  
    Poking around in the code a bit more, it looks like we rely on the Jackson JsonParser
which delivers a stream of tokens. Here's the code in JsonReader:
    
        public ReadState write(ComplexWriter writer) throws IOException {
          JsonToken t = parser.nextToken();
    
    As you pointed out, white space is consumed internally to the tokenizer. But we do get
JSON tokens ({, }, identifier, number, etc.) Given this, you can scan ahead to find the close-open
bracket pattern.
    
    The trick seems to solve two problems. First, how do we handle tokens: The answer seems
to be a few lines down:
    
        ReadState readState = writeToVector(writer, t);
    
    Given a start token, we write to vector. writeToVector is a recursive-descent parser that
consumes tokens and does the right thing. Here, it accepts only a top-level object, rejecting
all other tokens. For an object, we call writeDataSwitch (funny name, that) which recursively
writes fields, objects, lists and the rest.
    
    Since this is a recursive-descent parser, JSON structure is represented by the call stack.
To bail out of an invalid parse, we have to unwind the stack, which is what an exception does.
So, that part may be good.
    
    The next step is how to get the tokenizer (JsonParser) pointed at the right point so we
can try to read the next object. That is where we need to consume tokens until we find the
end-start bracket pair.
    
    But, note that we have now consumed the start bracket, so we can't read it again when
we gain call writeToVector. Checking the code, the JsonParser has no unget( ) method, unfortunately.
To work around that, we need to "push" the token back onto the input. (Either actually doing
so, or by having internal state that says that we've already read the open bracket.) 
    
    We also have to ask if the JSON parser keeps track of parse state. Looking at the code,
it seems that JsonParser is really just a tokenizer: it has not state about the JSON structure.
(Would be good to run a small test to verify this observation.)
    
    By the way, the JSON parser class has lots of good stuff already there. For example, the
parser itself will keep track of line numbers and file locations. Perhaps we can use that
when reporting error positions.
    
    The last bit is that we've been building up a record as we parser JSON. If we fail part
way thorugh, we've got a half-build record. Again, here I'm a bit hazy on what Drill can do.
Can we "unwind" the current record? Can we mark the record as one to ignore (with a select
vector?) Or, do we live with the half-build record? Throwing away the half-built record would
be best, if we can do it.
    
    All that said, how much of the above does your proposed code change handle? What other
parts might still need to be added?
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message