Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id C0A68200B92 for ; Wed, 14 Sep 2016 00:33:00 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id BF44D160AD3; Tue, 13 Sep 2016 22:33:00 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 1244B160AD2 for ; Wed, 14 Sep 2016 00:32:59 +0200 (CEST) Received: (qmail 2125 invoked by uid 500); 13 Sep 2016 22:32:59 -0000 Mailing-List: contact dev-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list dev@drill.apache.org Received: (qmail 2114 invoked by uid 99); 13 Sep 2016 22:32:58 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Sep 2016 22:32:58 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id A4B61E01BD; Tue, 13 Sep 2016 22:32:58 +0000 (UTC) From: paul-rogers To: dev@drill.apache.org Reply-To: dev@drill.apache.org References: In-Reply-To: Subject: [GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir... Content-Type: text/plain Message-Id: <20160913223258.A4B61E01BD@git1-us-west.apache.org> Date: Tue, 13 Sep 2016 22:32:58 +0000 (UTC) archived-at: Tue, 13 Sep 2016 22:33:00 -0000 Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/518 Poking around in the code a bit more, it looks like we rely on the Jackson JsonParser which delivers a stream of tokens. Here's the code in JsonReader: public ReadState write(ComplexWriter writer) throws IOException { JsonToken t = parser.nextToken(); As you pointed out, white space is consumed internally to the tokenizer. But we do get JSON tokens ({, }, identifier, number, etc.) Given this, you can scan ahead to find the close-open bracket pattern. The trick seems to solve two problems. First, how do we handle tokens: The answer seems to be a few lines down: ReadState readState = writeToVector(writer, t); Given a start token, we write to vector. writeToVector is a recursive-descent parser that consumes tokens and does the right thing. Here, it accepts only a top-level object, rejecting all other tokens. For an object, we call writeDataSwitch (funny name, that) which recursively writes fields, objects, lists and the rest. Since this is a recursive-descent parser, JSON structure is represented by the call stack. To bail out of an invalid parse, we have to unwind the stack, which is what an exception does. So, that part may be good. The next step is how to get the tokenizer (JsonParser) pointed at the right point so we can try to read the next object. That is where we need to consume tokens until we find the end-start bracket pair. But, note that we have now consumed the start bracket, so we can't read it again when we gain call writeToVector. Checking the code, the JsonParser has no unget( ) method, unfortunately. To work around that, we need to "push" the token back onto the input. (Either actually doing so, or by having internal state that says that we've already read the open bracket.) We also have to ask if the JSON parser keeps track of parse state. Looking at the code, it seems that JsonParser is really just a tokenizer: it has not state about the JSON structure. (Would be good to run a small test to verify this observation.) By the way, the JSON parser class has lots of good stuff already there. For example, the parser itself will keep track of line numbers and file locations. Perhaps we can use that when reporting error positions. The last bit is that we've been building up a record as we parser JSON. If we fail part way thorugh, we've got a half-build record. Again, here I'm a bit hazy on what Drill can do. Can we "unwind" the current record? Can we mark the record as one to ignore (with a select vector?) Or, do we live with the half-build record? Throwing away the half-built record would be best, if we can do it. All that said, how much of the above does your proposed code change handle? What other parts might still need to be added? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA. ---