Return-Path: X-Original-To: apmail-drill-dev-archive@www.apache.org Delivered-To: apmail-drill-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id BAE4E1864C for ; Sat, 29 Aug 2015 02:32:38 +0000 (UTC) Received: (qmail 79470 invoked by uid 500); 29 Aug 2015 02:32:38 -0000 Delivered-To: apmail-drill-dev-archive@drill.apache.org Received: (qmail 79415 invoked by uid 500); 29 Aug 2015 02:32:38 -0000 Mailing-List: contact dev-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list dev@drill.apache.org Received: (qmail 79404 invoked by uid 500); 29 Aug 2015 02:32:38 -0000 Delivered-To: apmail-incubator-drill-dev@incubator.apache.org Received: (qmail 79398 invoked by uid 99); 29 Aug 2015 02:32:38 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 29 Aug 2015 02:32:38 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 2C4AE26D043; Sat, 29 Aug 2015 02:32:37 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============4128759614709225414==" MIME-Version: 1.0 Subject: Re: Review Request 37893: DRILL-3718: TSV reader fails when "" appears From: "Sean Hsuan-Yi Chu" To: "Jacques Nadeau" , "Mehant Baid" Cc: "Sean Hsuan-Yi Chu" , "drill" Date: Sat, 29 Aug 2015 02:32:37 -0000 Message-ID: <20150829023237.13583.98354@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: "Sean Hsuan-Yi Chu" X-ReviewGroup: drill-git X-Auto-Response-Suppress: DR, RN, OOF, AutoReply X-ReviewRequest-URL: https://reviews.apache.org/r/37893/ X-Sender: "Sean Hsuan-Yi Chu" References: <20150828214218.13583.74485@reviews.apache.org> In-Reply-To: <20150828214218.13583.74485@reviews.apache.org> Reply-To: "Sean Hsuan-Yi Chu" X-ReviewRequest-Repository: drill-git --===============4128759614709225414== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit > On Aug. 28, 2015, 9:42 p.m., Jacques Nadeau wrote: > > Can you please start out by discussing an approach. I need to look at this in more detail but I think you're trying to correct a symptom rather than the root problem. This is generic code (sometimes used for tab delimited, sometimes for comma, somtimes for space or pipe). As such, I don't expect to see a tab character in the common code. If we need to change conditions for this situation, we need to figure out the right way. For example, what if someone uses a space delimiter for fields? It seems like we're going to hit a similar problem. > > > > Additionally, this is extremely performance sensitive code. Before doing submitting any change on this code, you need to do performance testing. I used a 2gb CSV file for performance testing purposes previously. > > Sean Hsuan-Yi Chu wrote: > Let me first explain the symptom. > > The problem happens when we have double-quoted fields in the file. > > If there are whitespaces after the second quote (e.g., “current field” ,next field), > > TextReader will try to ignore them (the if-statement is doing this task). > > However, it is incorrect because those whitespaces could be used as delimiter. > > > > > > My solution: > > Thus, my solution is to add one more condition as follows: > > “if character is whitespace and it is not used as delimiter” > > Besides, I found a new writing which would be more generic and not need to have ‘\t’ in the code: > > if (ch != newLine && ch <= ' ' && ch != delimiter) > > —— > > Now I am doing some performance testing. Once it is done and makes sense, I will give the result and upload the latest patch. I did an experiment and collected statistics here. 1. Data: a csv file with 6 rows and 60 fields per row. 2. Let the two versions of TextReader read the same csv file three times: On average, the running times are 394.68 (the one on master) versus 413.11 (this patch). It is 4% slower. It seems ok? - Sean Hsuan-Yi ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/37893/#review96936 ----------------------------------------------------------- On Aug. 28, 2015, 9:35 p.m., Sean Hsuan-Yi Chu wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/37893/ > ----------------------------------------------------------- > > (Updated Aug. 28, 2015, 9:35 p.m.) > > > Review request for drill, Jacques Nadeau and Mehant Baid. > > > Bugs: DRILL-3718 > https://issues.apache.org/jira/browse/DRILL-3718 > > > Repository: drill-git > > > Description > ------- > > For TSV files, if the TextReader reads a double quote, it would keep scanning until it gets the second double quote. > > However, even getting the second double quote, the current reader will keep going in order to trim the space (i.e., ' '). > > In tsv, there is no need to trim '\t' (tab), which is used to separate fields. > > > Diffs > ----- > > exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java 3899509 > exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java 6b74ecf > exec/java-exec/src/test/resources/store/text/WithQuote.tsv PRE-CREATION > > Diff: https://reviews.apache.org/r/37893/diff/ > > > Testing > ------- > > All > > > Thanks, > > Sean Hsuan-Yi Chu > > --===============4128759614709225414==--