drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Hsuan-Yi Chu" <hsua...@usc.edu>
Subject Re: Review Request 37893: DRILL-3718: TSV reader fails when "" appears
Date Sat, 29 Aug 2015 02:32:37 GMT


> On Aug. 28, 2015, 9:42 p.m., Jacques Nadeau wrote:
> > Can you please start out by discussing an approach.  I need to look at this in more
detail but I think you're trying to correct a symptom rather than the root problem.  This
is generic code (sometimes used for tab delimited, sometimes for comma, somtimes for space
or pipe).  As such, I don't expect to see a tab character in the common code.  If we need
to change conditions for this situation, we need to figure out the right way.  For example,
what if someone uses a space delimiter for fields?  It seems like we're going to hit a similar
problem.
> > 
> > Additionally, this is extremely performance sensitive code.  Before doing submitting
any change on this code, you need to do performance testing.  I used a 2gb CSV file for performance
testing purposes previously.
> 
> Sean Hsuan-Yi Chu wrote:
>     Let me first explain the symptom.
>     
>     The problem happens when we have double-quoted fields in the file.
>     
>     If there are whitespaces after the second quote (e.g., “current field”      ,next
field),
>     
>     TextReader will try to ignore them (the if-statement is doing this task).
>     
>     However, it is incorrect because those whitespaces could be used as delimiter. 
>     
>     
>     
>     
>     
>     My solution:
>     
>     Thus, my solution is to add one more condition as follows:
>     
>     “if character is whitespace and it is not used as delimiter” 
>     
>     Besides, I found a new writing which would be more generic and not need to have ‘\t’
in the code:
>     
>     if (ch != newLine && ch <= ' ' && ch != delimiter)   
>     
>     ——
>     
>     Now I am doing some performance testing. Once it is done and makes sense, I will
give the result and upload the latest patch.

I did an experiment and collected statistics here. 

1. Data: a csv file with 6 rows and 60 fields per row.
2. Let the two versions of TextReader read the same csv file three times: On average, the
running times are 394.68 (the one on master) versus 413.11 (this patch). It is 4% slower.
It seems ok?


- Sean Hsuan-Yi


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37893/#review96936
-----------------------------------------------------------


On Aug. 28, 2015, 9:35 p.m., Sean Hsuan-Yi Chu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37893/
> -----------------------------------------------------------
> 
> (Updated Aug. 28, 2015, 9:35 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Mehant Baid.
> 
> 
> Bugs: DRILL-3718
>     https://issues.apache.org/jira/browse/DRILL-3718
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> For TSV files, if the TextReader reads a double quote, it would keep scanning until it
gets the second double quote.
> 
> However, even getting the second double quote, the current reader will keep going in
order to trim the space (i.e., ' '). 
> 
> In tsv, there is no need to trim '\t' (tab), which is used to separate fields.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java
3899509 
>   exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java 6b74ecf 
>   exec/java-exec/src/test/resources/store/text/WithQuote.tsv PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/37893/diff/
> 
> 
> Testing
> -------
> 
> All
> 
> 
> Thanks,
> 
> Sean Hsuan-Yi Chu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message