drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Khurram Faraaz <kfar...@maprtech.com>
Subject Re: Review Request 37893: DRILL-3718: TSV reader fails when "" appears
Date Sat, 29 Aug 2015 19:08:34 GMT
Hsuan,

A test with 6 rows and 60 fields per row is not the best test (at least to
evaluate the performance).
Please note a previous comment in this email chain mentions the use of a
2gb CSV file for performance testing purposes previously.

I would suggest that you should run your patch of a larger dataset and
then evaluate performance regressions if any.

On Fri, Aug 28, 2015 at 7:32 PM, Sean Hsuan-Yi Chu <hsuanyi@usc.edu> wrote:

>
>
> > On Aug. 28, 2015, 9:42 p.m., Jacques Nadeau wrote:
> > > Can you please start out by discussing an approach.  I need to look at
> this in more detail but I think you're trying to correct a symptom rather
> than the root problem.  This is generic code (sometimes used for tab
> delimited, sometimes for comma, somtimes for space or pipe).  As such, I
> don't expect to see a tab character in the common code.  If we need to
> change conditions for this situation, we need to figure out the right way.
> For example, what if someone uses a space delimiter for fields?  It seems
> like we're going to hit a similar problem.
> > >
> > > Additionally, this is extremely performance sensitive code.  Before
> doing submitting any change on this code, you need to do performance
> testing.  I used a 2gb CSV file for performance testing purposes previously.
> >
> > Sean Hsuan-Yi Chu wrote:
> >     Let me first explain the symptom.
> >
> >     The problem happens when we have double-quoted fields in the file.
> >
> >     If there are whitespaces after the second quote (e.g., “current
> field”      ,next field),
> >
> >     TextReader will try to ignore them (the if-statement is doing this
> task).
> >
> >     However, it is incorrect because those whitespaces could be used as
> delimiter.
> >
> >
> >
> >
> >
> >     My solution:
> >
> >     Thus, my solution is to add one more condition as follows:
> >
> >     “if character is whitespace and it is not used as delimiter”
> >
> >     Besides, I found a new writing which would be more generic and not
> need to have ‘\t’ in the code:
> >
> >     if (ch != newLine && ch <= ' ' && ch != delimiter)
> >
> >     ——
> >
> >     Now I am doing some performance testing. Once it is done and makes
> sense, I will give the result and upload the latest patch.
>
> I did an experiment and collected statistics here.
>
> 1. Data: a csv file with 6 rows and 60 fields per row.
> 2. Let the two versions of TextReader read the same csv file three times:
> On average, the running times are 394.68 (the one on master) versus 413.11
> (this patch). It is 4% slower. It seems ok?
>
>
> - Sean Hsuan-Yi
>
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37893/#review96936
> -----------------------------------------------------------
>
>
> On Aug. 28, 2015, 9:35 p.m., Sean Hsuan-Yi Chu wrote:
> >
> > -----------------------------------------------------------
> > This is an automatically generated e-mail. To reply, visit:
> > https://reviews.apache.org/r/37893/
> > -----------------------------------------------------------
> >
> > (Updated Aug. 28, 2015, 9:35 p.m.)
> >
> >
> > Review request for drill, Jacques Nadeau and Mehant Baid.
> >
> >
> > Bugs: DRILL-3718
> >     https://issues.apache.org/jira/browse/DRILL-3718
> >
> >
> > Repository: drill-git
> >
> >
> > Description
> > -------
> >
> > For TSV files, if the TextReader reads a double quote, it would keep
> scanning until it gets the second double quote.
> >
> > However, even getting the second double quote, the current reader will
> keep going in order to trim the space (i.e., ' ').
> >
> > In tsv, there is no need to trim '\t' (tab), which is used to separate
> fields.
> >
> >
> > Diffs
> > -----
> >
> >
>  exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java
> 3899509
> >   exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java
> 6b74ecf
> >   exec/java-exec/src/test/resources/store/text/WithQuote.tsv PRE-CREATION
> >
> > Diff: https://reviews.apache.org/r/37893/diff/
> >
> >
> > Testing
> > -------
> >
> > All
> >
> >
> > Thanks,
> >
> > Sean Hsuan-Yi Chu
> >
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message