drill-dev mailing list archives

From ppadma <...@git.apache.org>
Subject [GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...
Date Mon, 20 Nov 2017 06:52:25 GMT
Github user ppadma commented on the issue:

    @arina-ielchiieva I am concerned about the performance impact of grouping all splits in a
single reader (essentially, not parallelizing at all).
    I am wondering if it is possible to do it this way:
    During planning, in HiveScan, if it is a text file and has a header/footer, get the number
of rows to skip. Read the header/footer rows and, based on that, adjust the first/last split
and the offset within them. The splits which have only header/footer rows can be removed from
inputSplits. In HiveSubScan, change hiveReadEntry to be a list (one entry for each split).
Add an entry in hiveReadEntry, numRowsToSkip (or offsetToStart), which can be passed to the
recordReaders in getBatch for each subScan. This is fairly complicated and I am sure I might
be missing some details :-)
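
    A rough sketch of the planning-time bookkeeping described above, not actual Drill code.
It assumes the number of rows in each split is already known, which is a simplification:
real splits are byte ranges, so the planner would first have to read the header/footer rows
to find the offsets. SplitSkipPlanner, SplitEntry, rowsToSkipAtStart and rowsToDropAtEnd are
hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSkipPlanner {

  /** Hypothetical per-split entry: which split to read and how many rows to drop at each end. */
  public static final class SplitEntry {
    public final int splitIndex;
    public final long rowsToSkipAtStart;
    public final long rowsToDropAtEnd;

    public SplitEntry(int splitIndex, long rowsToSkipAtStart, long rowsToDropAtEnd) {
      this.splitIndex = splitIndex;
      this.rowsToSkipAtStart = rowsToSkipAtStart;
      this.rowsToDropAtEnd = rowsToDropAtEnd;
    }
  }

  /**
   * Assumption: rowsPerSplit[i] is the row count of split i, in file order.
   * Returns one entry per split that still has data rows after the header
   * and footer rows are accounted for.
   */
  public static List<SplitEntry> plan(long[] rowsPerSplit, long headerRows, long footerRows) {
    int n = rowsPerSplit.length;
    long[] skipStart = new long[n];
    long[] dropEnd = new long[n];

    // Consume header rows from the first splits onward.
    long remaining = headerRows;
    for (int i = 0; i < n && remaining > 0; i++) {
      long take = Math.min(remaining, rowsPerSplit[i]);
      skipStart[i] = take;
      remaining -= take;
    }

    // Consume footer rows from the last splits backward,
    // never touching rows already counted as header.
    remaining = footerRows;
    for (int i = n - 1; i >= 0 && remaining > 0; i--) {
      long available = rowsPerSplit[i] - skipStart[i];
      long take = Math.min(remaining, available);
      dropEnd[i] = take;
      remaining -= take;
    }

    // Keep only splits that still contain at least one data row.
    List<SplitEntry> entries = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      if (skipStart[i] + dropEnd[i] < rowsPerSplit[i]) {
        entries.add(new SplitEntry(i, skipStart[i], dropEnd[i]));
      }
    }
    return entries;
  }
}
```

    Each surviving entry could then be handed to its own reader, so the splits stay
parallelizable. The footer case is the awkward one: a reader can only drop trailing rows if it
knows where its split ends, which is why a byte offset might be easier to pass than a row count.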

