drill-issues mailing list archives

From "Tobias (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5358) Error if Parquet file changes during query
Date Thu, 16 Mar 2017 08:22:41 GMT
Tobias created DRILL-5358:

             Summary: Error if Parquet file changes during query
                 Key: DRILL-5358
                 URL: https://issues.apache.org/jira/browse/DRILL-5358
             Project: Apache Drill
          Issue Type: Bug
          Components: Metadata, Storage - Parquet
    Affects Versions: 1.9.0
            Reporter: Tobias

We have a scenario where we generate our own Parquet files every X seconds.
These files are laid out in a date-based directory structure, and only the file for today gets updated.

The process is as follows:

1. Generate the Parquet file in a temp directory.
2. When generation finishes, mv the file into a Drill workspace (data/2017/03/10/data.parquet).
3. Restart the process.
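
The steps above could be sketched roughly as follows. This only illustrates the publish step and is not the reporter's actual tooling: `publish`, `payload`, and `tmp_dir` are hypothetical names, and the payload stands in for a real Parquet file. It assumes the temp directory and the workspace sit on the same filesystem, where a rename replaces the target in a single step.

```python
import os
import tempfile


def publish(payload: bytes, tmp_dir: str, final_path: str) -> None:
    """Write payload to a temp file, then move it over final_path.

    Hypothetical helper: assumes tmp_dir and final_path are on the same
    filesystem, so os.replace swaps the file in one rename.
    """
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(os.path.dirname(final_path) or ".", exist_ok=True)

    # Step 1: generate the file in the temp directory.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir, suffix=".parquet")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Step 2: move the finished file into the workspace path.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Replacing the file in one step does not by itself avoid the error in this report: a query planned against the previous file would still hold that file's length.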

We have noticed that if the file is moved in after a query has started running,
the query throws an error saying that the Parquet magic number is incorrect.
This is due to the file length being cached and reused. Basically, what seems to happen is:

1. Drill plans the query.
2. The file gets changed under Drill's feet.
3. Drill executes the query and tries to read at an incorrect offset in the changed file.
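
To make the sequence above concrete, here is a small simulation, not Drill's code. A Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, so a reader finds the footer by seeking relative to the end of the file. The helpers `fake_parquet` and `read_footer` are hypothetical and only mimic that layout with in-memory bytes. The sketch assumes, as the report suggests, that the length recorded at planning time is reused at execution time.

```python
import struct

MAGIC = b"PAR1"


def fake_parquet(body: bytes, footer: bytes) -> bytes:
    """Lay out bytes like a Parquet file: magic, body, footer, footer length, magic."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(data: bytes, assumed_length: int) -> bytes:
    """Locate the footer using assumed_length instead of the real size of data."""
    # The last 8 bytes should be the footer length (4 bytes) and the magic (4 bytes).
    tail = data[assumed_length - 8:assumed_length]
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("parquet magic number is incorrect")
    (footer_len,) = struct.unpack("<I", tail[:4])
    footer_start = assumed_length - 8 - footer_len
    return data[footer_start:assumed_length - 8]


# 1. "Plan": remember the length of the file as it exists now.
old_file = fake_parquet(b"a" * 100, b"old-footer")
cached_length = len(old_file)

# 2. The file is replaced by a larger one while the query is in flight.
new_file = fake_parquet(b"b" * 250, b"new-footer!!")

# 3. "Execute": reading new_file with cached_length lands inside the body,
#    so the magic check fails; reading with len(new_file) succeeds.
```

Calling `read_footer(new_file, cached_length)` raises the magic-number error, while `read_footer(new_file, len(new_file))` returns the new footer.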

Is there any way to fix this or avoid this scenario?
Another side effect of constantly generating a new file is that the metadata cache gets discarded
for the whole workspace even though only one file changed.
Is there a way to avoid that?
