drill-dev mailing list archives

From paul-rogers <...@git.apache.org>
Subject [GitHub] drill pull request #824: DRILL-3867: Store relative paths in metadata file
Date Sun, 18 Jun 2017 22:54:33 GMT
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/824#discussion_r122602595
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -264,15 +275,18 @@ private ParquetTableMetadata_v3 getParquetTableMetadata(List<FileStatus> fileSta
       /**
        * Get a list of file metadata for a list of parquet files
        *
    -   * @param fileStatuses
    -   * @return
    +   * @param parquetTableMetadata_v3 can store column schema info from all the files and row groups
    +   * @param fileStatuses list of the parquet files statuses
    +   * @param absolutePathInMetadata true if result metadata files should contain absolute paths, false for relative paths.
    +   *                               Relative paths in the metadata are only necessary while creating meta cache files.
    +   * @return list of the parquet file metadata (parquet metadata for every file)
        * @throws IOException
        */
    -  private List<ParquetFileMetadata_v3> getParquetFileMetadata_v3(
    -      ParquetTableMetadata_v3 parquetTableMetadata_v3, List<FileStatus> fileStatuses) throws IOException {
    +  private List<ParquetFileMetadata_v3> getParquetFileMetadata_v3(ParquetTableMetadata_v3 parquetTableMetadata_v3,
    +      List<FileStatus> fileStatuses, boolean absolutePathInMetadata) throws IOException {
    --- End diff --
    
    Is this really needed? Or, is it an attempt to answer my earlier concern about compatibility?
    
    Only newer Drill instances will create metadata. If we want relative paths, then we should always use relative paths. No need to pass along a flag.
    
    On the other hand, if we are saying that the root call is absolute (as seen in the code earlier), but subdirectories are relative, then doesn't the presence of even one absolute directory name make the whole feature invalid?
    
    Perhaps some more background explanation in the PR comments (or even a design spec) might shed some light on what we are trying to accomplish here. It is very hard to reverse engineer a design from code changes alone.
    
    Also, below, we have a method to convert relative paths to absolute in bulk. Should we do the same here? Always gather data in absolute form, then convert it to relative just before serializing?
    
    I wasn't sure why we are converting paths from relative to absolute. If we are doing that because we use absolute paths internally, then it is OK to gather absolute paths here. Convert them to relative just before writing if that is easier.
    
    Here, I'm referring to the note about the "proposed alternative solution".
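
To make the alternative concrete, here is a minimal sketch of "gather absolute, convert in bulk at the serialization boundary". It is not code from the PR or from Drill's `Metadata` class: the class `RelativePathSketch` and the methods `toRelativePaths` and `toAbsolutePaths` are hypothetical names, and plain strings stand in for Hadoop `Path` objects to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Drill's Metadata class: plain strings stand in
// for Hadoop Path objects, and every name here is invented for illustration.
class RelativePathSketch {

  // Normalize the table root so a prefix check cannot match "/data/t2" against "/data/t".
  private static String withTrailingSlash(String root) {
    return root.endsWith("/") ? root : root + "/";
  }

  // Intended to be called once, just before the metadata is serialized.
  static List<String> toRelativePaths(String tableRoot, List<String> absolutePaths) {
    String prefix = withTrailingSlash(tableRoot);
    List<String> relative = new ArrayList<>();
    for (String path : absolutePaths) {
      if (!path.startsWith(prefix)) {
        throw new IllegalArgumentException(path + " is not under " + tableRoot);
      }
      relative.add(path.substring(prefix.length()));
    }
    return relative;
  }

  // Intended to be called once, right after the metadata file is read back.
  static List<String> toAbsolutePaths(String tableRoot, List<String> relativePaths) {
    String prefix = withTrailingSlash(tableRoot);
    List<String> absolute = new ArrayList<>();
    for (String path : relativePaths) {
      absolute.add(prefix + path);
    }
    return absolute;
  }
}
```

With this shape, the collection code would only ever see absolute paths and would not need an `absolutePathInMetadata` flag; the relative form would exist only in the file on disk.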

