spark-reviews mailing list archives

From liancheng <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-8838][SQL] Add config to enable/disable...
Date Wed, 29 Jul 2015 16:58:22 GMT
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/7238#issuecomment-126017537
  
    Hey @viirya, sorry that I lied: this might not be the FINAL check yet...
    
    Actually, I've begun to regret one of my previous decisions, namely merging part-files which
don't have corresponding summary files.  This is mostly because there are too many cases to
consider if we assume summary files may be missing, and it makes the behavior of this
configuration pretty unintuitive: sometimes no part-files are merged, while sometimes some
part-files get merged.  Since Parquet summary files can be missing under various corner cases,
the behavior is hard to explain and may confuse Spark users.  The key problem here is that
Parquet summary files are not written/accessed in an atomic manner, and that's one of the most
important reasons why the Parquet team is actually trying to get rid of the summary files
entirely.
    
    Since the configuration is named "respectSummaryFiles", it seems more natural and intuitive
to assume that summary files are ALWAYS properly generated for ALL Parquet write jobs when
this configuration is turned on.  To be more specific, given one or more Parquet input paths,
we may find one or more summary files, and the metadata gathered by merging all of these
summary files should reflect the real schema of the given Parquet dataset.  Only in this case
can we really "respect" the existing summary files.
    
    So my suggestion here is that, when the "respectSummaryFiles" configuration is turned
on, we only collect all summary files, merge the schemas read from them, and just use the
merged schema as the final result schema.  And of course, this configuration should still be
turned off by default.  We can document this configuration with an "expert only" tag.
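    
    To make this concrete, schema discovery could then look roughly like the sketch below.
This is just an illustration built on public parquet-mr APIs, not the actual implementation;
the `mergedSchemaFromSummaries` helper and its input handling are made up for the example:
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
    import org.apache.parquet.schema.MessageType
    
    // Sketch only: collect the summary files under each input directory, read their
    // footers, and union the schemas.  Part-files are never touched.
    def mergedSchemaFromSummaries(conf: Configuration, inputDirs: Seq[Path]): Option[MessageType] = {
      val summaryFiles = inputDirs.flatMap { dir =>
        val fs = dir.getFileSystem(conf)
        Seq(ParquetFileWriter.PARQUET_METADATA_FILE, ParquetFileWriter.PARQUET_COMMON_METADATA_FILE)
          .map(name => new Path(dir, name))
          .filter(path => fs.exists(path))
      }
      val schemas = summaryFiles.map { file =>
        ParquetFileReader.readFooter(conf, file).getFileMetaData.getSchema
      }
      // MessageType.union merges two Parquet schemas and fails on incompatible columns.
      schemas.reduceOption(_ union _)
    }
    ```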
    
    I still consider this configuration quite useful, because even if you have a dirty Parquet
dataset at hand, without summary files or with incorrect ones, you can still repair the
summary files quite easily.  Essentially you only need to call `ParquetOutputFormat.writeMetaDataFile`,
which either generates correct summary files for the entire dataset or deletes ill-formed
summary files if it fails to merge all user-defined key-value metadata.
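    
    For reference, such a repair could look roughly like the following sketch.  It relies on
the parquet-mr static helpers `ParquetFileReader.readAllFootersInParallel` and
`ParquetFileWriter.writeMetadataFile` (the exact entry point may differ across Parquet
versions), and the dataset path is a placeholder:
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
    
    val conf = new Configuration()
    val dataset = new Path("hdfs:///path/to/parquet/dataset")  // placeholder path
    val fs = dataset.getFileSystem(conf)
    
    // Read the footers of all part-files under the dataset root.
    val footers = ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(dataset))
    
    try {
      // Merges all footers (schema plus user-defined key-value metadata) and writes
      // fresh summary files at the dataset root.
      ParquetFileWriter.writeMetadataFile(conf, dataset, footers)
    } catch {
      case _: Exception =>
        // If the key-value metadata can't be merged, drop any stale or ill-formed
        // summary files rather than leaving misleading ones behind.
        fs.delete(new Path(dataset, ParquetFileWriter.PARQUET_METADATA_FILE), false)
        fs.delete(new Path(dataset, ParquetFileWriter.PARQUET_COMMON_METADATA_FILE), false)
    }
    ```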
    
    What do you think?  Again, sorry for the late review and the extra effort of implementing
all those intermediate versions...


