spark-reviews mailing list archives

From andreweduffy <...@git.apache.org>
Subject [GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Date Tue, 27 Sep 2016 17:34:35 GMT
Github user andreweduffy commented on the issue:

    https://github.com/apache/spark/pull/14649
  
    Glad that helped, and sorry it wasn't clearer. Agreed that writing summary metadata
isn't always the best choice. This patch only performs the file pruning if the _metadata
file exists for the dataset. At work we have summary metadata enabled, since we have a
query-heavy workload where new data lands occasionally.
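
As a rough illustration of the guard described above (a sketch only, not the patch's actual code), the Python below prunes a file listing only when a `_metadata` entry is present. The names `files_to_scan` and `prune` are hypothetical, and the directory listing is passed in as a plain list of file names rather than read from a file system.

```python
SUMMARY_FILE = "_metadata"


def files_to_scan(listing, prune):
    """Return the data files to read for a dataset directory.

    `listing` is the list of file names in the dataset directory.
    `prune` is a hypothetical callback that takes the sorted data files
    and returns the subset worth reading.
    """
    # Names starting with "_" (e.g. _metadata, _common_metadata) are
    # treated as non-data files in this sketch.
    data_files = sorted(name for name in listing if not name.startswith("_"))

    # Pruning is attempted only when the _metadata summary file exists;
    # otherwise every data file is scanned.
    if SUMMARY_FILE in listing:
        return prune(data_files)
    return data_files
```

A dataset that has only `_common_metadata`, or no summary file at all, falls through to the full scan in this sketch.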
    
    On Tue, Sep 27, 2016 at 10:13 AM -0700, "Cheng Lian" <notifications@github.com> wrote:
    
    @andreweduffy Thanks for the explanations! This makes much more sense to me now.
    
    Although _metadata can be neat for the read path, it's a troublemaker for the write path:

    - Writing summary files (either _metadata or _common_metadata) can be quite expensive when
      writing a large Parquet dataset, since it reads the footers of all files and tries to merge
      them. This can be especially frustrating when appending a small amount of data to an
      existing large dataset.
    - Parquet doesn't always write the summary files, even if you explicitly set
      parquet.enable.summary-metadata to true. For example, when two files have different values
      for the same key in the user-defined key/value metadata section, Parquet simply gives up
      writing the summary files and deletes the existing ones. This may be quite common in the
      case of schema evolution. To make matters worse, an outdated _common_metadata file might
      not be deleted properly due to PARQUET-359, which leaves the summary files out of sync.
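
The conflict case in the second point can be sketched as follows. This is a simplified Python model, not Parquet's actual merge code: `merge_key_value_metadata` is a hypothetical helper, and each footer is reduced to a dict of its user-defined key/value metadata.

```python
def merge_key_value_metadata(footers):
    """Merge per-file key/value metadata into one summary dict.

    `footers` is a list of dicts, one per data file. Returns the merged
    dict, or None when two files disagree on the value of a key, which
    models the case where no summary file gets written.
    """
    merged = {}
    for key_values in footers:
        for key, value in key_values.items():
            if key in merged and merged[key] != value:
                # Conflicting values for the same key: give up on the summary.
                return None
            merged[key] = value
    return merged
```

Under this model, files whose metadata agrees merge cleanly, while a single differing value (for example, a schema string that changed during schema evolution) is enough to make the merge return nothing.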
    
    However, I agree that with an existing, trustworthy _metadata file at hand, this
patch is still very useful. I'll take a deeper look at this tomorrow.