spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
Date Thu, 24 Oct 2019 00:43:20 GMT
URL: https://github.com/apache/spark/pull/22952#discussion_r338338952
 
 

 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -546,6 +546,13 @@ Here are the details of all the sources in Spark.
         "s3://a/dataset.txt"<br/>
         "s3n://a/b/dataset.txt"<br/>
         "s3a://a/b/c/dataset.txt"<br/>
+        <code>cleanSource</code>: option to clean up completed files after processing.<br/>
+        Available options are "archive", "delete", "off". If the option is not provided,
the default value is "off".<br/>
+        When "archive" is provided, additional option <code>sourceArchiveDir</code>
must be provided as well. The value of "sourceArchiveDir" must be outside of source path,
to ensure archived files are never included to new source files again.<br/>
+        Spark will move source files respecting its own path. For example, if the path of
source file is "/a/b/dataset.txt" and the path of archive directory is "/archived/here", file
will be moved to "/archived/here/a/b/dataset.txt"<br/>
+        NOTE: Both archiving (via moving) or deleting completed files would introduce overhead
(slow down) in each micro-batch, so you need to understand the cost for each operation in
your file system before enabling this option. On the other hand, enabling this option will
reduce the cost to list source files which is considered as a heavy operation.<br/>
+        NOTE 2: The source path should not be used from multiple sources or queries when
enabling this option, because source files will be moved or deleted which behavior may impact
the other sources and queries.<br/>
 
 Review comment:
   I revisited the comment thoroughly, and yes, you're right: that was the intention. OK to
remove the latter part.
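
   For readers of the diff above, here is a minimal sketch of the documented behavior. It
is an illustration in plain Python, not Spark's implementation. The helper names
`validate_clean_source` and `archived_path` are hypothetical; only the option names
(`cleanSource`, `sourceArchiveDir`), the allowed values, the default, and the example paths
come from the proposed documentation.

```python
from pathlib import PurePosixPath

# Allowed values and default, as stated in the proposed documentation.
VALID_CLEAN_SOURCE = {"archive", "delete", "off"}


def validate_clean_source(options: dict) -> str:
    """Hypothetical helper: check the cleanSource-related options.

    Returns the effective cleanSource value ("off" when not provided).
    """
    mode = options.get("cleanSource", "off")
    if mode not in VALID_CLEAN_SOURCE:
        raise ValueError(f"cleanSource must be one of {sorted(VALID_CLEAN_SOURCE)}")
    # "archive" requires sourceArchiveDir to be provided as well.
    if mode == "archive" and not options.get("sourceArchiveDir"):
        raise ValueError("sourceArchiveDir must be provided when cleanSource is 'archive'")
    return mode


def archived_path(source_file: str, archive_dir: str) -> str:
    """Hypothetical helper: nest the source file's own path under archive_dir."""
    src = PurePosixPath(source_file)
    if not src.is_absolute():
        raise ValueError("expected an absolute source path")
    # Drop the leading "/" so the full source path is preserved below archive_dir.
    return str(PurePosixPath(archive_dir) / src.relative_to("/"))


options = {"cleanSource": "archive", "sourceArchiveDir": "/archived/here"}
if validate_clean_source(options) == "archive":
    print(archived_path("/a/b/dataset.txt", options["sourceArchiveDir"]))
```

   With the example paths from the docs, this prints "/archived/here/a/b/dataset.txt". The
sketch does not check that the archive directory lies outside the source path, which the
docs also require.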
