spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] dongjoon-hyun commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
Date Fri, 31 Jan 2020 06:02:58 GMT
URL: https://github.com/apache/spark/pull/27398#discussion_r373327221
 
 

 ##########
 File path: docs/monitoring.md
 ##########
 @@ -95,6 +95,44 @@ The history server can be configured as follows:
   </tr>
 </table>
 
+### Applying compaction of old event log files
+
+A long-running streaming application can produce a huge single event log file, which can be costly to maintain and
+also requires a lot of resources to replay on each update in the Spark History Server.
+
+Enabling <code>spark.eventLog.rolling.enabled</code> and <code>spark.eventLog.rolling.maxFileSize</code> would
+let you have multiple event log files instead of a single huge event log file, which may help in some scenarios on its own,
+but it still doesn't reduce the overall size of the logs.
+
+The Spark History Server can apply 'compaction' to the rolling event log files to reduce the overall size of
+the logs, by setting the configuration <code>spark.history.fs.eventLog.rolling.maxFilesToRetain</code> on the
+Spark History Server.
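
For example, the relevant settings might look like this in <code>spark-defaults.conf</code> (the values here are purely illustrative, not recommendations):

```
spark.eventLog.rolling.enabled                      true
spark.eventLog.rolling.maxFileSize                  128m
spark.history.fs.eventLog.rolling.maxFilesToRetain  2
```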
+
+When compaction happens, the History Server lists all the available event log files for the application, and
+considers the event log files older than the retained ones as targets of compaction. For example, if application A has 5 event
+log files and <code>spark.history.fs.eventLog.rolling.maxFilesToRetain</code> is set to 2, then the first 3 log files will be selected for compaction.
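
As an illustrative sketch (this is not Spark's actual implementation), the selection rule above amounts to keeping the newest <code>maxFilesToRetain</code> files and compacting everything older:

```python
# Illustrative sketch of the selection rule, not Spark's actual code.
# `event_log_files` is assumed to be sorted from oldest to newest.
def select_files_to_compact(event_log_files, max_files_to_retain):
    if len(event_log_files) <= max_files_to_retain:
        return []  # nothing older than the retained files
    return event_log_files[:-max_files_to_retain]

# Application A: 5 event log files, maxFilesToRetain = 2
files = ["events_1", "events_2", "events_3", "events_4", "events_5"]
print(select_files_to_compact(files, 2))  # the first 3 files are selected
```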
+
+Once it selects the files, it analyzes them to figure out which events can be excluded, and rewrites them
+into one compact file, discarding those events. Once rewriting is done, the original log files will be deleted.
+
+The compaction tries to exclude events which point to outdated entities, such as finished jobs. As of now, the
+candidates for exclusion are:
+
+* Events for jobs which are finished, and related stage/task events
+* Events for executors which are terminated
+* Events for SQL executions which are finished, and related job/stage/task events
+
+Note that these details may change in future releases.
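
To illustrate the idea (using a hypothetical event representation, not Spark's actual listener-event classes), compaction can be thought of as filtering out events tied to entities that are already finished:

```python
# Hypothetical sketch: events referencing already-finished jobs are dropped,
# while events for still-live jobs are kept in the compacted file.
finished_job_ids = {1, 2}

events = [
    {"type": "SparkListenerJobStart", "job_id": 1},  # job 1 finished -> drop
    {"type": "SparkListenerJobEnd",   "job_id": 1},  # drop
    {"type": "SparkListenerJobStart", "job_id": 3},  # job 3 still live -> keep
]

compacted = [e for e in events if e["job_id"] not in finished_job_ids]
print(compacted)  # only the event for job 3 remains
```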
+
+Please note that the Spark History Server may not compact the old event log files if it determines that not much space
+would be saved by compaction. For streaming queries (including Structured Streaming) we normally expect compaction
+to run, since each micro-batch triggers one or more jobs which finish shortly afterwards, but compaction often won't run
+for batch queries.
+
+Please also note that this is a new feature introduced in Spark 3.0, and may not be completely stable. In some circumstances,
+the compaction may exclude more events than you expect, leading to some UI issues on the History Server for the application.
+Use it with caution.
 
 Review comment:
   Oh. Got it. In that case, you are right. We need to wait.
