spark-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <>
Subject [jira] [Commented] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.
Date Tue, 23 Jan 2018 17:55:00 GMT


Marcelo Vanzin commented on SPARK-21571:

Fixed with the patch for SPARK-20664 (which includes the code in the PR sent for this bug).

> Spark history server leaves incomplete or unreadable history files around forever.
> ----------------------------------------------------------------------------------
>                 Key: SPARK-21571
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Eric Vandenberg
>            Assignee: Eric Vandenberg
>            Priority: Minor
>             Fix For: 2.3.0
> We have noticed that history server logs are sometimes never cleaned up. The current
history server logic *ONLY* cleans up history files that are complete, since in general
it doesn't make sense to clean up in-progress history files (after all, the job is presumably
still running). Note that in-progress history files would generally not be targeted for clean-up
anyway, assuming jobs regularly flush their logs and the file system accurately updates the
log's last-modified time and size; while that is likely, it is not guaranteed behavior.
> As a consequence of the current clean-up logic, combined with unclean shutdowns,
various file system bugs, earlier Spark bugs, etc., we have accumulated thousands of these
dead history files associated with long-gone jobs.
> For example (with spark.history.fs.cleaner.maxAge=14d):
> -rw-rw----   3 xxxxxx         ooooooo   14382 2016-09-13 15:40 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq1974_ppppppppppp-8812_110586000000195_dev4384_jjjjjjjjjjjj-53982.zstandard
> -rw-rw----   3 xxxx           ooooooo    5933 2016-11-01 20:16 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq2016_ppppppppppp-8812_126507000000673_dev5365_jjjjjjjjjjjj-65313.lz4
> -rw-rw----   3 yyy            ooooooo       0 2017-01-19 11:59 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0057_zzzz326_mmmmmmmmm-57863.lz4.inprogress
> -rw-rw----   3 xxxxxxxxx      ooooooo       0 2017-01-19 14:17 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0063_zzzz688_mmmmmmmmm-33246.lz4.inprogress
> -rw-rw----   3 yyy            ooooooo       0 2017-01-20 10:56 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1030_zzzz326_mmmmmmmmm-45195.lz4.inprogress
> -rw-rw----   3 xxxxxxxxxxxx   ooooooo   11955 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1314_wwww54_kkkkkkkkkkkkkk-64671.lz4.inprogress
> -rw-rw----   3 xxxxxxxxxxxx   ooooooo   11958 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1315_wwww1667_kkkkkkkkkkkkkk-58968.lz4.inprogress
> -rw-rw----   3 xxxxxxxxxxxx   ooooooo   11960 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1316_wwww54_kkkkkkkkkkkkkk-48058.lz4.inprogress
> Based on the current logic, clean-up candidates are skipped in several cases:
> 1. if a file has 0 bytes, it is completely ignored
> 2. if a file is in progress and not parseable (can't extract an appID), it is completely ignored
> 3. if a file is complete but not parseable (can't extract an appID), it is completely ignored
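The three skipped cases above can be sketched as a single decision function. This is only an illustrative sketch of the proposal, not the actual history server code: `LogFile`, `parseAppId`, and `shouldClean` are hypothetical names, with `parseAppId` standing in for the server's event-log parsing.

```scala
// Sketch: how an "aggressive" flag could change the clean-up decision
// for a log file that has already exceeded cleaner.maxAge.
case class LogFile(path: String, sizeBytes: Long, lastModifiedMs: Long)

// Stand-in for real parsing: pretend the appID is extractable
// whenever the file name contains "app".
def parseAppId(log: LogFile): Option[String] =
  if (log.path.contains("app")) Some(log.path) else None

def shouldClean(log: LogFile, nowMs: Long, maxAgeMs: Long,
                aggressive: Boolean): Boolean = {
  val expired = nowMs - log.lastModifiedMs > maxAgeMs
  val inProgress = log.path.endsWith(".inprogress")
  val parseable = parseAppId(log).isDefined
  if (!expired) {
    false                // younger than maxAge: never cleaned
  } else if (log.sizeBytes == 0L) {
    aggressive           // case (1): today, skipped outright
  } else if (!parseable) {
    aggressive           // cases (2) and (3): appID not extractable
  } else {
    !inProgress          // current behavior: only complete logs are cleaned
  }
}
```

With `aggressive = false` the function reproduces the current behavior (zero-byte, unparseable, and in-progress files survive forever); with `aggressive = true` the expired orphans in cases (1)-(3) become eligible.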
> To address this edge case and provide a way to clean out orphaned history files I propose
a new configuration option:
> spark.history.fs.cleaner.aggressive={true, false}, default is false.
> If true, the history server will more aggressively garbage collect history files in cases
(1), (2) and (3).  Since the default is false, existing customers won't be affected unless
they explicitly opt in.  Customers who have accumulated similar garbage over time would have
the option of cleaning it up aggressively.  Also note that aggressive clean-up may
not be appropriate for some customers if they have long-running jobs that exceed the cleaner.maxAge
time frame and/or have buggy file systems.
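Opting in would use the normal Spark configuration mechanism, e.g. in spark-defaults.conf. Note the `aggressive` key is the proposal in this issue, not an option that exists in released Spark; the maxAge value matches the example above:

```
# spark-defaults.conf (proposed -- `aggressive` is hypothetical)
spark.history.fs.cleaner.enabled     true
spark.history.fs.cleaner.maxAge      14d
spark.history.fs.cleaner.aggressive  true
```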
> Would like to get feedback on whether this seems like a reasonable solution.

This message was sent by Atlassian JIRA

