spark-issues mailing list archives

From "Julian (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming
Date Thu, 08 Feb 2018 13:22:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356929#comment-16356929
] 

Julian commented on SPARK-20568:
--------------------------------

I've started on data ingestion using Structured Streaming, where we will be processing large
amounts of CSV data (later XML via Kafka, at which point I hope to switch to the Kafka structured
streaming source). In short, it is about 6+ GB per minute that we need to process/transform through
Spark. On smaller-scale / user data sets I can understand wanting to keep the input, but on
large-scale ELT/ETL and/or streaming flows we typically only archive the last N hours/days for
recovery purposes; the raw data is simply too large to keep (and the above is just one of the 30
data sources we have already connected, with many more coming). Upstream systems can often
re-push the data as well, so retention is not a problem for every source.

Being able to move the data once it has been processed would be very useful for us. Since nothing
exists for this today, I know I will have to build something myself. I can imagine some simple
"hdfs dfs -mv" commands achieving it, but I do not yet fully understand the relationship between
the input files, the ForeachWriter close() method, and the parallel nature of the HDP cluster;
a rough sketch of the kind of workaround I have in mind is below. I also notice that, at the
moment, if the process dies and restarts it reads all the data again, which would be a disaster
with this much volume. I need to figure that out too.
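
To make the workaround concrete, here is a rough sketch. It assumes a Spark version that exposes
DataStreamWriter.foreachBatch (newer than the 2.1.0 this ticket is filed against; on older versions
something similar would have to be approximated with a ForeachWriter or an external housekeeping
job), and every path and the schema are placeholders rather than anything from our setup. The idea
is to write the transformed micro-batch first and then rename that batch's source files into an
archive directory, with the checkpoint location being what should stop a restarted query from
re-reading everything.

{code:scala}
// Rough sketch only: not the built-in cleanup this ticket asks for.
// Assumes DataStreamWriter.foreachBatch is available; all paths and the schema are placeholders.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-ingest-with-archive").getOrCreate()
import spark.implicits._

val schema = new StructType().add("id", LongType).add("payload", StringType) // placeholder schema

val incoming = spark.readStream
  .option("header", "true")
  .schema(schema)
  .csv("hdfs:///data/incoming")                 // landing directory (placeholder)

val archiveDir = new Path("hdfs:///data/archive") // placeholder

val query = incoming.writeStream
  // The checkpoint is what lets a restarted query resume where it left off
  // instead of re-reading every file from the beginning.
  .option("checkpointLocation", "hdfs:///checkpoints/csv-ingest")
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // 1. Write the transformed batch first.
    batch.write.mode("append").parquet("hdfs:///data/processed")

    // 2. Then move the source files that fed this batch out of the landing directory,
    //    essentially "hdfs dfs -mv" driven from inside the job.
    val files = batch.select(input_file_name()).distinct().as[String].collect()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.mkdirs(archiveDir)
    files.filter(_.nonEmpty).foreach { f =>
      val src = new Path(f)
      fs.rename(src, new Path(archiveDir, src.getName))
    }
  }
  .start()
{code}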

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into Parquet.
> If the JSON files are not deleted after they are processed, it quickly fills up my hard drive.
> I originally [posted this on Stack Overflow|http://stackoverflow.com/q/43671757/907060] and
> was recommended to make a feature request for it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

