spark-user mailing list archives

From Akhil Das <>
Subject Re: how to load some of the files in a dir and monitor new file in that dir in spark streaming without missing?
Date Tue, 12 May 2015 11:41:46 GMT
I believe fileStream will pick up the new files (maybe you should increase
the batch duration). You can see the implementation details for finding new
files here
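For reference, the new-file detection the reply points at works by comparing each file's modification time against a remembered threshold from the previous batch. A minimal, self-contained sketch of that idea in Python (an illustration only, not Spark's actual implementation; the `find_new_files` name and the watermark handling are my own):

```python
import os

def find_new_files(directory, last_mod_time):
    """Return files modified after last_mod_time, plus the updated watermark.

    Mimics the spirit of mod-time-based new-file detection done once per
    batch interval (not Spark's actual code).
    """
    new_files, watermark = [], last_mod_time
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        mtime = os.path.getmtime(path)
        if mtime > last_mod_time:  # strictly newer than the previous watermark
            new_files.append(path)
            watermark = max(watermark, mtime)
    return sorted(new_files), watermark
```

Applied to the original question: if you record the watermark before processing the backlog (3.txt, 4.txt), any 5.txt or 6.txt that arrives while the backlog is being processed is still newer than that watermark, so the next poll picks it up instead of missing it. In Spark itself, fileStream also takes a newFilesOnly flag that controls whether files already present in the directory are processed on the first batch.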

Best Regards

On Mon, May 11, 2015 at 6:31 PM, lisendong <> wrote:

> I have one HDFS dir, which contains many files:
> /user/root/1.txt
> /user/root/2.txt
> /user/root/3.txt
> /user/root/4.txt
> and there is a daemon process which adds one file per minute to this dir
> (e.g., 5.txt, 6.txt, 7.txt...).
> I want to start a Spark Streaming job which loads 3.txt and 4.txt and then
> detects all the new files after 4.txt.
> Note that because these files are large, processing them takes a long
> time. So if I process 3.txt and 4.txt before launching the streaming
> task, 5.txt and 6.txt may be produced into this dir while 3.txt and 4.txt
> are still being processed. And when the streaming task starts, 5.txt and
> 6.txt will be missed, because it will only process new files (from 7.txt
> on).
> I'm not sure whether I have described the problem clearly; if you have
> any questions, please ask.
> Sent from the Apache Spark User List mailing list archive.
