spark-user mailing list archives

From Benjamin Cuthbert <>
Subject hdfs streaming context
Date Mon, 01 Dec 2014 22:41:02 GMT

Is it possible to stream from an HDFS directory and listen for multiple files?

I have tried the following

val sparkConf = new SparkConf().setAppName("HdfsWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = ssc.textFileStream("hdfs://localhost:8020/user/data/*")
lines.filter(line => line.contains("GE"))

But I get

14/12/01 21:35:42 ERROR JobScheduler: Error generating jobs for time 1417469742000 ms
File hdfs://localhost:8020/user/data/* does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(
	at org.apache.hadoop.fs.FileSystem.listStatus(
	at org.apache.hadoop.fs.FileSystem.listStatus(
	at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
	at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)
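For what it's worth, the stack trace suggests `FileInputDStream` is calling `listStatus` on the glob as if it were a literal path. In the Spark 1.x releases of this era, `ssc.textFileStream` expects a plain directory path and monitors it for newly created files, so passing `/user/data/*` appears to be the cause of the "does not exist" error. A minimal sketch of a likely fix, assuming files are moved atomically into `/user/data` (the paths and app structure here are illustrative, not from the original post):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Point textFileStream at the directory itself (no glob);
    // it then picks up every new file that appears under it.
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")

    // filter returns a new DStream; the result must be consumed
    // by an output operation (here: print) or nothing runs.
    val geLines = lines.filter(line => line.contains("GE"))
    geLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note also that the original snippet discards the result of `filter`: DStream transformations are lazy, so without an output operation such as `print()` or `saveAsTextFiles()` no job is ever generated for the filtered stream.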