Subject: Re: SQLContext load. Filtering files
From: Masf
To: Akhil Das
Cc: user@spark.apache.org
Date: Thu, 27 Aug 2015 12:51:35 +0200

Thanks Akhil, I will have a look.

I have a doubt regarding Spark Streaming and fileStream: if Spark Streaming
crashes and new files are created in the input folder while it is down, how
can I process those files when Spark Streaming is launched again?

Thanks.
Regards.
Miguel.

On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das wrote:

> Have a look at Spark Streaming. You can make use of ssc.fileStream.
>
> Eg:
>
> val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
>   AvroKeyInputFormat[GenericRecord]](input)
>
> You can also specify a filter function as the second argument.
>
> Thanks
> Best Regards
>
> On Wed, Aug 19, 2015 at 10:46 PM, Masf wrote:
>
>> Hi.
>>
>> I'd like to read Avro files using this library:
>> https://github.com/databricks/spark-avro
>>
>> I need to load several files from a folder, not all of them. Is there
>> some functionality to filter the files to load?
>>
>> And... is it possible to know the names of the files loaded from a
>> folder?
>>
>> My problem is that I have a folder where an external process is inserting
>> files every X minutes; I need to process these files only once, and I
>> can't move, rename or copy the source files.
>>
>> Thanks
>> --
>> Regards
>> Miguel Ángel

--
Regards,
Miguel Ángel
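[Editor's note] A minimal sketch of the fileStream filter-function usage Akhil
describes above, assuming Spark Streaming 1.x and the Avro MapReduce input
format. The directory "/data/incoming", the 60-second batch interval, and the
".avro"-suffix filter are illustrative, not taken from the thread:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.NullWritable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object AvroFileStreamSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("AvroFileStreamSketch")
        val ssc = new StreamingContext(sparkConf, Seconds(60))

        // Illustrative filter: only pick up files whose names end in ".avro",
        // skipping anything else the external process may drop in the folder.
        val onlyAvro = (path: Path) => path.getName.endsWith(".avro")

        // The filter is the second argument; newFilesOnly = true restricts the
        // stream to files that appear after the streaming context starts.
        val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
          AvroKeyInputFormat[GenericRecord]]("/data/incoming", onlyAvro, newFilesOnly = true)

        // Each record arrives as (AvroKey[GenericRecord], NullWritable).
        avroStream.map { case (key, _) => key.datum().toString }.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Setting newFilesOnly to false is one way fileStream can also pick up files
already present in the directory when the context starts, though how far back
it looks is bounded by the stream's remember window, so it is not by itself a
complete answer to the crash-recovery question asked above.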