spark-user mailing list archives

From Deenar Toraskar <deenar.toras...@gmail.com>
Subject Re: get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")
Date Tue, 27 Oct 2015 15:13:52 GMT
This won't work, as you can never guarantee which files were read by Spark
if some other process is writing files to the same location. It would be
far less work to move the files matching your pattern to a staging location
and then load them from there using sc.textFile. If command line tools like
distcp or mv don't meet your needs, the HDFS FileSystem API offers calls
equivalent to the normal file system operations.
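A minimal sketch of the staging approach, assuming a Spark shell where `sc` is
available and using the Hadoop FileSystem API (`globStatus`, `rename`); the
`path/to/staging/` location and the directory-level glob are placeholders, not
paths from the thread:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// Resolve the glob at the directory level first, so we know exactly
// which folders we are about to process.
val dirs = fs.globStatus(new Path("path/to/dir/*/*"))
  .filter(_.isDirectory)
  .map(_.getPath)

// Move each matched directory into a staging location before reading,
// so files written concurrently to the original location are not
// silently mixed in or lost.
dirs.foreach { dir =>
  fs.rename(dir, new Path("path/to/staging/" + dir.getName))
}

// Read only the staged files; the original glob no longer matches them.
val jsons = sc.textFile("path/to/staging/*/*.js")
```

Note that `rename` on HDFS is a cheap metadata operation within one file
system, and the staged directory names here assume the leaf folder names are
unique; a real job would need to disambiguate them if they are not.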
On 27 Oct 2015 1:49 p.m., "Նարեկ Գալստեան" <ngalstyan4@gmail.com> wrote:

> Dear Spark users,
>
> I am reading a set of json files to compile them to Parquet data format.
> I would like to mark the folders in some way after having read their
> contents so that I do not read them again (e.g. I could change the name
> of the folder).
>
> I use the .textFile("path/to/dir/*/*/*.js") technique to automatically
> detect the files.
> I cannot, however, use the same notation to rename them.
>
> Could you suggest how I can get the names of these folders so that I can
> rename them using native Hadoop libraries?
>
> I am using Apache Spark 1.4.1
>
> I look forward to hearing suggestions!!
>
> yours,
>
> Narek
>
> Նարեկ Գալստյան
>
