spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Directory / File Reading Patterns
Date Sun, 18 Jan 2015 08:47:06 GMT
I think that putting part of the data only in a filename is an
anti-pattern, but we sometimes have to play it where it lies.

You can list all the directory paths containing the CSV files, map each
one to an RDD with textFile, transform each RDD to include the info from
its path, and then simply union them.

Performance-wise this should even be pretty reasonable.
On Jan 17, 2015 11:48 PM, "Steve Nunez" <snunez@hortonworks.com> wrote:

>  Hello Users,
>
>  I’ve got a real-world use case that seems common enough that its pattern
> would be documented somewhere, but I can’t find any references to a simple
> solution. The challenge is that data is getting dumped into a directory
> structure, and that directory structure itself contains features that I
> need in my model. For example:
>
>  bank_code
> Trader
> Day-1.csv
> Day-2.csv
> …
>
>  Each CSV file contains a list of all the trades made by that trader on
> that day. The problem is that the bank & trader should be part of the
> feature set, i.e. we need the RDD to look like:
> (bank, trader, day, <list-of-trades>)
>
>  Anyone got any elegant solutions for doing this?
>
>  Cheers,
> - SteveN
