spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Haviv <daniel.ha...@veracity-group.com>
Subject Re: How to load only the data of the last partition
Date Thu, 17 Nov 2016 19:47:39 GMT
Hi Samy,
If you're working with hive you could create a partitioned table and update
it's partitions' locations to the last version so when you'll query it
using spark, you'll always get the latest version.

Daniel

On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane <samy@dindane.com> wrote:

> Hi,
>
> I have some data partitioned this way:
>
> /data/year=2016/month=9/version=0
> /data/year=2016/month=10/version=0
> /data/year=2016/month=10/version=1
> /data/year=2016/month=10/version=2
> /data/year=2016/month=10/version=3
> /data/year=2016/month=11/version=0
> /data/year=2016/month=11/version=1
>
> When using this data, I'd like to load the last version only of each month.
>
> A simple way to do this is to do `load("/data/year=2016/month=11/version=3")`
> instead of doing `load("/data")`.
> The drawback of this solution is the loss of partitioning information such
> as `year` and `month`, which means it would not be possible to apply
> operations based on the year or the month anymore.
>
> Is it possible to ask Spark to load the last version only of each month?
> How would you go about this?
>
> Thank you,
>
> Samy
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Mime
View raw message