arrow-user mailing list archives

From "Bourgon, Michael" <mbour...@changehealthcare.com>
Subject RE: (External Email) Re: Trying to partition for AWS Athena - the partition name can't be a field, but that's how Arrow does it?
Date Thu, 01 Aug 2019 20:05:56 GMT
AH!  No, I missed that in the documentation.  I was making sure I had everything right before
I actually did it.  Let me go play with it some more - I'll comment again if there are issues
with the partition columns not being removed. Thanks for the quick response!

-----Original Message-----
From: Wes McKinney <wesmckinn@gmail.com> 
Sent: Wednesday, July 31, 2019 5:04 PM
To: user@arrow.apache.org
Subject: (External Email) Re: Trying to partition for AWS Athena - the partition name can't be a field, but that's how Arrow does it?

Are you getting an error of some kind running this code?
write_to_dataset is supposed to remove the partition columns from each fragment that is written
to the dataset.

On Wed, Jul 31, 2019 at 4:28 PM Bourgon, Michael <mbourgon@changehealthcare.com> wrote:
>
> (total newb – longtime SQL Server person, but new to Python & Arrow)
>
> I’m trying to export a dataset for use with AWS Athena. As part of that, I want to partition it.
>
>
> In order to partition by day & hour (I only have a datetime aka TIMESTAMP), in my SELECT I computed two columns (a date and an hour), passed them to the Table, then used them as the partition_cols.  So far, so good.
>
>
> But reading through the Athena documentation, you can’t create the partitions on fields that exist within the data.
>
>
> https://docs.aws.amazon.com/athena/latest/ug/create-table.html
>
> “Partitioned columns don't exist within the table data itself. If you use a value for col_name that is the same as a table column, you get an error. For more information, see Partitioning Data.”
>
>
> So, I’m stumped.  Short of explicitly pointing each partition at a folder, is there a way to do this with Arrow?
>
>
>
> Thanks!
>
> import pandas as pd
> import pypyodbc
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> con_string = ('Driver={SQL Server};'
>               'Server=myserver;'
>               'Database=mydb;'
>               'App=myname;'  # It's not application name!
>               'Trusted_Connection=yes')
>
> cnxn = pypyodbc.connect(con_string)
>
> query = """
> SELECT *,
>        convert(date, submitted_datetime) as subdate,
>        datepart(hour, submitted_datetime) as subhour
> FROM mytable
> WHERE submitted_datetime >= '20190720' and submitted_datetime < '20190723'
> """
>
> result_port_map = pd.read_sql(query, cnxn)
>
> table = pa.Table.from_pandas(result_port_map)
>
> pq.write_to_dataset(table, root_path='mytable',
>                     partition_cols=['subdate', 'subhour'])