arrow-user mailing list archives

From "Bourgon, Michael" <mbour...@changehealthcare.com>
Subject Trying to partition for AWS Athena - the partition name can't be a field, but that's how Arrow does it?
Date Wed, 31 Jul 2019 21:28:46 GMT
(total newb - longtime SQL Server person, but new to Python & Arrow)

I'm trying to export a dataset for use with AWS Athena. As part of that, I want to partition
it.

In order to partition by day & hour (the table only has a single datetime, aka TIMESTAMP, column), I computed two extra columns in my SELECT (a date and an hour), carried them through into the Arrow Table, then used them as the partition_cols.  So far, so good.

But according to the Athena documentation, you can't create partitions on columns that already exist within the table data itself.

https://docs.aws.amazon.com/athena/latest/ug/create-table.html
"Partitioned columns don't exist within the table data itself. If you use a value for col_name
that is the same as a table column, you get an error. For more information, see Partitioning
Data."


So, I'm stumped.  Short of explicitly pointing each partition at a folder (a sketch of what I mean follows), is there a way to do this with Arrow?
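
For concreteness, here's the kind of manual loop I'm hoping to avoid -- untested sketch, with a toy DataFrame (hypothetical claim_id column) standing in for the real query result further down:

    import os
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # toy stand-in for the real query result
    df = pd.DataFrame({'claim_id': [1, 2, 3],
                       'subdate': ['2019-07-20', '2019-07-20', '2019-07-21'],
                       'subhour': [0, 1, 0]})

    # one folder per partition value, partition columns dropped from the data
    for (subdate, subhour), group in df.groupby(['subdate', 'subhour']):
        path = 'mytable/subdate={0}/subhour={1}'.format(subdate, subhour)
        os.makedirs(path, exist_ok=True)
        body = group.drop(columns=['subdate', 'subhour'])
        pq.write_table(pa.Table.from_pandas(body), path + '/part-0.parquet')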

Thanks!


import pandas as pd
import pypyodbc
import pyarrow as pa
import pyarrow.parquet as pq

con_string = ('Driver={SQL Server};'
              'Server=myserver;'
              'Database=mydb;'
              'App=myname;'  # the ODBC keyword is App, not Application Name!
              'Trusted_Connection=yes')
cnxn = pypyodbc.connect(con_string)

# Compute the partition columns (date and hour) in the SELECT, since the
# source table only has the single submitted_datetime TIMESTAMP column.
query = """
SELECT *,
       convert(date, submitted_datetime) AS subdate,
       datepart(hour, submitted_datetime) AS subhour
FROM mytable
WHERE submitted_datetime >= '20190720' AND submitted_datetime < '20190723'
"""
result_port_map = pd.read_sql(query, cnxn)

table = pa.Table.from_pandas(result_port_map)

# Write one folder per subdate/subhour combination.
pq.write_to_dataset(table, root_path='mytable',
                    partition_cols=['subdate', 'subhour'])
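
And in case it matters, this is how I'd check whether write_to_dataset leaves subdate/subhour inside the written files or only in the folder names (sketch, assuming the write above ran; pq.read_schema reads a file's schema without loading the data):

    import glob
    import pyarrow.parquet as pq

    # look at one of the files write_to_dataset just produced
    sample = glob.glob('mytable/subdate=*/subhour=*/*.parquet')[0]

    # if subdate/subhour are absent here, the partition values live only
    # in the directory names
    print(pq.read_schema(sample).names)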
