arrow-user mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: pyarrow read_csv with different amount of columns per row
Date Tue, 19 Nov 2019 09:51:06 GMT

No, there is no way to load CSV files with irregular dimensions, and we
don't have any plans currently to support them.  Sorry :-(

Regards

Antoine.


On 19/11/2019 at 05:54, Micah Kornfield wrote:
> +dev@arrow to see if there is a more definitive answer, but I don't believe
> this type of functionality is supported currently.
> 
> 
> 
> 
> On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
> elisa.scandellari@gmail.com> wrote:
> 
>> Hi,
>> I'm trying to improve the performance of my program that loads csv data
>> and manipulates it.
>> My CSV file contains 14 million rows and has a variable number of columns.
>> The first 27 columns will always be available, and a row can have up to 16
>> more columns for a total of 43.
>>
>> Using vanilla pandas I've found this workaround:
>> ```
>> largest_column_count = 0
>> with open(data_file, 'r') as temp_f:
>>     lines = temp_f.readlines()
>>     for l in lines:
>>         column_count = len(l.split(',')) + 1
>>         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
>> temp_f.close()
>> column_names = [i for i in range(0, largest_column_count)]
>> all_columns_df = pd.read_csv(file, header=None, delimiter=',',
>>                              names=column_names,
>>                              dtype='category').replace(pd.np.nan, '', regex=True)
>> ```
>> This will create the table with all my data plus empty cells where the
>> data is not available.
>> With a smaller file, this works perfectly well. With the complete file, my
>> memory usage goes through the roof.
>>
>> I've been reading about Apache Arrow and, after a few attempts to load a
>> structured csv file (same number of columns for every row), I'm extremely
>> impressed.
>> I've tried to load my data file, using the same concept as above:
>> ```
>> fixed_column_names = [str(i) for i in range(0, 27)]
>> extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]
>> total_columns = fixed_column_names
>> total_columns.extend(extra_column_names)
>> read_options = csv.ReadOptions(column_names=total_columns)
>> convert_options = csv.ConvertOptions(include_columns=total_columns,
>>                                      include_missing_columns=True,
>>                                      strings_can_be_null=True)
>> table = csv.read_csv(edr_filename, read_options=read_options, convert_options=convert_options)
>> ```
>> but I get the following error:
>> Exception: CSV parse error: Expected 43 columns, got 32
>>
>> I need to use the csv module provided by pyarrow; otherwise I wouldn't be
>> able to create the pyarrow table to then convert to pandas.
>> ```from pyarrow import csv```
>>
>> I guess that the csv library provided by pyarrow is more streamlined than
>> the complete one.
>>
>> Is there any way I can load this file? Maybe using some ReadOptions and/or
>> ConvertOptions?
>> I'd be using pandas to manipulate the data after it's been loaded.
>>
>> Thank you in advance
>>
>>
> 
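
One possible workaround, not suggested by anyone in this thread and only a
sketch: since the reader rejects rows of differing width, the file could be
normalized before it reaches pyarrow. The helper below (`pad_csv_rows` is a
hypothetical name) uses only the standard-library `csv` module. It streams the
file twice, once to find the widest row and once to pad shorter rows with empty
fields, so it never holds all 14 million lines in memory. It assumes a
comma-delimited file with standard quoting, and it drops completely blank lines.

```python
import csv


def pad_csv_rows(src_path, dst_path, delimiter=","):
    """Copy src_path to dst_path, padding every row to the widest row's length.

    Hypothetical helper, not part of pyarrow. Returns the padded column count.
    """
    # First pass: stream the file to find the widest row.
    width = 0
    with open(src_path, newline="") as src:
        for row in csv.reader(src, delimiter=delimiter):
            width = max(width, len(row))

    # Second pass: write each non-blank row, padded with empty fields.
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for row in csv.reader(src, delimiter=delimiter):
            if row:  # csv.reader yields [] for blank lines; skip them
                writer.writerow(row + [""] * (width - len(row)))
    return width
```

The padded copy has regular dimensions, so it should be readable with the
`read_csv` call from the original question, for example with
`ReadOptions(column_names=[str(i) for i in range(width)])`. Note that the
standard-library `csv` module and `pyarrow.csv` share a name, so one of them
needs an alias if both are imported in the same file.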
