arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maarten Ballintijn <maart...@xs4all.nl>
Subject Re: pyarrow read_csv with different amount of columns per row
Date Tue, 19 Nov 2019 20:09:35 GMT
Hi Elisa,

One option is to preprocess the file and add the missing columns.
You can do this using two passes (reading once to determine the number of columns
and once writing out the lines filled out to the right number of columns)
This does not need to take a lot of memory as  you can read line by line.
(see http://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects <http://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects>)

Another option is to make an iterator to do this on the fly.

Either way the resulting DataFrame is going to be large:

14e6 rows * 43 columns * 8 bytes/float ~ 5 Gbyte  (in the best case)

Cheers,
Maarten



> On Nov 19, 2019, at 4:51 AM, Antoine Pitrou <antoine@python.org> wrote:
> 
> 
> No, there is no way to load CSV files with irregular dimensions, and we
> don't have any plans currently to support them.  Sorry :-(
> 
> Regards
> 
> Antoine.
> 
> 
> Le 19/11/2019 à 05:54, Micah Kornfield a écrit :
>> +dev@arrow to see if there is a more definitive answer, but I don't believe
>> this type of functionality is supported currently.
>> 
>> 
>> 
>> 
>> On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
>> elisa.scandellari@gmail.com> wrote:
>> 
>>> Hi,
>>> I'm trying to improve the performance of my program that loads csv data
>>> and manipulates it.
>>> My CSV file contains 14 million rows and has a variable amount of columns.
>>> The first 27 columns will always be available, and a row can have up to 16
>>> more columns for a total of 43.
>>> 
>>> Using vanilla pandas I've found this workaround:
>>> ```
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *largest_column_count = 0with open(data_file, 'r') as temp_f:    lines =
>>> temp_f.readlines()    for l in lines:        column_count =
>>> len(l.split(',')) + 1        largest_column_count = column_count if
>>> largest_column_count < column_count else
>>> largest_column_counttemp_f.close()column_names = [i for i in range(0,
>>> largest_column_count)]all_columns_df = pd.read_csv(file, header=None,
>>> delimiter=',', names=column_names, dtype='category').replace(pd.np.nan, '',
>>> regex=True)*```
>>> This will create the table with all my data plus empty cells where the
>>> data is not available.
>>> With a smaller file, this works perfectly well. With the complete file, my
>>> memory usage goes over the roof.
>>> 
>>> I've been reading about Apache Arrow and, after a few attempts to load a
>>> structured csv file (same amount of columns for every row), I'm extremely
>>> impressed.
>>> I've tried to load my data file, using the same concept as above:
>>> ```
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *fixed_column_names = [str(i) for i in range(0, 27)]extra_column_names =
>>> [str(i) for i in range(len(fixed_column_names),
>>> largest_column_count)]total_columns =
>>> fixed_column_namestotal_columns.extend(extra_column_names)read_options =
>>> csv.ReadOptions(column_names=total_columns)convert_options =
>>> csv.ConvertOptions(include_columns=total_columns,
>>>           include_missing_columns=True,
>>> strings_can_be_null=True)table = csv.read_csv(edr_filename,
>>> read_options=read_options, convert_options=convert_options)*
>>> ```
>>> but I get the following error
>>> ****Exception: CSV parse error: Expected 43 columns, got 32****
>>> 
>>> I need to use the csv provided by pyarrow, if not I wouldn't be able to
>>> create the pyarrow table to then convert to pandas
>>> ```from pyarrow import csv```
>>> 
>>> I guess that the csv library provided by pyarrow is more streamlined than
>>> the complete one.
>>> 
>>> Is there any way I can load this file? Maybe using some ReadOptions and/or
>>> ConvertOptions?
>>> I'd be using pandas to manipulate the data after it's been loaded.
>>> 
>>> Thank you in advance
>>> 
>>> 
>> 


Mime
View raw message