arrow-user mailing list archives

From Elisa Scandellari <elisa.scandell...@gmail.com>
Subject Re: pyarrow read_csv with different amount of columns per row
Date Wed, 20 Nov 2019 08:02:51 GMT
Hi,
Thank you for your replies. My solution is very similar to Maarten's
suggestion.
Using the method from my first email, I wrote a new file containing only the
data I needed, with a fixed number of columns.
After that, I just load the new optimised file with Apache Arrow.
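For reference, that rewrite step looks roughly like the sketch below (a minimal illustration, not my exact code; `in_path`/`out_path` are placeholder names, and it assumes comma-delimited input with no quoted fields containing commas):

```python
def pad_csv(in_path, out_path, delimiter=","):
    """Rewrite a ragged CSV so every row has the same number of columns.

    Pass 1 finds the widest row; pass 2 writes each row back out, padded
    with empty trailing fields. Both passes read line by line, so memory
    use stays small even for 14 million rows.
    """
    # Pass 1: determine the largest column count.
    widest = 0
    with open(in_path) as f:
        for line in f:
            widest = max(widest, line.rstrip("\n").count(delimiter) + 1)
    # Pass 2: pad each row to that width and write it out.
    with open(in_path) as f, open(out_path, "w") as out:
        for line in f:
            fields = line.rstrip("\n").split(delimiter)
            fields += [""] * (widest - len(fields))
            out.write(delimiter.join(fields) + "\n")
    return widest
```

The padded file then has a fixed shape, so reading it with pyarrow's `csv.read_csv` no longer hits the "Expected N columns" parse error.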

Thank you all,
Elisa

On Tue, 19 Nov 2019 at 20:09, Maarten Ballintijn <maartenb@xs4all.nl> wrote:

> Hi Elisa,
>
> One option is to preprocess the file and add the missing columns.
> You can do this in two passes: read once to determine the largest number
> of columns, then write the lines out again, filled out to that number of
> columns. This does not need much memory, as you can read line by line
> (see
> http://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
> ).
>
> Another option is to make an iterator to do this on the fly.
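That on-the-fly option might look something like this sketch (a hypothetical helper, not code from the thread; it assumes comma-delimited lines with no quoted commas, and that the target width is already known, e.g. 43 or from a first pass):

```python
def padded_lines(lines, width, delimiter=","):
    """Yield each CSV line padded with empty fields up to `width` columns."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        fields += [""] * (width - len(fields))
        yield delimiter.join(fields) + "\n"
```

Joining the generator's output into an in-memory buffer (e.g. `io.StringIO`) gives a file-like object that pandas' `read_csv` accepts, so no intermediate file is needed.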
>
> Either way the resulting DataFrame is going to be large:
>
> 14e6 rows * 43 columns * 8 bytes/float ~ 5 Gbyte  (in the best case)
>
> Cheers,
> Maarten
>
>
>
> On Nov 19, 2019, at 4:51 AM, Antoine Pitrou <antoine@python.org> wrote:
>
>
> No, there is no way to load CSV files with irregular dimensions, and we
> don't have any plans currently to support them.  Sorry :-(
>
> Regards
>
> Antoine.
>
>
> On 19/11/2019 at 05:54, Micah Kornfield wrote:
>
> +dev@arrow to see if there is a more definitive answer, but I don't
> believe
> this type of functionality is supported currently.
>
>
>
>
> On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
> elisa.scandellari@gmail.com> wrote:
>
> Hi,
> I'm trying to improve the performance of my program that loads csv data
> and manipulates it.
> My CSV file contains 14 million rows and has a variable number of columns.
> The first 27 columns are always present, and a row can have up to 16 more,
> for a total of 43.
>
> Using vanilla pandas I've found this workaround:
> ```
> largest_column_count = 0
> with open(data_file, 'r') as temp_f:
>     lines = temp_f.readlines()
>     for l in lines:
>         column_count = len(l.split(',')) + 1
>         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
> temp_f.close()
> column_names = [i for i in range(0, largest_column_count)]
> all_columns_df = pd.read_csv(file, header=None, delimiter=',',
>                              names=column_names,
>                              dtype='category').replace(pd.np.nan, '', regex=True)
> ```
> This creates a table with all my data, plus empty cells where the data is
> not available.
> With a smaller file this works perfectly well; with the complete file, my
> memory usage goes through the roof.
>
> I've been reading about Apache Arrow and, after a few attempts to load a
> structured csv file (same number of columns for every row), I'm extremely
> impressed.
> I've tried to load my data file, using the same concept as above:
> ```
> fixed_column_names = [str(i) for i in range(0, 27)]
> extra_column_names = [str(i) for i in range(len(fixed_column_names),
>                                             largest_column_count)]
> total_columns = fixed_column_names
> total_columns.extend(extra_column_names)
> read_options = csv.ReadOptions(column_names=total_columns)
> convert_options = csv.ConvertOptions(include_columns=total_columns,
>                                      include_missing_columns=True,
>                                      strings_can_be_null=True)
> table = csv.read_csv(edr_filename, read_options=read_options,
>                      convert_options=convert_options)
> ```
> but I get the following error:
> ```Exception: CSV parse error: Expected 43 columns, got 32```
>
> I need to use the csv module provided by pyarrow; otherwise I wouldn't be
> able to create the pyarrow table and then convert it to pandas:
> ```from pyarrow import csv```
>
> I guess the csv reader provided by pyarrow is more streamlined than the
> full Python one.
>
> Is there any way I can load this file? Maybe using some ReadOptions and/or
> ConvertOptions?
> I'd be using pandas to manipulate the data after it's been loaded.
>
> Thank you in advance
>
