beam-commits mailing list archives

From "Ahmet Altay (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range
Date Mon, 28 Aug 2017 21:12:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144358#comment-16144358 ]

Ahmet Altay commented on BEAM-2815:
-----------------------------------

This is a known issue and is tracked in the ticket you mentioned (https://issues.apache.org/jira/browse/BEAM-1442).

There are a series of improvements that could be applied. Would you be interested in helping in
this area?
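
Until such improvements land, a common stopgap for local development is to iterate against a small
sample of the input rather than the full file. A minimal sketch follows (the helper name, file paths,
and sample size are illustrative only, not part of the Beam API):

{code}
# Hypothetical helper: copy the first N lines of a large CSV to a smaller
# sample file, so a DirectRunner pipeline can be exercised locally without
# reading the full dataset.
import itertools

def write_sample(full_path, sample_path, num_lines=10000):
    with open(full_path) as src, open(sample_path, 'w') as dst:
        # islice streams the first num_lines lines without loading the whole file.
        dst.writelines(itertools.islice(src, num_lines))

if __name__ == '__main__':
    write_sample('full_input.csv', 'sample_input.csv')
{code}

The sampled file could then be passed as --input to a pipeline like the repro script quoted below.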

cc: [~charleschen]

> Python DirectRunner is unusable with input files in the 100-250MB range
> -----------------------------------------------------------------------
>
>                 Key: BEAM-2815
>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct, sdk-py
>    Affects Versions: 2.1.0
>         Environment: python 2.7.10, beam 2.1, os x 
>            Reporter: Peter Hausel
>            Assignee: Ahmet Altay
>         Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 2017-08-27 at 9.06.00 AM.png
>
>
> The current Python DirectRunner implementation seems to be unusable with training data sets
> that are bigger than tiny samples, making serious local development impossible or very
> cumbersome. I am aware of some of the limitations of the current DirectRunner
> implementation [1][2][3]; however, I was not sure whether this odd behavior is expected.
> [1] https://stackoverflow.com/a/44765621
> [2] https://issues.apache.org/jira/browse/BEAM-1442
> [3] https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015), and I had to terminate the process
> after about 10 minutes (screenshots showing the high memory and CPU utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>     """Main entry point; defines and runs the pipeline."""
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--input',
>                         dest='input',
>                         default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>                         help='Input file to process.')
>     known_args, pipeline_args = parser.parse_known_args(argv)
>     pipeline_options = PipelineOptions(pipeline_args)
>     pipeline_options.view_as(SetupOptions).save_main_session = True
>     pipeline = beam.Pipeline(options=pipeline_options)
>     raw_data = (
>         pipeline
>         | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
>         | 'Map' >> beam.Map(lambda line: line.lower())
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     print(raw_data)
>
> if __name__ == '__main__':
>     run()
> {code}
> Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> For comparison:
> {code}
> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> This vanilla Python script runs on the same hardware and dataset in 0m4.909s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
