beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Hausel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range
Date Mon, 28 Aug 2017 21:06:00 GMT

     [ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Peter Hausel updated BEAM-2815:
-------------------------------
    Description: 
The current python DirectRunner implementation seems to be unusable with training data sets
that are bigger than tiny samples - making serious local development impossible or very cumbersome.
I am aware of some of the limitations of the current DirectRunner implementation[1][2][3],
however I was not sure if this odd behavior is expected.


[1][2][3]
https://stackoverflow.com/a/44765621
https://issues.apache.org/jira/browse/BEAM-1442
https://beam.apache.org/documentation/runners/direct/

Repro:
The simple script below blew up my laptop (MBP 2015) and had to terminate the process after
10 minutes or so (screenshots about high memory and CPU utilization are also attached).

{code}
from apache_beam.io import textio
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import argparse

def run(argv=None):
     """Main entry point; defines and runs the pipeline."""
     parser = argparse.ArgumentParser()
     parser.add_argument('--input',
                      dest='input',
                      default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
                      help='Input file to process.')
     known_args, pipeline_args = parser.parse_known_args(argv)
     pipeline_options = PipelineOptions(pipeline_args)
     pipeline_options.view_as(SetupOptions).save_main_session = True
     pipeline = beam.Pipeline(options=pipeline_options)
     raw_data = (
           pipeline
           | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
           | 'Map' >> beam.Map(lambda line: line.lower())
     )
     result = pipeline.run()
     result.wait_until_finish()
     print(raw_data)

if __name__ == '__main__':
  run()
{code}

Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009

for comparison: 

{code}
lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
print(len(lines))
{code}

this vanilla python script runs on the same hardware and dataset in 0m4.909s. 

  was:
The current python DirectRunner implementation seems to be unusable with training data sets
that are bigger than tiny samples - making serious local development impossible or very cumbersome.
I am aware of some of the limitations of the current DirectRunner implementation[1][2][3],
however I was not sure if this odd behavior is expected.


[1][2][3]
https://stackoverflow.com/a/44765621
https://issues.apache.org/jira/browse/BEAM-1442
https://beam.apache.org/documentation/runners/direct/

Repro:
The simple script below blew up my laptop (MBP 2015) and had to terminate the process after
10 minutes or so (screenshots about high memory and CPU utilization are also attached).

{code}
from apache_beam.io import textio
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import argparse

def run(argv=None):
     """Main entry point; defines and runs the wordcount pipeline."""
     parser = argparse.ArgumentParser()
     parser.add_argument('--input',
                      dest='input',
                      default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
                      help='Input file to process.')
     known_args, pipeline_args = parser.parse_known_args(argv)
     pipeline_options = PipelineOptions(pipeline_args)
     pipeline_options.view_as(SetupOptions).save_main_session = True
     pipeline = beam.Pipeline(options=pipeline_options)
     raw_data = (
           pipeline
           | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
           | 'Map' >> beam.Map(lambda line: line.lower())
     )
     result = pipeline.run()
     result.wait_until_finish()
     print(raw_data)

if __name__ == '__main__':
  run()
{code}

Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009

for comparison: 

{code}
lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
print(len(lines))
{code}

this vanilla python script runs on the same hardware and dataset in 0m4.909s. 


> Python DirectRunner is unusable with input files in the 100-250MB range
> -----------------------------------------------------------------------
>
>                 Key: BEAM-2815
>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct
>    Affects Versions: 2.1.0
>         Environment: python 2.7.10, beam 2.1, os x 
>            Reporter: Peter Hausel
>            Assignee: Thomas Groh
>
> The current python DirectRunner implementation seems to be unusable with training data
sets that are bigger than tiny samples - making serious local development impossible or very
cumbersome. I am aware of some of the limitations of the current DirectRunner implementation[1][2][3],
however I was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015) and had to terminate the process
after 10 minutes or so (screenshots about high memory and CPU utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>      """Main entry point; defines and runs the pipeline."""
>      parser = argparse.ArgumentParser()
>      parser.add_argument('--input',
>                       dest='input',
>                       default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>                       help='Input file to process.')
>      known_args, pipeline_args = parser.parse_known_args(argv)
>      pipeline_options = PipelineOptions(pipeline_args)
>      pipeline_options.view_as(SetupOptions).save_main_session = True
>      pipeline = beam.Pipeline(options=pipeline_options)
>      raw_data = (
>            pipeline
>            | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
>            | 'Map' >> beam.Map(lambda line: line.lower())
>      )
>      result = pipeline.run()
>      result.wait_until_finish()
>      print(raw_data)
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison: 
> {code}
> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message