beam-commits mailing list archives

From "Peter Hausel (JIRA)" <j...@apache.org>
Subject [jira] [Created] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range
Date Mon, 28 Aug 2017 21:05:00 GMT
Peter Hausel created BEAM-2815:
----------------------------------

             Summary: Python DirectRunner is unusable with input files in the 100-250MB range
                 Key: BEAM-2815
                 URL: https://issues.apache.org/jira/browse/BEAM-2815
             Project: Beam
          Issue Type: Bug
          Components: runner-direct
    Affects Versions: 2.1.0
         Environment: python 2.7.10, beam 2.1, os x 
            Reporter: Peter Hausel
            Assignee: Thomas Groh


The current Python DirectRunner implementation seems to be unusable with training data sets
that are larger than tiny samples, which makes serious local development impossible or very cumbersome.
I am aware of some of the limitations of the current DirectRunner implementation [1][2][3],
but I was not sure whether this behavior is expected.


[1] https://stackoverflow.com/a/44765621
[2] https://issues.apache.org/jira/browse/BEAM-1442
[3] https://beam.apache.org/documentation/runners/direct/

Repro:
The simple script below blew up my laptop (MBP 2015); I had to terminate the process after
10 minutes or so (screenshots showing the high memory and CPU utilization are also attached).

{code}
from apache_beam.io import textio
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import argparse

def run(argv=None):
    """Main entry point; defines and runs the repro pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
                        help='Input file to process.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline = beam.Pipeline(options=pipeline_options)

    # Read the CSV (skipping the header row) and lower-case every line.
    raw_data = (
        pipeline
        | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
        | 'Map' >> beam.Map(lambda line: line.lower())
    )

    result = pipeline.run()
    result.wait_until_finish()
    print(raw_data)

if __name__ == '__main__':
    run()
{code}

Example dataset:  https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
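
The attached screenshots show the memory and CPU spike only qualitatively. If a concrete number is more useful, a sketch along the lines below could log peak resident memory from inside the repro; this is not part of the original script, report_peak_memory is just an illustrative helper, and note that ru_maxrss is reported in bytes on OS X but in kilobytes on Linux.

{code}
import resource
import sys

def report_peak_memory():
    """Print this process's peak resident set size in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on OS X ('darwin') and in kilobytes on Linux.
    divisor = 1024.0 * 1024.0 if sys.platform == 'darwin' else 1024.0
    print('peak RSS: %.1f MB' % (peak / divisor))

# In the repro above, call this right after the pipeline finishes:
#     result.wait_until_finish()
#     report_peak_memory()
{code}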

For comparison:

{code}
lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
print(len(lines))
{code}

This vanilla Python script runs on the same hardware and dataset in 0m4.909s.
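
The 0m4.909s above is the wall-clock time of the plain-Python script. For a comparison that does not depend on the shell's time builtin, a small in-process harness like the sketch below could time both variants; it assumes the Beam repro is importable under a hypothetical module name repro, and timed is only an illustrative helper.

{code}
import time

# 'repro' is a hypothetical module name for the Beam script above;
# adjust the import to wherever that script actually lives.
from repro import run

CSV_PATH = '/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv'

def timed(label, fn):
    """Run fn() and print its wall-clock duration."""
    start = time.time()
    fn()
    print('%s: %.3fs' % (label, time.time() - start))

def vanilla():
    # Same baseline as above: read and lower-case every line eagerly.
    lines = [line.lower() for line in open(CSV_PATH)]
    print(len(lines))

timed('plain python', vanilla)
timed('beam DirectRunner', run)
{code}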



