hadoop-common-user mailing list archives

From CubicDesign <cubicdes...@gmail.com>
Subject My input is "plain text" but each record is split on multiple lines - I need help defining the InputFormat
Date Tue, 28 Jul 2009 21:25:57 GMT

I want to use Hadoop (Map tasks only) to process a large file. The Map 
should break the input file into records and feed each record to an 
external EXE program. In other words, I don't want to do the processing 
in Map/Reduce itself (the external EXE will do the processing); I only 
want to use Hadoop to run many instances in parallel across the cluster. 
I want to use Python for this.
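
The workflow described above could be sketched as a Python mapper for 
Hadoop Streaming. This is only a minimal sketch under assumptions: it 
presumes Python 3.7+, that each record already arrives as one line on 
standard input, and that the external program reads a record on its 
stdin and writes its result to stdout. The function name run_records 
and the command list are hypothetical.

```python
import subprocess
import sys


def run_records(lines, command):
    """Feed each input line (assumed: one record per line) to `command`.

    `command` is a hypothetical argument list for the external program,
    e.g. ["myprog.exe"]. Yields the program's stdout for each record.
    """
    for raw in lines:
        record = raw.rstrip("\r\n")
        if not record:
            continue  # skip blank lines
        result = subprocess.run(
            command,
            input=record + "\n",   # record goes to the program's stdin
            capture_output=True,
            text=True,
            check=True,            # raise if the program exits non-zero
        )
        yield result.stdout.rstrip("\r\n")


def main(command):
    # Hadoop Streaming delivers input records to the mapper on stdin
    # and collects whatever the mapper prints on stdout.
    for output in run_records(sys.stdin, command):
        print(output)
```

A script built around this would end with a call such as 
main(["myprog.exe"]) and be passed to Hadoop Streaming with -mapper; a 
map-only job can be requested with -numReduceTasks 0.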

My file is a simple TXT file, but unfortunately each record is split 
across multiple lines. A record looks like this:

 > some comment bla-bla

There are multiple records one after another, separated by nothing 
but a newline character. Lines have arbitrary lengths and each record 
has an arbitrary number of lines.
How can I define an InputFormat for this? What is the best solution?
(If necessary I can write a preprocessor that merges the non-comment 
lines of each record into a single line.)
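
The preprocessor idea mentioned above could look roughly like the 
following. This is a sketch, not a tested solution for the real data: it 
assumes every record starts with a comment line beginning with ">" and 
that all following lines up to the next ">" line belong to that record. 
The name merge_records and the tab separator are my own choices.

```python
import sys


def merge_records(lines, marker=">"):
    """Yield one line per record: comment line, a tab, merged data lines.

    Assumption: a line starting with `marker` opens a new record and
    every other non-empty line is data belonging to the current record.
    """
    header, parts = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(marker):
            if header is not None:
                yield header + "\t" + "".join(parts)
            # Tabs in the comment are replaced so the separator stays unique.
            header, parts = line.replace("\t", " "), []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header + "\t" + "".join(parts)  # flush the last record


def main():
    # Read the multi-line file on stdin, write one record per line.
    for record in merge_records(sys.stdin):
        print(record)
```

With one record per line, the default line-oriented text input should 
be enough and no custom InputFormat is needed. To use the sketch as a 
script, call main() at the end of the file.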

Any help that points a beginner in the right direction would be 
much appreciated.
Many thanks.
