hadoop-common-user mailing list archives

From CubicDesign <cubicdes...@gmail.com>
Subject Re: My input is "plain text" but each record is split on multiple lines - I need help defining the InputFormat
Date Fri, 31 Jul 2009 00:10:49 GMT
Hi Chuck.
Thanks a lot for your answer.

 > Depending on your application that's either no big deal or a deal 

Oops (big OOPS!). This is really bad news for me.
I see only two solutions: either pre-process my files and put
everything on a single row (so the boundary is no longer a problem),
or switch to Java and write a RecordReader, as you said, in order to
read the records properly. But in the latter case I suppose I can no
longer send the records to my EXE file (through streaming). Right?
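
The first option (putting every record on a single row before the job runs) could look roughly like the sketch below. It is only an illustration under assumptions that are not from this thread: that each record starts with a line beginning with ">", that a tab never occurs inside the data and can therefore serve as the separator, and the name flatten_records is made up for the example.

```python
def flatten_records(lines, start_marker=">", sep="\t"):
    """Yield one output line per multi-line record.

    Assumption (hypothetical): a record begins at any line that starts
    with start_marker and runs until the next such line.
    """
    current = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(start_marker) and current:
            # A new record begins, so emit the one collected so far.
            yield sep.join(current)
            current = []
        current.append(line)
    if current:
        # Emit the final record, which has no following marker line.
        yield sep.join(current)


# Small demonstration on made-up input.
sample = [">rec1\n", "ACGT\n", "TTGA\n", ">rec2\n", "GGCC\n"]
for record in flatten_records(sample):
    print(record)
```

With one record per row, a line-based split can no longer cut a record in half, and the program on the receiving side would split each row on the separator to recover the original lines.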

> Mind to share some background on your application?

Well, this is just the beginning. If this isn't going to work, then there is not much to tell.
We need to build a new sequence processing tool that is supposed to replace an old tool (which
can't handle large amounts of data anymore). To begin with, we want to see if Hadoop can
be used to run old biology tools in parallel to speed up the whole process. More precisely,
we want to replace cluster management software (like Sun Grid Engine) with Hadoop. Later we
were supposed to add additional features to post-process the data generated by those biology

As I said, all of these were only plans. We will see what happens now. If it works, I will
return to post a link to what we have achieved.

