hadoop-common-user mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Bible Code and some input format ideas
Date Tue, 12 Jan 2010 22:37:09 GMT
I'm guessing that you want to fix the width of the text to avoid the
problem that, if you split by block, every split but the first starts
at an unknown offset in the cleaned text.
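
To make the offset problem concrete, here is a minimal Python sketch
(plain Python, not Hadoop code). It assumes the sample text from the
quoted message, an arbitrary split size of 16 characters, and a strip
set that also removes spaces, newlines and apostrophes; the names
STRIP, kept_count and cleaned_start_offsets are made up for
illustration. It shows that where a split begins in the cleaned text
depends on how many characters survived in every earlier split.

```python
from itertools import accumulate

# Assumption: the configured strip list, extended with space, newline
# and apostrophe so that only letters remain in the sample text.
STRIP = set("\t-.?") | set(" \n'")


def kept_count(chunk):
    """Number of characters in chunk that survive cleaning."""
    return sum(1 for ch in chunk if ch not in STRIP)


def cleaned_start_offsets(text, split_size):
    """Offset in the cleaned text at which each raw split begins.

    The offset of split k is the total kept count of splits 0..k-1,
    so it cannot be known from split k alone.
    """
    chunks = [text[i:i + split_size] for i in range(0, len(text), split_size)]
    counts = [kept_count(c) for c in chunks]
    return [0] + list(accumulate(counts))[:-1]


text = "Is there any\nbible-code in this\ntext? I don't know.\n"
offsets = cleaned_start_offsets(text, 16)
columns = [off % 5 for off in offsets]  # starting column in a 5-wide grid
print(offsets, columns)
```

A mapper that sees only its own split has no way to compute its entry
in offsets without information about the splits before it.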

Most texts have natural divisions in them, which I'm guessing you'll
want to respect anyway.  In the Bible these would be the different
books; in more recent books they would be the chapters.  Could you
instead set up your InputFormat to split on these divisions in the
text?  Then you don't have to go through this single-threaded step.
And in most cases the divisions in the text will be small enough to be
handled by a single mapper (though not necessarily well balanced).
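
A rough sketch of the idea in plain Python (this is not the Hadoop
InputFormat API, only the splitting logic such an InputFormat might
apply). The "Chapter N" heading pattern and the function names are
assumptions made for illustration, and each division is assumed to
start its own grid at offset 0.

```python
import re

# Hypothetical division marker: a line consisting of "Chapter <number>".
HEADING = re.compile(r"^(Chapter \d+)[ \t]*$", re.MULTILINE)


def split_on_divisions(text):
    """Return (heading, body) pairs, one per natural division."""
    parts = HEADING.split(text)
    # parts is [preamble, heading1, body1, heading2, body2, ...]
    return list(zip(parts[1::2], parts[2::2]))


def every_nth_letter(body, n):
    """Keep only letters, lowercase them, and take every n-th one."""
    letters = [ch.lower() for ch in body if ch.isalpha()]
    return "".join(letters[::n])


text = (
    "Chapter 1\nIs there any\nbible-code in this\n"
    "Chapter 2\ntext? I don't know.\n"
)
results = [(h, every_nth_letter(b, 5)) for h, b in split_on_divisions(text)]
print(results)
```

Because each division is processed on its own, no division needs to
know how many characters were stripped from the ones before it.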


On Jan 11, 2010, at 11:52 AM, Edward Capriolo wrote:

> Hey all,
> I saw a special on Discovery about the Bible code.
> http://en.wikipedia.org/wiki/Bible_code
> I am designing something in Hadoop to do Bible code on any text (not
> just the Bible). I have a rough idea of how to make all the parts
> efficient in MapReduce. I have a little challenge that I originally
> thought I could solve with a custom InputFormat, but it seems I
> may have to do this in a standalone program.
> Let's assume your input looks like this:
> Is there any
> bible-code in this
> text? I don't know.
> The end result might look like this (assuming I take every 5th letter):
> irbcn
> tdn__
> The first part of the process: given an input text, we have to strip
> out a user-configured list of characters ('\t', '-', '.', '?'). That
> I have no problem with.
> In the second part of the process, I would like to get the data to be
> the proper width, in this case 5 characters. This is a challenge:
> suppose a line is 5 characters, e.g. 'done?'. Once it is cleaned it
> will be 4 characters, 'done'. This offset of -1 shifts the rest of the
> data, the next line might add another offset, and so on.
> Originally I was thinking I could create an NCharacterInputFormat, but
> it seems like this stage of the process cannot easily be done in
> map/reduce. I guess I need to write a single-threaded program to read
> through the data and produce the correct offsets (5 characters per
> line), unless someone else has an idea.
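
For reference, the single-threaded clean-and-rewrap step described in
the quoted message might look like the following minimal Python sketch.
It is an illustration under assumptions, not anyone's actual
implementation: the strip set is extended with space, newline and
apostrophe, letters are lowercased to match the sample output, and the
function names are made up.

```python
def clean(text, strip_chars):
    """Drop every character in strip_chars and lowercase the rest."""
    return "".join(ch for ch in text if ch not in strip_chars).lower()


def rewrap(cleaned, width):
    """Cut the cleaned text into rows of `width` characters.

    The last row may be shorter than `width`.
    """
    return [cleaned[i:i + width] for i in range(0, len(cleaned), width)]


def first_column(rows, width, pad="_"):
    """Take the first letter of each row and lay those out `width` per line."""
    letters = "".join(row[0] for row in rows)
    return [
        letters[i:i + width].ljust(width, pad)
        for i in range(0, len(letters), width)
    ]


text = "Is there any\nbible-code in this\ntext? I don't know.\n"
# Assumption: the configured list plus space, newline and apostrophe.
strip_chars = set("\t-.?") | set(" \n'")

rows = rewrap(clean(text, strip_chars), 5)
skip_text = first_column(rows, 5)
print(rows)
print(skip_text)
```

With the sample input this yields the two lines shown in the quoted
message, irbcn and tdn__.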
