hadoop-common-user mailing list archives

From Stephen Watt <sw...@us.ibm.com>
Subject Re: Bible Code and some input format ideas
Date Tue, 12 Jan 2010 15:45:16 GMT
Neat! Please keep the list apprised when you have something to demo.

Kind regards
Steve Watt



From: Edward Capriolo <edlinuxguru@gmail.com>
To: common-user@hadoop.apache.org
Date: 01/11/2010 01:55 PM
Subject: Bible Code and some input format ideas



Hey all,
I saw a special on Discovery about the Bible code.
http://en.wikipedia.org/wiki/Bible_code

I am designing something in Hadoop to do Bible code on any text (not
just the Bible). I have a rough idea of how to make all the parts
efficient in map/reduce. I have a little challenge that I originally
thought I could solve with a custom InputFormat, but it seems I
may have to do this in a standalone program.

Let's assume your input looks like this:

Is there any
bible-code in this
text? I don't know.

The end result might look like this (assuming I take every 5th letter):

irbcn
tdn__
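
A minimal sketch of how the sample above could be reproduced. It assumes
details that are inferred from the sample output rather than stated in the
message: only letters are kept, they are lowercased, counting starts at the
first letter, and '_' pads the last row. The name skip_letters is
hypothetical.

```python
def skip_letters(text, step=5, pad="_"):
    """Return every `step`-th letter of `text`, wrapped into rows of `step`."""
    # Keep only letters, lowercased (assumption inferred from the sample).
    cleaned = "".join(ch.lower() for ch in text if ch.isalpha())
    # Take every `step`-th letter, starting from the first one.
    picked = cleaned[::step]
    # Wrap the picked letters into rows of `step` characters.
    rows = [picked[i:i + step] for i in range(0, len(picked), step)]
    # Pad the last row so that every row has the same width.
    if rows and len(rows[-1]) < step:
        rows[-1] = rows[-1].ljust(step, pad)
    return rows


sample = "Is there any\nbible-code in this\ntext? I don't know.\n"
print("\n".join(skip_letters(sample)))
# prints:
# irbcn
# tdn__
```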

The first part of the process: given an input text, we have to strip
out a user-configured list of characters, e.g. '\t' '-' '.' '?'. That I have
no problem with.

In the second part of the process, I would like to get the data to be the
proper width, in this case 5 characters. This is a challenge because,
assuming a line is 5 characters, e.g. 'done?', once it is cleaned it
will be 4 characters: 'done'. This -1 offset shifts the rest of the
data, the next line might add another offset, and so on.
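
To illustrate the drift, here is a small sketch that computes where each
input line starts in the cleaned character stream. The function name
start_offsets and the second sample line are hypothetical; the strip list is
the one from the example above. A line's starting position depends on the
cleaned lengths of all the lines before it, which is why a task that sees
only one line cannot place it on its own.

```python
def start_offsets(lines, strip_chars):
    """Return the position of each line's first character after cleaning."""
    offsets = []
    total = 0
    for line in lines:
        offsets.append(total)
        # Count only the characters that survive cleaning.
        total += sum(1 for ch in line if ch not in strip_chars)
    return offsets


lines = ["done?", "next."]            # two 5-character input lines
offsets = start_offsets(lines, "\t-.?")
print(offsets)                        # [0, 4]
print([o % 5 for o in offsets])       # [0, 4]
```

The second line starts at column 4 of the first 5-wide row instead of at
column 0 of the second row, so every later row is shifted as well.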

Originally I was thinking I could create an NCharacterInputFormat, but it
seems like this stage of the process cannot easily be done in
map/reduce. I guess I need to write a single-threaded program to read
through the data and produce the correct offsets (5 characters per line),
unless someone else has an idea.
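
A rough sketch of what such a single-threaded pass might look like, not a
definitive implementation. The name reflow and its parameters are
hypothetical, and the default strip list adds space and newline to the
characters named above (an assumption, since the lines have to be joined
before they can be re-wrapped). It streams the input once and keeps at most
one row in memory.

```python
import io


def reflow(src, dst, width=5, strip_chars="\t-.? \n"):
    """Copy src to dst, dropping strip_chars and wrapping at `width`."""
    row = []
    for line in src:
        for ch in line:
            if ch in strip_chars:
                continue
            row.append(ch)
            # Emit a row as soon as it reaches the target width.
            if len(row) == width:
                dst.write("".join(row) + "\n")
                row.clear()
    # Emit whatever is left as a final, shorter row.
    if row:
        dst.write("".join(row) + "\n")


src = io.StringIO("done?\nnext line.\n")
out = io.StringIO()
reflow(src, out)
print(out.getvalue(), end="")
# prints:
# donen
# extli
# ne
```

With real files, src and dst would be file objects from open(); the output
then has fixed-width rows that could be fed to a later map/reduce stage.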


