hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Bible Code and some input format ideas
Date Tue, 12 Jan 2010 22:49:38 GMT
On Tue, Jan 12, 2010 at 5:37 PM, Alan Gates <gates@yahoo-inc.com> wrote:
> I'm guessing that you want to set the width of the text to avoid the issue
> where if you split by block, then all splits but the first will have an
> unknown offset.
>
> Most texts have natural divisions in them which I'm guessing you'll want to
> respect anyway.  In the Bible this would be the different books, in more
> recent books it would be different chapters.  Could you instead set up your
> InputFormat to split on these divisions in the text?  Then you don't have to
> go through this single threaded step.  And in most cases the divisions in
> the text will be small enough to be handled by a single mapper (though not
> necessarily well balanced).
>
> Alan.
>
> On Jan 11, 2010, at 11:52 AM, Edward Capriolo wrote:
>
>> Hey all,
>> I saw a special on discovery about bible code.
>> http://en.wikipedia.org/wiki/Bible_code
>>
>> I am designing something in hadoop to do bible code on any text (not
>> just the bible). I have a rough idea on how to make all the parts
>> efficient in map reduce. I have a little challenge I originally
>> thought I could solve with with a custom InputFormat but it seems I
>> may have to do this in a stand alone program.
>>
>> Lets assume your input looks like this:
>>
>> Is there any
>> bible-code in this
>> text? I don't know.
>>
>> The end result might look like this ( assuming I take every 5th letter.)
>>
>> irbcn
>> tdn__
>>
>> The first part of the process is given an input text we have to strip
>> out a user configured list of things '\t' '-' '.' '?' .  That I have
>> no problem with.
>>
>> The second part of the process, I would like to get the data to be the
>> proper width, in this case 5 characters. This is a challenge because
>> assuming a line is 5 characters e.g. 'done?' Once it is cleaned it
>> will be 4 characters  'done'. This -1 offsets changes the rest of the
>> data, the next line might have another offset, so on and so on.
>>
>> Originally I was thinking I could create NCharacterInputFormat, but it
>> seems like this stage of the process can not easily be done in
>> map/reduce. I guess I need to write a single threaded program to read
>> through the data and make the correct offsets (5 characters per line).
>> Unless someone else has an idea.
>
>

Alan,

You could split by book, there is no standard way to do it really. I
spoke to someone off list that told me a single threaded program was
able to make skips from 2-3000, starting at each different offset of
moby dick in 12 minutes. Given that a text copy of the bible was 4MB,
your conclusion

> And in most cases the divisions in
> the text will be small enough to be handled by a single mapper (though not
> necessarily well balanced).

is correct. I am thinking to have each map reduce process work with
the entire data and just one skip

mapper1 bible.txt skip2 all permutations
mapper1 bible.txt skip3 all permutations
mapper1 bible.txt skip4 all permutations
mapper1 bible.txt skipN all permutations

So map process might collect something lie

"bible.txt", 2 ,   [grid as bytes] [grid as bytes]
"bible.txt", 3 ,   [grid as bytes] [grid as bytes] [grid as bytes]

Since I have been told making the permutations take 12 minutes single
threaded, I am not going to over think it :) Hopefully doing the word
search on the grids will give hadoop more of a workout :)

Mime
View raw message