hadoop-hdfs-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Doubts in Map reduce programs
Date Sat, 01 Nov 2014 15:53:11 GMT
One way I can think of is to define your own InputFormat and RecordReader
so that each record is a 'paragraph' or a 'sentence'. The reason is that in
the regular case (TextInputFormat, the usual FileInputFormat for text), a
line terminated by the standard end-of-line characters is treated as one
record. Here you want one paragraph per record instead of one line. Once
you override the RecordReader, you control how a 'record' is defined, and
each record is what gets passed to a call of your map function.
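
To make the idea of "one paragraph = one record" concrete, here is a minimal
sketch in plain Java. It deliberately does not use the Hadoop API:
ParagraphReader, nextParagraph() and countParagraphs() are hypothetical names
of mine, and I am assuming that paragraphs are separated by one or more blank
lines. The loop in nextParagraph() is roughly the logic you would put behind
a custom RecordReader's nextKeyValue().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/**
 * Hypothetical helper, not part of Hadoop: returns one paragraph per call,
 * the way a custom RecordReader would return one record per call.
 * Assumption: paragraphs are separated by one or more blank lines.
 */
public class ParagraphReader {
    private final BufferedReader in;

    public ParagraphReader(BufferedReader in) {
        this.in = in;
    }

    /** Returns the next paragraph, or null when the input is exhausted. */
    public String nextParagraph() throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current paragraph, if one was started.
                if (record.length() > 0) {
                    return record.toString();
                }
                continue; // skip leading or repeated blank lines
            }
            if (record.length() > 0) {
                record.append(' '); // join the lines of one paragraph
            }
            record.append(line.trim());
        }
        // End of input: emit the last paragraph even without a trailing blank line.
        return record.length() > 0 ? record.toString() : null;
    }

    /** Counts records; in a real job the mapper and reducer would do this sum. */
    public static int countParagraphs(String text) throws IOException {
        ParagraphReader reader =
                new ParagraphReader(new BufferedReader(new StringReader(text)));
        int count = 0;
        while (reader.nextParagraph() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String text = "First paragraph,\nsecond line.\n\n\nSecond paragraph.\n";
        System.out.println(countParagraphs(text)); // prints 2
    }
}
```

In an actual job the mapper would emit a count of 1 per record and the
reducer would sum them. Note that this sketch ignores input splits: a real
RecordReader also has to deal with a paragraph that straddles a split
boundary, which the sketch above does not attempt.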

Some starting points: the links below show how to define and implement your
own RecordReader for FileInputFormat:
http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
http://www.infoq.com/articles/HadoopInputFormat
http://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/

Regards,
Shahab


On Sat, Nov 1, 2014 at 11:45 AM, Raghavendra Chandra <
raghavchandra.learning@gmail.com> wrote:

> Hi There,
>
> I have a couple of doubts about Hadoop. It would be really helpful if
> anyone could answer these questions or, if they are already answered
> somewhere, point me to the link.
>
> Below are my doubts:
>
> 1. How do I count the number of paragraphs in a text file using Java
> MapReduce?
>
> 2. How do I count the number of sentences in a paragraph/file using Java
> MapReduce?
>
> Please let me know where I can get the map reduce programs list with
> different use cases.
>
> Looking forward to your responses.
>
>
