hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Reading multiple lines from a microsoft doc in hadoop
Date Sat, 25 Aug 2012 18:17:19 GMT
Hi Siddharth,

First of all, please understand the medium: mailing lists aren't
immediate or interactive help channels, so please be patient with those
who help you on their own time. Secondly, please read
http://www.catb.org/~esr/faqs/smart-questions.html to understand
why certain etiquette benefits both sides.

Your requirement here seems to be that you want to read all the text in
a file as records separated by two newlines (that is, by a blank line).
Depending on the version of Hadoop you use, I think you can probably set
"textinputformat.record.delimiter" to "\n\n" or "\r\n\r\n" to get
this working with the stock TextInputFormat itself.
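
To make the effect of such a delimiter concrete, here is a minimal sketch
in plain Java (no Hadoop needed) that splits text into records at a blank
line, roughly the way a delimiter-aware record reader would hand them to a
mapper. The names ParagraphRecords and splitRecords are made up for
illustration and are not part of Hadoop. In a real job the only change
would be setting the property above on the job's Configuration, as the
comment shows, and that only helps if your Hadoop version honours it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: it only imitates delimiter-based
// record splitting so the expected records can be inspected locally.
// In a real job the equivalent would be (assuming your Hadoop version
// supports the property):
//   conf.set("textinputformat.record.delimiter", "\n\n");
class ParagraphRecords {

    // Splits text into records at each occurrence of the delimiter.
    // The delimiter itself is not included in the returned records.
    static List<String> splitRecords(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Any text after the last delimiter forms the final record.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String doc = "entry one, line 1\nentry one, line 2\n\n"
                + "entry two\n\nentry three\n";
        List<String> records = splitRecords(doc, "\n\n");
        for (int i = 0; i < records.size(); i++) {
            // Show embedded newlines visibly so each record stays on one line.
            System.out.println("record " + i + ": "
                    + records.get(i).replace("\n", "\\n"));
        }
    }
}
```

If the text was extracted on Windows, the blank line is more likely to be
"\r\n\r\n", so pass that as the delimiter instead.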

On Sat, Aug 25, 2012 at 5:37 PM, Siddharth Tiwari
<siddharth.tiwari@live.com> wrote:
>
> Can anybody enlighten me on what could be wrong?
>
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God."
> "Maybe other people will try to limit me but I don't limit myself"
>
>
> ________________________________
> From: siddharth.tiwari@live.com
> To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
> Subject: RE: Reading multiple lines from a microsoft doc in hadoop
> Date: Sat, 25 Aug 2012 05:35:48 +0000
>
>
>
> Any help on the below would be really appreciated. I am stuck with it.
>
>
>
> ________________________________
> From: siddharth.tiwari@live.com
> To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
> Subject: RE: Reading multiple lines from a microsoft doc in hadoop
> Date: Fri, 24 Aug 2012 20:23:45 +0000
>
> Hi ,
>
> Can anyone please help ?
>
> Thank you in advance
>
>
>
>
> ________________________________
> From: siddharth.tiwari@live.com
> To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
> Subject: RE: Reading multiple lines from a microsoft doc in hadoop
> Date: Fri, 24 Aug 2012 16:22:57 +0000
>
> Hi Team,
>
> Thanks a lot for so many good suggestions. I wrote a custom input format for
> reading one paragraph at a time, but when I use it I still get single lines.
> Can you please suggest what changes I must make to read one paragraph at a
> time, with paragraphs separated by blank lines?
> Below is the code I wrote:
>
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.JobContext;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
> import org.apache.hadoop.util.LineReader;
>
>
>
>
> /**
>  *
>  */
>
> /**
>  * @author 460615
>  *
>  */
> //FileInputFormat is the base class for all file-based InputFormats
> public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {
>     private String nullRegex = "^\\s*$";
>     public String StrLine = null;
>
>     /*public RecordReader<LongWritable, Text> getRecordReader(InputSplit
>             genericSplit, JobConf job, Reporter reporter) throws IOException {
>         reporter.setStatus(genericSplit.toString());
>         return new ParaInputFormat(job, (FileSplit) genericSplit);
>     }*/
>
>     public RecordReader<LongWritable, Text> createRecordReader(
>             InputSplit genericSplit, TaskAttemptContext context)
>             throws IOException {
>         context.setStatus(genericSplit.toString());
>         return new LineRecordReader();
>     }
>
>     public InputSplit[] getSplits(JobContext job, Configuration conf)
>             throws IOException {
>         ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
>         for (FileStatus status : listStatus(job)) {
>             Path fileName = status.getPath();
>             if (status.isDir()) {
>                 throw new IOException("Not a file: " + fileName);
>             }
>             FileSystem fs = fileName.getFileSystem(conf);
>             LineReader lr = null;
>             try {
>                 FSDataInputStream in = fs.open(fileName);
>                 lr = new LineReader(in, conf);
>                 // String regexMatch = in.readLine();
>                 Text line = new Text();
>                 long begin = 0;
>                 long length = 0;
>                 int num = -1;
>                 String boolTest = null;
>                 boolean match = false;
>                 Pattern p = Pattern.compile(nullRegex);
>                 // Matcher matcher = new p.matcher();
>                 while ((boolTest = in.readLine()) != null
>                         && (num = lr.readLine(line)) > 0
>                         && !(in.readLine().isEmpty())) {
>                     // numLines++;
>                     length += num;
>                     splits.add(new FileSplit(fileName, begin, length,
>                             new String[]{}));
>                 }
>                 begin = length;
>             } finally {
>                 if (lr != null) {
>                     lr.close();
>                 }
>             }
>         }
>         return splits.toArray(new FileSplit[splits.size()]);
>     }
> }
>
>
>
>
>
>
>
>> Date: Fri, 24 Aug 2012 09:54:10 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: haavard.kongsgaard@gmail.com
>> To: user@hadoop.apache.org
>>
>> Hi, maybe you should check out the old Nutch project
>> http://nutch.apache.org/ (Hadoop was developed for Nutch).
>> It's a web crawler and indexer, but the mailing lists hold much info on
>> doc/pdf parsing, which also relates to Hadoop.
>>
>> I have never parsed many docx or doc files, but it should be
>> straightforward. Generally, though, for text analysis preprocessing is the
>> KEY! For example, replacing double line breaks (\r\n\r\n or \n\n) with ####
>> is a simple trick.
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
>> <siddharth.tiwari@live.com> wrote:
>> > Hi,
>> > Thank you for the suggestion. Actually I was using POI to extract text,
>> > but now that I have so many documents I thought I would use Hadoop
>> > directly to parse them as well. The average size of each document is
>> > around 120 KB. Also, I want to read multiple lines from the text until I
>> > find a blank line. I do not have any idea about how to design a custom
>> > input format and record reader. Please help with some tutorial, code or
>> > resource around it. I am struggling with the issue. I will be highly
>> > grateful. Thank you so much once again.
>> >
>> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> >> From: haavard.kongsgaard@gmail.com
>> >> To: user@hadoop.apache.org
>> >
>> >>
>> >> It's much easier if you convert the documents to text first
>> >>
>> >> use
>> >> http://tika.apache.org/
>> >>
>> >> or some other doc parser
>> >>
>> >>
>> >> -Håvard
>> >>
>> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> >> <siddharth.tiwari@live.com> wrote:
>> >> > Hi,
>> >> > I have documents in MS Word doc and docx format. These have entries
>> >> > which are separated by an empty line. Is it possible for me to read
>> >> > one such entry (the lines between empty lines) at a time? Also, which
>> >> > input format shall I use to read doc and docx files? Please help.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Håvard Wahl Kongsgård
>> >> Faculty of Medicine &
>> >> Department of Mathematical Sciences
>> >> NTNU
>> >>
>> >> http://havard.security-review.net/
>>
>>
>>



-- 
Harsh J
