hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Siddharth Tiwari <siddharth.tiw...@live.com>
Subject RE: Reading multiple lines from a microsoft doc in hadoop
Date Fri, 24 Aug 2012 16:22:57 GMT

Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for reading one paragraph
at a time. But when I use it I get lines read. Can you please suggest what changes I must
make to read one para at a time seperated by null lines ?
below is the code I wrote:-


import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;




/**
 * 
 */

/**
 * @author 460615
 *
 */
//FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable,Text> {
private String nullRegex = "^\\s*$" ;
public String StrLine = null;
/*public RecordReader<LongWritable, Text> getRecordReader (InputSplit genericSplit,
JobConf job, Reporter reporter) throws IOException {
reporter.setStatus(genericSplit.toString());
return new ParaInputFormat(job, (FileSplit)genericSplit);
}*/
public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit,
TaskAttemptContext context)throws IOException {
   context.setStatus(genericSplit.toString());
   return new LineRecordReader();
 }


public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
for (FileStatus status : listStatus(job)) {
Path fileName = status.getPath();
if (status.isDir()) {
throw new IOException("Not a file: " + fileName);
}
FileSystem  fs = fileName.getFileSystem(conf);
LineReader lr = null;
try {
FSDataInputStream in  = fs.open(fileName);
lr = new LineReader(in, conf);
// String regexMatch =in.readLine();
Text line = new Text();
long begin = 0;
long length = 0;
int num = -1;
String boolTest = null;
boolean match = false;
Pattern p = Pattern.compile(nullRegex);
// Matcher matcher = new p.matcher();
while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0 &&
! ( in.readLine().isEmpty())){
// numLines++;
length += num;
 
 
splits.add(new FileSplit(fileName, begin, length, new String[]{}));}
begin=length;
}finally {
if (lr != null) {
lr.close();
}
 
 
 
}
 
}
return splits.toArray(new FileSplit[splits.size()]);
}
 


}




*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
> 
> Hi, maybe you should check out the old nutch project
> http://nutch.apache.org/ (hadoop was developed for nutch).
> It's a web crawler and indexer, but the malinglists hold much info
> doc/pdf parsing which also relates to hadoop.
> 
> Have never parsed many docx or doc files, but it should be
> strait-forward. But generally for text analysis preprocessing is the
> KEY! For example replace dual lines \r\n\r\n or (\n\n) with #### is a
> simple trick)
> 
> 
> -Håvard
> 
> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
> <siddharth.tiwari@live.com> wrote:
> > Hi,
> > Thank you for the suggestion. Actually I was using poi to extract text, but
> > since now  I  have so many  documents I thought I will use hadoop directly
> > to parse as well. Average size of each document is around 120 kb. Also I
> > want to read multiple lines from the text until I find a blank line. I do
> > not have any idea ankit how to design custom input format and record reader.
> > Pleaser help with some tutorial tutorial, code or resource around it. I am
> > struggling with the issue. I will be highly grateful. Thank you so much once
> > again
> >
> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> >> From: haavard.kongsgaard@gmail.com
> >> To: user@hadoop.apache.org
> >
> >>
> >> It's much easier if you convert the documents to text first
> >>
> >> use
> >> http://tika.apache.org/
> >>
> >> or some other doc parser
> >>
> >>
> >> -Håvard
> >>
> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> >> <siddharth.tiwari@live.com> wrote:
> >> > hi,
> >> > I have doc files in msword doc and docx format. These have entries which
> >> > are
> >> > seperated by an empty line. Is it possible for me to read
> >> > these lines separated from empty lines at a time. Also which inpurformat
> >> > shall I use to read doc docx. Please help
> >> >
> >> > *------------------------*
> >> > Cheers !!!
> >> > Siddharth Tiwari
> >> > Have a refreshing day !!!
> >> > "Every duty is holy, and devotion to duty is the highest form of worship
> >> > of
> >> > God.”
> >> > "Maybe other people will try to limit me but I don't limit myself"
> >>
> >>
> >>
> >> --
> >> Håvard Wahl Kongsgård
> >> Faculty of Medicine &
> >> Department of Mathematical Sciences
> >> NTNU
> >>
> >> http://havard.security-review.net/
> 
> 
> 
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
> 
> http://havard.security-review.net/
 		 	   		  
Mime
View raw message