Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of siddharth.tiwari@live.com
 designates 65.55.90.87 as permitted sender)
Message-ID: <SNT142-W6434487DE9C0702FDC96A6E0BC0@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_8c0b658d-9f14-4644-85c6-243b67fbd5a6_"
From: Siddharth Tiwari <siddharth.tiwari@live.com>
To: USers Hadoop <user@hadoop.apache.org>, Bejoy Hadoop
	<bejoy.hadoop@gmail.com>, Bejoy Cloudera <bejoy_ks@yahoo.com>
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Sat, 25 Aug 2012 12:07:49 +0000
Importance: High
In-Reply-To: <SNT142-W500F7BE1C56BF7A64E54DEE0BC0@phx.gbl>
References: 
 <SNT142-W179930DB45E6334F973654E0BD0@phx.gbl>,<CAKH9108V2AWibckaSv_1DwQYcF-6_r3fvX1r9mEVnqZcyfh02Q@mail.gmail.com>,<SNT142-W156709067F77A6990726FAE0BD0@phx.gbl>,<CAKH910-8Rch+JTs-E2Qwpx4Aoh_CQf_kNfsAQQe-nOnnyP9Wfw@mail.gmail.com>,<SNT142-W237132D09D2F7AEFA94B3DE0BD0@phx.gbl>,<SNT142-W13026D4B3DCF12BB98F2A6E0BD0@phx.gbl>,<SNT142-W500F7BE1C56BF7A64E54DEE0BC0@phx.gbl>
MIME-Version: 1.0

--_8c0b658d-9f14-4644-85c6-243b67fbd5a6_
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable


Can anybody enlighten me on what could be wrong?

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God."

"Maybe other people will try to limit me but I don't limit myself"


From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Sat, 25 Aug 2012 05:35:48 +0000


Any help on the question below would be really appreciated. I am stuck on it.



From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 20:23:45 +0000


Hi,

Can anyone please help?

Thank you in advance



From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 16:22:57 +0000


Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for
reading one paragraph at a time, but when I use it, records are still read one
line at a time. Can you please suggest what changes I must make to read one
paragraph at a time, with paragraphs separated by blank lines?
Below is the code I wrote:


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 */
// FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {
    private String nullRegex = "^\\s*$";
    public String StrLine = null;

    /*public RecordReader<LongWritable, Text> getRecordReader (InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new ParaInputFormat(job, (FileSplit)genericSplit);
    }*/
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        return new LineRecordReader();
    }

    public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                // String regexMatch = in.readLine();
                Text line = new Text();
                long begin = 0;
                long length = 0;
                int num = -1;
                String boolTest = null;
                boolean match = false;
                Pattern p = Pattern.compile(nullRegex);
                // Matcher matcher = new p.matcher();
                while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0 && !(in.readLine().isEmpty())) {
                    // numLines++;
                    length += num;

                    splits.add(new FileSplit(fileName, begin, length, new String[]{}));
                }
                begin = length;
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits.toArray(new FileSplit[splits.size()]);
    }
}
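
One likely reason the job still sees single lines: createRecordReader above returns a LineRecordReader, which emits one record per line, and the two-argument getSplits does not override FileInputFormat.getSplits(JobContext), so the framework never calls it. The grouping would instead have to happen inside a custom RecordReader. The block below is only a plain-Java sketch of that grouping logic, with no Hadoop dependency; the class and method names (ParagraphReaderSketch, nextParagraph, readParagraphs) are hypothetical, and it assumes that whitespace-only lines count as blank, as the "^\\s*$" pattern above suggests.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups lines into paragraphs separated by blank lines. */
class ParagraphReaderSketch {

    /**
     * Returns the next paragraph (its lines joined by '\n'), or null at end of input.
     * A custom RecordReader would do the equivalent work once per record.
     */
    static String nextParagraph(BufferedReader in) throws IOException {
        StringBuilder para = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (para.length() > 0) {
                    return para.toString(); // a blank line closes the current paragraph
                }
                continue; // skip leading or repeated blank lines
            }
            if (para.length() > 0) {
                para.append('\n');
            }
            para.append(line);
        }
        // End of input: return the last paragraph if one was started.
        return para.length() > 0 ? para.toString() : null;
    }

    /** Convenience wrapper that collects every paragraph of an in-memory text. */
    static List<String> readParagraphs(String text) {
        List<String> out = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String p;
            while ((p = nextParagraph(in)) != null) {
                out.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\n   \nentry two\n";
        for (String p : readParagraphs(text)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

Since the documents are small (around 120 KB each), it may also be simplest to keep each file in a single split by overriding isSplitable(JobContext, Path) to return false, so that no paragraph is cut at a split boundary.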




> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
>
> Hi, maybe you should check out the old Nutch project
> http://nutch.apache.org/ (Hadoop was developed for Nutch).
> It's a web crawler and indexer, but the mailing lists hold much info on
> doc/pdf parsing, which also relates to Hadoop.
>
> I have never parsed many docx or doc files, but it should be
> straightforward. Generally, though, preprocessing is the KEY for text
> analysis! For example, replacing double line breaks (\r\n\r\n or \n\n)
> with #### is a simple trick.
>
>
> -Håvard
>
> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
> <siddharth.tiwari@live.com> wrote:
> > Hi,
> > Thank you for the suggestion. Actually I was using POI to extract text, but
> > since I now have so many documents I thought I would use Hadoop directly
> > to parse them as well. The average size of each document is around 120 KB.
> > Also, I want to read multiple lines from the text until I find a blank line.
> > I do not have any idea about how to design a custom input format and record
> > reader. Please help with some tutorial, code, or resource around it. I am
> > struggling with the issue. I will be highly grateful. Thank you so much once
> > again.
> >
> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> >> From: haavard.kongsgaard@gmail.com
> >> To: user@hadoop.apache.org
> >
> >>
> >> It's much easier if you convert the documents to text first.
> >>
> >> Use
> >> http://tika.apache.org/
> >>
> >> or some other doc parser.
> >>
> >>
> >> -Håvard
> >>
> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> >> <siddharth.tiwari@live.com> wrote:
> >> > Hi,
> >> > I have files in MS Word doc and docx format. These have entries which
> >> > are separated by an empty line. Is it possible for me to read
> >> > one entry (the lines between empty lines) at a time? Also, which
> >> > InputFormat shall I use to read doc and docx? Please help.
> >> >
> >>
> >>
> >>
> >> --
> >> Håvard Wahl Kongsgård
> >> Faculty of Medicine &
> >> Department of Mathematical Sciences
> >> NTNU
> >>
> >> http://havard.security-review.net/
>
>
>
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
>
> http://havard.security-review.net/
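
The preprocessing trick from the quoted reply (replacing double line breaks with ####) can be sketched roughly as follows. This is only an illustration: the class name BlankLineMarkerSketch and the method markParagraphs are hypothetical, and it assumes the text has already been extracted from the doc/docx files and fits in memory.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: replaces runs of blank lines with a "####" marker. */
class BlankLineMarkerSketch {

    // Two or more consecutive line breaks, each either \r\n or \n.
    private static final Pattern BLANK_RUN = Pattern.compile("(?:\\r?\\n){2,}");

    /** Returns the text with every run of consecutive line breaks replaced by "####". */
    static String markParagraphs(String text) {
        return BLANK_RUN.matcher(text).replaceAll("####");
    }

    public static void main(String[] args) {
        String text = "entry one\r\n\r\nentry two\n\nentry three, line 1\nentry three, line 2";
        System.out.println(markParagraphs(text));
    }
}
```

Single line breaks inside an entry are left untouched, so each entry can later be recovered by splitting on the marker.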

--_8c0b658d-9f14-4644-85c6-243b67fbd5a6_--