From: Siddharth Tiwari
To: USers Hadoop, Bejoy Hadoop, Bejoy Cloudera
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Sat, 25 Aug 2012 12:07:49 +0000

Can anybody enlighten me on what could be wrong?

*------------------------*
Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God."
"Maybe other people will try to limit me but I don't limit myself"

From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Sat, 25 Aug 2012 05:35:48 +0000

Any help on the below would be really appreciated. I am stuck with it.

From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 20:23:45 +0000

Hi,

Can anyone please help?

Thank you in advance.

From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 16:22:57 +0000

Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for
reading one paragraph at a time. But when I use it, records are still read one
line at a time. Can you please suggest what changes I must make to read one
paragraph at a time, with paragraphs separated by blank lines?
Below is the code I wrote:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 *
 */
//FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable,Text> {
    private String nullRegex = "^\\s*$" ;
    public String StrLine = null;

    /*public RecordReader<LongWritable, Text> getRecordReader (InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParaInputFormat(job, (FileSplit)genericSplit);
    }*/

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        return new LineRecordReader();
    }

    public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                // String regexMatch =in.readLine();
                Text line = new Text();
                long begin = 0;
                long length = 0;
                int num = -1;
                String boolTest = null;
                boolean match = false;
                Pattern p = Pattern.compile(nullRegex);
                // Matcher matcher = new p.matcher();
                while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0 && ! ( in.readLine().isEmpty())){
                    // numLines++;
                    length += num;

                    splits.add(new FileSplit(fileName, begin, length, new String[]{}));
                }
                begin=length;
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits.toArray(new FileSplit[splits.size()]);
    }
}

> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
>
> Hi, maybe you should check out the old Nutch project
> http://nutch.apache.org/ (Hadoop was developed for Nutch).
> It's a web crawler and indexer, but the mailing lists hold much info on
> doc/pdf parsing which also relates to Hadoop.
>
> I have never parsed many docx or doc files, but it should be
> straightforward. But generally for text analysis, preprocessing is the
> KEY! (For example, replacing double line breaks \r\n\r\n (or \n\n) with
> #### is a simple trick.)
>
> -Håvard
>
> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
> <siddharth.tiwari@live.com> wrote:
> > Hi,
> > Thank you for the suggestion.
> > Actually I was using POI to extract text, but
> > since I now have so many documents I thought I would use Hadoop directly
> > to parse them as well. The average size of each document is around 120 KB.
> > Also, I want to read multiple lines from the text until I find a blank
> > line. I do not have any idea about how to design a custom input format and
> > record reader. Please help with some tutorial, code or resource around it.
> > I am struggling with the issue. I will be highly grateful. Thank you so
> > much once again
> >
> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> >> From: haavard.kongsgaard@gmail.com
> >> To: user@hadoop.apache.org
> >>
> >> It's much easier if you convert the documents to text first
> >>
> >> use
> >> http://tika.apache.org/
> >>
> >> or some other doc parser
> >>
> >> -Håvard
> >>
> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> >> <siddharth.tiwari@live.com> wrote:
> >> > hi,
> >> > I have files in MS Word doc and docx format. These have entries which
> >> > are separated by an empty line. Is it possible for me to read
> >> > one entry (the lines between two empty lines) at a time? Also, which
> >> > InputFormat shall I use to read doc and docx? Please help.
> >>
> >> --
> >> Håvard Wahl Kongsgård
> >> Faculty of Medicine &
> >> Department of Mathematical Sciences
> >> NTNU
> >>
> >> http://havard.security-review.net/
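On the "records are still read one line at a time" question in the 24 Aug message: as far as I can tell, the posted createRecordReader() returns a LineRecordReader, which emits one record per line regardless of how the splits are computed, and the two-argument getSplits(JobContext, Configuration) is an overload rather than an override, so the framework would not call it. The grouping itself would have to happen in a custom RecordReader. Below is a minimal sketch of only that grouping logic in plain Java, with no Hadoop dependency. ParagraphReader, nextParagraph() and readAll() are hypothetical names made up for illustration; in a real job the loop would presumably live inside nextKeyValue() of a RecordReader<LongWritable, Text>, which is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups consecutive non-blank lines into one record. */
public class ParagraphReader {
    private final BufferedReader in;

    public ParagraphReader(BufferedReader in) {
        this.in = in;
    }

    /** Returns the next paragraph, or null once the input is exhausted. */
    public String nextParagraph() throws IOException {
        StringBuilder para = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (para.length() > 0) {
                    return para.toString(); // a blank line closes the paragraph
                }
                continue; // skip leading or repeated blank lines
            }
            if (para.length() > 0) {
                para.append('\n'); // keep line breaks inside the paragraph
            }
            para.append(line);
        }
        // End of input: emit the last paragraph if there is one.
        return para.length() > 0 ? para.toString() : null;
    }

    /** Convenience wrapper for experimenting with in-memory text. */
    public static List<String> readAll(String text) {
        ParagraphReader reader =
                new ParagraphReader(new BufferedReader(new StringReader(text)));
        List<String> out = new ArrayList<>();
        try {
            String p;
            while ((p = reader.nextParagraph()) != null) {
                out.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\n   \nentry two\n";
        for (String p : readAll(text)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

With documents of around 120 KB each, marking the files as non-splittable (so one reader sees a whole file) would avoid paragraphs being cut at split boundaries; that is a design assumption, not something the thread states. Some Hadoop releases also let TextInputFormat split on a custom delimiter through the textinputformat.record.delimiter configuration property; whether that is available depends on the version in use.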

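Håvard's preprocessing trick from the thread (replace double line breaks \r\n\r\n or \n\n with ####) could look roughly like the sketch below, applied to text that has already been extracted from the doc/docx files. BlankLineMarker and mark() are hypothetical names, and the pattern also treats lines containing only spaces or tabs as blank, which goes slightly beyond the literal \r\n\r\n in the mail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces runs of blank lines with a visible marker. */
public class BlankLineMarker {
    // One line break followed by one or more (optionally whitespace-only) empty lines.
    private static final Pattern BLANK_RUN =
            Pattern.compile("\\r?\\n(?:[ \\t]*\\r?\\n)+");

    public static String mark(String text, String marker) {
        // quoteReplacement keeps '$' or '\' in the marker from being read as group syntax.
        return BLANK_RUN.matcher(text).replaceAll(Matcher.quoteReplacement(marker));
    }

    public static void main(String[] args) {
        String text = "entry one\nstill entry one\n\nentry two\n\nentry three";
        System.out.println(mark(text, "####"));
    }
}
```

Single line breaks inside an entry are left untouched, so a later step can split on the marker to recover one entry per record.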