hadoop-common-user mailing list archives

From "Bejoy KS" <bejoy.had...@gmail.com>
Subject Re: Reading multiple lines from a microsoft doc in hadoop
Date Fri, 24 Aug 2012 06:09:11 GMT
Hi Siddharth

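I believe doc and docx are not plain text: doc is a binary format and docx is a zipped set
of XML files, so the stock text input formats won't read them correctly. In that case you
will probably have to build your own input format, and also your own record reader if you
want an empty line to act as the record delimiter.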
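
To make the record reader part concrete, here is a rough sketch of just the grouping logic, in plain Java with no Hadoop dependency. It assumes the text has already been extracted from the doc/docx file (for example with a library such as Apache POI, which is my assumption and not something tested here). The class name BlankLineRecords and the method split are made up for illustration. In a real job, the same loop would sit inside your custom RecordReader's nextKeyValue(), and the input format would likely need to mark files as non-splittable, since a Word file cannot be cut at arbitrary byte offsets.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups already-extracted text lines into records,
// where one or more empty (or whitespace-only) lines end a record.
public class BlankLineRecords {

    public static List<String> split(BufferedReader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                // Blank line: close the current record, if there is one.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Non-blank line: append to the current record,
                // keeping the original line breaks inside the record.
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // Emit the last record when the input does not end with a blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Each returned string would then become one value handed to the mapper, so a multi-line entry arrives as a single record instead of one line at a time.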

Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Siddharth Tiwari <siddharth.tiwari@live.com>
Date: Fri, 24 Aug 2012 05:52:13 
To: USers Hadoop<user@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Reading multiple lines from a microsoft doc in hadoop

I have files in MS Word doc and docx format. They contain entries that are separated by
an empty line. Is it possible to read one entry at a time, that is, all the lines up to the
next empty line? Also, which InputFormat should I use to read doc and docx files? Please help.


Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God."

"Maybe other people will try to limit me but I don't limit myself"