hadoop-user mailing list archives

From Håvard Wahl Kongsgård <haavard.kongsga...@gmail.com>
Subject Re: Reading multiple lines from a microsoft doc in hadoop
Date Fri, 24 Aug 2012 06:07:39 GMT
It's much easier if you convert the documents to plain text first.

Use Apache Tika
http://tika.apache.org/

or some other doc parser, then process the resulting text in Hadoop.
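
Once the documents are plain text, splitting them into entries on empty lines is straightforward. Below is a minimal sketch in Python, assuming the text has already been extracted (by Tika or any other parser) and is available as a string. The function name split_entries is made up for illustration, and the sketch does not cover the extraction step or any Hadoop InputFormat.

```python
import re


def split_entries(text):
    """Split plain text into entries separated by one or more blank lines.

    Assumption: `text` is the already-extracted plain-text content of a
    doc/docx file; the extraction itself is not shown here.
    """
    # Normalize Windows and old Mac line endings to "\n".
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A separator is a newline, optional whitespace, then another newline,
    # so lines containing only spaces or tabs also count as empty.
    chunks = re.split(r"\n\s*\n", normalized)
    # Drop empty chunks produced by leading or trailing blank lines.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = (
        "first entry, line 1\n"
        "first entry, line 2\n"
        "\n"
        "second entry\n"
        "\n\n"
        "third entry\n"
    )
    for number, entry in enumerate(split_entries(sample), start=1):
        print(number, repr(entry))
```

Each returned entry keeps its internal line breaks, so a multi-line entry stays together as one record. The same logic could run inside a mapper over whole-file input, though that wiring is left out here.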


-Håvard

On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
<siddharth.tiwari@live.com> wrote:
> Hi,
> I have files in MS Word doc and docx format. They contain entries which are
> separated by an empty line. Is it possible for me to read
> one such entry, delimited by empty lines, at a time? Also, which InputFormat
> should I use to read doc and docx files? Please help.
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God."
> "Maybe other people will try to limit me but I don't limit myself"



-- 
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/
