hadoop-mapreduce-user mailing list archives

From Siddharth Tiwari <siddharth.tiw...@live.com>
Subject RE: Reading multiple lines from a microsoft doc in hadoop
Date Fri, 24 Aug 2012 07:30:22 GMT
Hi,
Thank you for the suggestion. Actually, I was using POI to extract the text, but since I now
have so many documents, I thought I would use Hadoop directly to parse them as well. The average
size of each document is around 120 KB. I also want to read multiple lines from the text until
I find a blank line. I do not have any idea about how to design a custom input format and record
reader. Please help with a tutorial, code, or a resource on it. I am struggling
with this issue. I will be highly grateful. Thank you so much once again.
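
For reference, a minimal sketch of the "read lines until a blank line" grouping described above, in plain Java with no Hadoop dependencies. It assumes the documents have already been converted to plain text (for example with POI or Tika). The class and method names are hypothetical and chosen only for illustration. In a Hadoop job, a loop like this would typically live inside a custom RecordReader's nextKeyValue() method, and with documents of around 120 KB the input format would probably be declared non-splittable so that each file is read as a whole.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups consecutive non-blank lines into one record.
class BlankLineRecords {

    // Reads lines until a blank line or end of input.
    // Returns null when no further record is available.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (record.length() > 0) {
                    return record.toString(); // blank line closes the record
                }
                continue; // skip leading or repeated blank lines
            }
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
        }
        // End of input: emit the last record if it has any content.
        return record.length() > 0 ? record.toString() : null;
    }

    // Convenience wrapper that collects every record from a text string.
    static List<String> readAll(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\nentry two\n";
        for (String record : readAll(text)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Each returned string holds one multi-line entry, which a mapper could then receive as a single value.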

> Date: Fri, 24 Aug 2012 08:07:39 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
> 
> It's much easier if you convert the documents to text first
> 
> use
> http://tika.apache.org/
> 
> or some other doc parser
> 
> 
> -Håvard
> 
> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> <siddharth.tiwari@live.com> wrote:
> > hi,
> > I have files in MS Word doc and docx format. These have entries which are
> > separated by an empty line. Is it possible for me to read
> > these entries, separated by empty lines, one at a time? Also, which InputFormat
> > should I use to read doc and docx files? Please help.
> >
> > *------------------------*
> > Cheers !!!
> > Siddharth Tiwari
> > Have a refreshing day !!!
> > "Every duty is holy, and devotion to duty is the highest form of worship of
> > God.”
> > "Maybe other people will try to limit me but I don't limit myself"
> 
> 
> 
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
> 
> http://havard.security-review.net/
 		 	   		  