hadoop-user mailing list archives

From Håvard Wahl Kongsgård <haavard.kongsga...@gmail.com>
Subject Re: Reading multiple lines from a microsoft doc in hadoop
Date Fri, 24 Aug 2012 07:54:10 GMT
Hi, maybe you should check out the old Nutch project
http://nutch.apache.org/ (Hadoop was developed for Nutch).
It's a web crawler and indexer, but the mailing lists hold a lot of info
on doc/pdf parsing, which also relates to Hadoop.

I have never parsed many docx or doc files, but it should be
straightforward. Generally, for text analysis, preprocessing is the
KEY! For example, replacing double line breaks (\r\n\r\n or \n\n) with
#### is a simple trick.
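
A minimal sketch of that trick in Python, assuming the documents have
already been converted to plain text (for example with Tika). The names
mark_records and split_records are made up for illustration, and treating
whitespace-only lines as blank is my own assumption:

```python
import re

# Two or more consecutive line breaks (Unix or Windows style) count as one
# record boundary. Lines holding only spaces or tabs are treated as blank,
# which also drops leading indentation on the line after the gap.
BLANK_LINE = re.compile(r"(?:\r?\n[ \t]*){2,}")

def mark_records(text, marker="####"):
    """Replace each run of blank lines with a single marker."""
    return BLANK_LINE.sub(marker, text.strip())

def split_records(text, marker="####"):
    """Return the multi-line entries that were separated by blank lines."""
    return [r for r in mark_records(text, marker).split(marker) if r]
```

After this step each entry sits between two markers, so a mapper can
split on #### instead of tracking blank lines itself. This only works if
#### never occurs in the documents themselves.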


-Håvard

On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
<siddharth.tiwari@live.com> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using POI to extract text, but
> since I now have so many documents I thought I would use Hadoop directly
> to parse them as well. The average size of each document is around 120 KB. Also, I
> want to read multiple lines from the text until I find a blank line. I do
> not have any idea about how to design a custom input format and record reader.
> Please help with some tutorial, code, or resource around it. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
> again.
>
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: haavard.kongsgaard@gmail.com
>> To: user@hadoop.apache.org
>
>>
>> It's much easier if you convert the documents to text first
>>
>> use
>> http://tika.apache.org/
>>
>> or some other doc parser
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <siddharth.tiwari@live.com> wrote:
>> > Hi,
>> > I have files in MS Word doc and docx format. These have entries which
>> > are
>> > separated by an empty line. Is it possible for me to read
>> > the lines of one entry (up to the next empty line) at a time? Also, which InputFormat
>> > shall I use to read doc and docx? Please help.
>> >
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of
>> > God.”
>> > "Maybe other people will try to limit me but I don't limit myself"
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/



-- 
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/
