hadoop-hdfs-user mailing list archives

From Biju Balakrishnan <bijub...@gmail.com>
Subject Re: Reading multiple lines from a microsoft doc in hadoop
Date Fri, 24 Aug 2012 06:09:52 GMT

> I have files in MS Word .doc and .docx format. Their entries are
> separated by empty lines. Is it possible to read one entry (the lines
> between two empty lines) at a time? Also, which InputFormat should I
> use to read .doc and .docx files? Please help.
As far as I know, none of the built-in input formats supports .doc or .docx
files (note: this is only as far as I know).
You might need to write a custom input format to support doc[x] files.
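
The core of such a custom input format would be text extraction. Below is a
minimal sketch of that step for .docx only, assuming the standard package
layout (a zip archive whose main part is word/document.xml, with paragraphs
as w:p elements and text runs as w:t elements). The class name
DocxParagraphs and its read method are hypothetical, nothing here is a
Hadoop API, and the binary .doc format would need a library such as Apache
POI, which is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop: lists the paragraphs of a .docx file.
final class DocxParagraphs {
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every paragraph in order; empty paragraphs are "". */
    static List<String> read(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document xml = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(buf.toByteArray()));

                List<String> paragraphs = new ArrayList<>();
                NodeList ps = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < ps.getLength(); i++) {
                    // Concatenate all text runs (w:t) inside this paragraph.
                    NodeList ts = ((Element) ps.item(i)).getElementsByTagNameNS(W_NS, "t");
                    StringBuilder text = new StringBuilder();
                    for (int j = 0; j < ts.getLength(); j++) {
                        text.append(ts.item(j).getTextContent());
                    }
                    paragraphs.add(text.toString());
                }
                return paragraphs;
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String paragraph : read(in)) {
                System.out.println(paragraph);
            }
        }
    }
}
```

This ignores tabs, manual line breaks, tables, headers and footers, so it is
only a starting point. An empty paragraph comes back as an empty string,
which is what marks the boundary between two entries.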

It is better to convert the files to plain text before processing them with Hadoop.
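
Once the files are plain text, grouping the lines into entries separated by
empty lines is straightforward. The following is a rough sketch of that
grouping in plain Java, assuming the whole text fits in memory; the class
name BlankLineEntries and its split method are made up for illustration and
are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups lines into entries separated by blank lines.
final class BlankLineEntries {

    /** Splits text into entries; one or more blank lines end an entry. */
    static List<String> split(String text) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // \R matches any line break (\n, \r\n, \r), so Windows files work too.
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                // A blank line closes the entry being built, if there is one.
                if (current.length() > 0) {
                    entries.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        if (current.length() > 0) {
            entries.add(current.toString());
        }
        return entries;
    }

    public static void main(String[] args) {
        for (String entry : split("name: a\nvalue: 1\n\nname: b\nvalue: 2\n")) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

The same loop could run inside a mapper or a record reader, but wiring it
into a job depends on the Hadoop version in use.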

