hadoop-common-user mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: Hadoop custom readers and writers
Date Sun, 07 Sep 2008 02:29:38 GMT
We did something similar with the ARC format, where each record (webpage) 
is gzipped and then appended.  It is not exactly the same, but it may 
help.  Take a look at the following classes; they are in the Nutch trunk:

org.apache.nutch.tools.arc.ArcInputFormat
org.apache.nutch.tools.arc.ArcRecordReader

The way we did it, though, was to create an InputFormat and RecordReader 
that extend FileInputFormat and read and uncompress the records on the 
fly.  Unless your files are small, I would recommend going that route.

Dennis

Amit Simgh wrote:
> Hi,
> 
> I have thousands of webpages, each represented as a serialized tree 
> object, compressed (ZLIB) together into files varying from 2.5 GB to 4.5 GB.
> I have to do some heavy text processing on these pages.
> 
> What is the best way to read/access these pages?
> 
> Method1
> ***************
> 1) Write a custom splitter that:
>    1. uncompresses the file (2.5 GB to 4 GB) and then parses it (takes 
> around 10 minutes)
>    2. splits the binary data into 10-20 parts
> 2) Implement specific readers to read a page and present it to the mapper
> 
> OR.
> 
> Method -2
> ***************
> Read the entire file without splitting: one map task per file.
> Implement specific readers to read a page and present it to the mapper.
> 
> Slight detour:
> I was browsing through the code in FileInputFormat and TextInputFormat. 
> In the getSplits method the file is broken at arbitrary byte boundaries. 
> So in the case of TextInputFormat, what happens if the last line of a 
> split is truncated (an incomplete byte sequence)?  Is the truncated data 
> lost or recovered?  Can someone explain and give pointers to where and 
> how this recovery happens in the code?
> 
> I also saw classes like Records.  What are these used for?
> 
> 
> Regds
> Amit S
