hadoop-mapreduce-user mailing list archives

From Gregory Lawrence <gr...@yahoo-inc.com>
Subject Re: How read index and data file?
Date Thu, 14 Oct 2010 19:19:03 GMT

Could you explain what you mean by index file? Generally speaking, mapper output files are
written as text files, sequence files, or some other format. What format uses an additional
index file? In my experience, examining the contents of a text or sequence file can be accomplished
by typing:

hadoop fs -text filename.txt

This should print out the contents in a human-readable format.

Greg Lawrence

On 10/14/10 11:35 AM, "Pedro Costa" <psdc1978@gmail.com> wrote:

- My question arises because I would like to read the map output data file
and I don't know how.
When I say I don't know how, I mean this: I know that the index file
contains the start offset, the raw length, and
the compressed length of the data file, and that to read the
data file I also have to pay attention to the key and value types
stored in it. I would just like to build an example that reads the
data file with the help of the index file, and I don't know how to do

- What is the difference between the
org.apache.hadoop.mapred.IFile.Reader and the
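
The index layout described above (a start offset, a raw length, and a compressed length for the data file) could be decoded roughly as follows. This is only a sketch under assumptions: it supposes that each partition's entry is three consecutive 8-byte big-endian longs and that any trailing bytes (for example a checksum) can be ignored. The names SpillIndexSketch, IndexRecord, and parse are hypothetical and not part of Hadoop; the actual on-disk layout should be checked against the Hadoop source for the version in use.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Sketch: decode per-partition entries from the raw bytes of a map-output index file. */
final class SpillIndexSketch {
    // ASSUMPTION: one entry is three 8-byte longs (start offset, raw length, compressed length).
    static final int RECORD_BYTES = 3 * Long.BYTES;

    /** One index entry describing a partition's segment in the data file. */
    static final class IndexRecord {
        final long startOffset;
        final long rawLength;
        final long compressedLength;

        IndexRecord(long startOffset, long rawLength, long compressedLength) {
            this.startOffset = startOffset;
            this.rawLength = rawLength;
            this.compressedLength = compressedLength;
        }
    }

    /** Parses the first numPartitions entries; bytes after them are ignored. */
    static List<IndexRecord> parse(byte[] indexBytes, int numPartitions) {
        if (numPartitions < 0 || (long) numPartitions * RECORD_BYTES > indexBytes.length) {
            throw new IllegalArgumentException("index too short for " + numPartitions + " partitions");
        }
        ByteBuffer buf = ByteBuffer.wrap(indexBytes); // ByteBuffer is big-endian by default
        List<IndexRecord> records = new ArrayList<>(numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            long start = buf.getLong();
            long raw = buf.getLong();
            long compressed = buf.getLong();
            records.add(new IndexRecord(start, raw, compressed));
        }
        return records;
    }
}
```

With an entry in hand, a reader would seek to startOffset in the data file and read compressedLength bytes. Decoding those bytes into records still requires knowing the key and value types and any compression codec, as noted in the message above.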


On Thu, Oct 14, 2010 at 6:21 PM, Gregory Lawrence <gregl@yahoo-inc.com> wrote:
> Pedro,
> I'm not sure I fully understand your question but if you are asking how to
> read in an index file in addition to the standard job input, you should look
> into writing your own setup function. It may look something like the
> following:
> public void setup(Context context) throws IOException, InterruptedException
> {
>      Configuration conf = context.getConfiguration();
>      initialize(conf);
>      Path path = new Path(fileName);
>      FileSystem fs = path.getFileSystem(conf);
>      BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
>      ...
> The setup function should also initialize any necessary data structures
> (e.g., hash tables). This, of course, assumes that your index file is small
> enough to fit in memory. You should also look into using the distributed
> cache option, as it should speed things up, especially when multiple
> Mapper/Reducer tasks run in sequence on the same machine.
> Regards,
> Greg Lawrence
> On 10/13/10 12:00 PM, "Pedro Costa" <psdc1978@gmail.com> wrote:
> Hi,
> I would like to create an example to read an index file and the data
> file that is produced as output in the map function. Can anyone give
> me an example, please?
> Thanks,
> --
> Pedro
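
The suggestion in the quoted reply, to have setup load a small index file into an in-memory structure such as a hash table, might look roughly like the sketch below. It is kept free of Hadoop dependencies so it can be run on its own. The tab-separated "key<TAB>value" line format and the names IndexLoaderSketch and loadIndex are assumptions made for illustration; they do not come from the thread or from Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Sketch: load a small side index into memory, as a setup method might. */
final class IndexLoaderSketch {

    /**
     * Reads lines of the ASSUMED form "key<TAB>value" into a map.
     * Blank lines are skipped; a line without a tab is reported as an error.
     * The caller owns the reader and is responsible for closing it.
     */
    static Map<String, String> loadIndex(BufferedReader reader) throws IOException {
        Map<String, String> index = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IOException("Malformed index line: " + line);
            }
            // Everything before the first tab is the key, the rest is the value.
            index.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return index;
    }
}
```

Inside setup, the BufferedReader built from fs.open(path) in the quoted snippet could be passed to loadIndex and the resulting map stored in a field for use by map or reduce calls. As the reply notes, this only makes sense when the index is small enough to fit in memory.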

