hadoop-mapreduce-user mailing list archives

From Carsten Schnober <schno...@ids-mannheim.de>
Subject How to use HarFileSystem?
Date Thu, 16 Aug 2012 09:34:42 GMT
Dear list,
I'm rather new to HDFS and I'm trying to figure out how to use the
HarFileSystem class. I have created a small sample HAR archive for
testing purposes that looks like this:

==============================================================
$ bin/hadoop fs -ls har:///WPD.har/00001
Found 8 items
-rw-r--r--   1 schnober supergroup       6516 2012-08-15 17:53
/WPD.har/00001/text.xml
-rw-r--r--   1 schnober supergroup        471 2012-08-15 17:53
/WPD.har/00001/metadata.xml
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53
/WPD.har/00001/xip
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53
/WPD.har/00001/connexor
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53
/WPD.har/00001/base
-rw-r--r--   1 schnober supergroup       3728 2012-08-15 17:53
/WPD.har/00001/header.xml
-rw-r--r--   1 schnober supergroup       6209 2012-08-15 17:53
/WPD.har/00001/text.txt
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53
/WPD.har/00001/tree_tagger
==============================================================
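For context, I created the archive with the standard archive tool,
roughly like this (the source and destination paths here are adapted
placeholders, not the exact ones I used):

```shell
# Pack everything under /WPD into WPD.har; -p sets the relative parent.
bin/hadoop archive -archiveName WPD.har -p /WPD /
```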

Now, I am trying to read the files contained in that archive
programmatically with the following Java code:

==============================================================
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.HarFileSystem;
import org.apache.hadoop.fs.Path;

FileSystem hdfs;
HarFileSystem harfs;
Path dir = new Path("har:///WPD.har/00001");
Configuration conf = new Configuration();
conf.addResource(new Path("/home/schnober/hadoop-1.0.3/conf/core-site.xml"));
System.out.println(conf.get("fs.default.name"));
FileStatus[] files;
FSDataInputStream in;

try {
  hdfs = FileSystem.get(conf);
  harfs = new HarFileSystem(hdfs);
  files = harfs.listStatus(dir);
  System.err.println("Reading " + files.length + " files in " + dir);

  for (FileStatus file : files) {
    if (file.isDir())
      continue;
    byte[] buffer = new byte[(int) file.getLen()];
    in = harfs.open(file.getPath());
    in.readFully(buffer);  // a bare read() may not fill the whole buffer
    System.out.println(new String(buffer));
    in.close();
  }
} catch (IOException e) {
  e.printStackTrace();
}
==============================================================
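As an aside, unrelated to the exception itself: I noticed that a single
read(buffer) call is not guaranteed to fill the whole buffer, which is why
I use readFully above. Since FSDataInputStream extends
java.io.DataInputStream, the behaviour can be illustrated with plain
java.io streams, no Hadoop required (the string content here is just made
up for the demo):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadFullyDemo {
    public static void main(String[] args) throws IOException {
        byte[] source = "some file content from the archive".getBytes("UTF-8");
        byte[] buffer = new byte[source.length];
        // readFully() keeps reading until the buffer is completely filled
        // (or throws EOFException), unlike read(buffer), which may return
        // after delivering fewer bytes than requested.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(source));
        in.readFully(buffer);
        in.close();
        System.out.println(new String(buffer, "UTF-8"));
    }
}
```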

However, a NullPointerException is thrown when harfs.listStatus(dir) is
executed. I suppose this means that 'dir' supposedly does not exist, as
stated in the Javadoc for HarFileSystem.listStatus(): "returns null, if
Path f does not exist in the FileSystem."

I've tried numerous variations, such as omitting the path within the HAR
archive, but apparently the HAR archive still cannot be read. I am,
however, able to read the plain HDFS filesystem with the same
configuration using the FileSystem class.
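One variation I have not yet fully explored is initializing the
HarFileSystem explicitly against the archive URI rather than only passing
the wrapped filesystem to the constructor; something along these lines
(untested sketch, and the URI and path are my guesses based on the
listing above):

```java
// Untested: explicitly initialize the HarFileSystem on the har:// URI
// before listing; URI and path are assumptions, not verified values.
HarFileSystem harfs = new HarFileSystem(hdfs);
harfs.initialize(new URI("har:///WPD.har"), conf);
FileStatus[] files = harfs.listStatus(new Path("/WPD.har/00001"));
```

Would that be the intended usage, or is the constructor alone supposed to
suffice?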

I assume that I'm just not aware of how to use the HarFileSystem class
correctly, but I haven't been able to find more detailed explanations or
examples; maybe a pointer to some sample code would already help me.
Thank you very much!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
