lucene-java-user mailing list archives

From "chris.b" <omelhornomedomu...@gmail.com>
Subject Problem indexing Word Documents
Date Mon, 26 Nov 2007 18:17:00 GMT

I'm very new to Lucene, so this may well be my mistake. I can get it to
index .txt files, but when I try to index Word documents (using POI), the
program runs until it reaches a .doc file and then fails with the following
error:

Exception in thread "main" org.apache.poi.hpsf.IllegalPropertySetDataException: The property set claims to have a size of 16 bytes. However, it exceeds 16 bytes.
	at org.apache.poi.hpsf.Section.<init>(Section.java:255)
	at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:454)
	at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:249)
	at org.apache.poi.hpsf.PropertySetFactory.create(PropertySetFactory.java:61)
	at org.apache.poi.POIDocument.getPropertySet(POIDocument.java:92)
	at org.apache.poi.POIDocument.readProperties(POIDocument.java:69)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:147)
	at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:56)
	at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:48)
	at Indexer.indexFile(Indexer.java:76)
	at Indexer.indexDirectory(Indexer.java:57)
	at Indexer.index(Indexer.java:38)
	at Indexer.main(Indexer.java:20)

My code is as follows:

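import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.poi.hwpf.extractor.WordExtractor;

	private static void indexFile(IndexWriter writer, File f) throws IOException {
		// Skip files that are hidden, missing, or unreadable.
		if (f.isHidden() || !f.exists() || !f.canRead()) {
			return;
		}

		System.out.println("Adding " + f.getCanonicalPath() + " to the index.");

		Document doc = new Document();

		// For .doc files: extract the text with POI's WordExtractor.
		if (f.getName().endsWith(".doc")) {
			FileInputStream docfin = new FileInputStream(f);
			try {
				WordExtractor docextractor = new WordExtractor(docfin);
				String content = docextractor.getText();
				doc.add(new Field("contents", content, Field.Store.NO, Field.Index.TOKENIZED));
			} finally {
				docfin.close(); // release the file handle even if extraction fails
			}
		} // For .txt files
		else if (f.getName().endsWith(".txt")) {
			doc.add(new Field("contents", new FileReader(f)));
		}

		doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.TOKENIZED));
		writer.addDocument(doc);
	}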
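
The trace shows the exception leaving the WordExtractor constructor and
propagating uncaught through indexFile, indexDirectory and index up to main,
so a single problematic .doc ends the whole run. Since indexFile declares only
IOException, the exception must be an unchecked one. As a minimal sketch of how
one failing file could be skipped while the cause is investigated (it does not
fix the underlying POI error), the self-contained example below uses a
hypothetical TextExtractor interface standing in for WordExtractor and a
hypothetical SkipBadFiles class; neither name comes from the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipBadFiles {

    /** Hypothetical stand-in for the extraction step (WordExtractor in the post). */
    public interface TextExtractor {
        String extract(String fileName);
    }

    /**
     * Runs the extractor on every file name and returns the names that were
     * extracted successfully. A file whose extraction throws an unchecked
     * exception is reported on stderr and skipped instead of ending the run.
     */
    public static List<String> extractAll(List<String> fileNames, TextExtractor extractor) {
        List<String> extracted = new ArrayList<String>();
        for (String name : fileNames) {
            try {
                extractor.extract(name);
                extracted.add(name);
            } catch (RuntimeException e) {
                // Only this file is affected; the loop moves on to the next one.
                System.err.println("Skipping " + name + ": " + e);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        // Fake extractor that fails for one file, to imitate a problematic .doc.
        TextExtractor fake = name -> {
            if (name.startsWith("bad")) {
                throw new IllegalStateException("simulated extraction failure");
            }
            return "text of " + name;
        };
        List<String> ok = extractAll(Arrays.asList("a.doc", "bad.doc", "c.doc"), fake);
        System.out.println("Extracted: " + ok);
    }
}
```

In the indexer itself, the equivalent would presumably be a try/catch around
the WordExtractor call (or around the indexFile call in indexDirectory) that
logs the file name, which would also show whether every .doc fails or only
some of them.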

(I think I've included everything that's necessary.)
Thanks in advance for any help.