lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Problem indexing Word Documents
Date Mon, 26 Nov 2007 18:38:16 GMT
I would ask on the POI mailing list.  This doesn't look to be a
problem with Lucene: the exception is thrown inside POI's
WordExtractor constructor, before any Lucene code runs.

-Grant

On Nov 26, 2007, at 1:17 PM, chris.b wrote:

>
> Okay, so I'm very new to Lucene, so it may be my mistake. I can get it
> to index .txt files, but when I try to index Word documents (using POI),
> the program starts running and, when it reaches a .doc file, I get the
> following error:
>
> Exception in thread "main" org.apache.poi.hpsf.IllegalPropertySetDataException: The property set claims to have a size of 16 bytes. However, it exceeds 16 bytes.
> 	at org.apache.poi.hpsf.Section.<init>(Section.java:255)
> 	at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:454)
> 	at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:249)
> 	at org.apache.poi.hpsf.PropertySetFactory.create(PropertySetFactory.java:61)
> 	at org.apache.poi.POIDocument.getPropertySet(POIDocument.java:92)
> 	at org.apache.poi.POIDocument.readProperties(POIDocument.java:69)
> 	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:147)
> 	at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:56)
> 	at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:48)
> 	at Indexer.indexFile(Indexer.java:76)
> 	at Indexer.indexDirectory(Indexer.java:57)
> 	at Indexer.index(Indexer.java:38)
> 	at Indexer.main(Indexer.java:20)
>
> and my code is as follows:
>
> 	private static void indexFile(IndexWriter writer, File f) throws IOException {
> 		if (f.isHidden() || !f.exists() || !f.canRead()) {
> 			return;
> 		}
>
> 		// Portuguese: "Adding <path> to the index."
> 		System.out.println("A acrescentar " + f.getCanonicalPath() + " ao indice.");
>
> 		Document doc = new Document();
> 		
> 		// For .doc files
> 		if (f.getName().endsWith(".doc")){
> 			FileInputStream docfin = new FileInputStream(f.getAbsolutePath());
> 			WordExtractor docextractor = new WordExtractor(docfin);
> 			String content = docextractor.getText();
> 			doc.add(new Field("contents", content, Field.Store.NO, Field.Index.TOKENIZED));
> 		} // For .txt files
> 		else if (f.getName().endsWith(".txt")) {
> 			doc.add(new Field("contents", new FileReader(f)));
> 		}
> 		
> 		doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.TOKENIZED));
> 		writer.addDocument(doc);
> 	}
>
> (I think I included everything that's necessary.)
> Thanks in advance for any help.
> -- 
> View this message in context: http://www.nabble.com/Problem-indexing-Word-Documents-tf4876643.html#a13954702
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
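
A minimal sketch, assuming the goal is only to keep the indexing run
going when one .doc file cannot be parsed; it does not fix the
underlying POI error. SafeExtraction and TextExtractor are hypothetical
names, not part of Lucene or POI. The interface stands in for whatever
performs the extraction (for example the WordExtractor calls in the
quoted code). It relies on the exception in the trace being unchecked,
which follows from indexFile compiling while declaring only IOException.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: isolates a failing extractor so one bad file does not abort the run. */
final class SafeExtraction {

    /** Hypothetical stand-in for a text extractor such as the POI calls in the quoted code. */
    interface TextExtractor {
        String extract(String path) throws IOException;
    }

    // Paths whose extraction failed, kept so they can be reported after the run.
    private final List<String> skipped = new ArrayList<>();

    /**
     * Returns the extracted text, or an empty Optional if the extractor
     * threw an IOException or any unchecked exception for this file.
     */
    Optional<String> tryExtract(TextExtractor extractor, String path) {
        try {
            return Optional.of(extractor.extract(path));
        } catch (IOException | RuntimeException e) {
            skipped.add(path);
            System.err.println("Skipping " + path + ": " + e);
            return Optional.empty();
        }
    }

    /** Paths that were skipped, in the order they failed. */
    List<String> skipped() {
        return skipped;
    }
}
```

In indexFile, the .doc branch could pass a lambda that wraps the two
WordExtractor lines and add the "contents" field only when a value
comes back. The skipped list is then a handy set of sample files to
attach when asking on the POI list.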

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
