poi-user mailing list archives

From Suba Suresh <su...@wolfram.com>
Subject Re: PowerPoint extractor
Date Thu, 06 Jul 2006 16:18:10 GMT
I tried the July 4th build. The warnings are gone. Thank you.

I used the following code to index a couple of small Excel files with
Lucene. I don't know how effective the search will be, since it is still
at the implementation stage. If there are any errors, please let me know.

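import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSDocument;

// DocumentHandler and DocumentHandlerException are not part of POI or
// Lucene; they are assumed to be on the classpath.
public class ExcelHandler implements DocumentHandler {

    private final String fileName;

    public ExcelHandler(String name) {
        fileName = name;
    }

    public Document getDocument(InputStream is) throws DocumentHandlerException {
        Document doc = new Document();
        try {
            // Wrap the raw stream in a POIFS document and read all of its bytes.
            POIFSDocument pdoc = new POIFSDocument(fileName, is);
            DocumentInputStream docis = new DocumentInputStream(pdoc);
            byte[] content = new byte[docis.available()];
            docis.read(content);
            docis.close();

            // Append the decimal value of each byte (e.g. 65 becomes "65").
            StringBuffer textBuf = new StringBuffer();
            for (int i = 0; i < content.length; i++) {
                textBuf.append(Byte.toString(content[i]));
            }
            String text = textBuf.toString();
            if (!text.equals("")) {
                // Stored with the document but not indexed (Field.Index.NO).
                doc.add(new Field("body", text, Field.Store.YES, Field.Index.NO));
            }
        } catch (IOException io) {
            throw new DocumentHandlerException("Cannot parse Excel Document", io);
        }
        return doc;
    }
}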
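
For anyone reusing the handler above, here is a minimal sketch of what
its byte loop produces, next to one possible alternative. ByteTextSketch,
decimalDump, and printableRuns are hypothetical names, not part of POI or
Lucene. The printable-run approach is only a heuristic that assumes the
interesting text is stored as plain 8-bit ASCII; it is not how POI
extracts text and it will miss strings stored in other encodings.

```java
// Hypothetical helper for illustration only; not a POI or Lucene API.
final class ByteTextSketch {

    // Mirrors the handler's loop: each byte is appended as its signed
    // decimal value, with no separator between values.
    static String decimalDump(byte[] content) {
        StringBuilder buf = new StringBuilder();
        for (byte b : content) {
            buf.append(Byte.toString(b));
        }
        return buf.toString();
    }

    // Heuristic alternative (an assumption, not the original method):
    // keep runs of at least minRun printable ASCII bytes and join the
    // runs with single spaces.
    static String printableRuns(byte[] content, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : content) {
            if (b >= 0x20 && b < 0x7f) {
                run.append((char) b);
            } else {
                flush(run, out, minRun);
            }
        }
        flush(run, out, minRun);
        return out.toString();
    }

    // Moves the current run into the output if it is long enough,
    // then resets the run.
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }
}
```

With decimalDump, the two bytes for "Hi" come out as "72105", so the
stored text is a string of digits rather than words. With printableRuns
and a minimum run of 4, short fragments are dropped and longer ASCII runs
survive as separate tokens, which may be closer to what a search index
needs.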

Separately, in another file, I am indexing the file name, file path, and
date as keywords. Hope it helps.

thanks,
suba suresh.



Nick Burch wrote:
> On Tue, 27 Jun 2006, Suba Suresh wrote:
> 
>>Thank you for all the pointers. It is a great help. I used today's
>>build. It worked fine for WordDocument. I did not try the metadata yet.
>>For PowerPoint I am getting the following from the PowerPoint extractor,
>>for just one file. Am I doing anything wrong? I didn't change my code.
> 
> 
> These errors should now have gone. Can you try a new svn checkout /
> tomorrow's SVN build?
> 
> 
> 
>>Also, since some of the Excel files were not in 97-2002 format, I used
>>POIFSFileSystem, read the file as a byte stream, and stored it as a text
>>string. I hope that is fine.
> 
> 
> If you have some code for getting some basic text out of Excel 95 files,
> we'd be interested in hosting it. I'm sure that something that outputs
> text that can be fed to Lucene would be useful for a lot of people, even
> if that's all the Excel 95 support we have.
> 
> Nick
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/



