jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "TextExtractorExamples" by Astroknight
Date Fri, 28 Dec 2007 08:27:04 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The following page has been changed by Astroknight:
http://wiki.apache.org/jackrabbit/TextExtractorExamples

------------------------------------------------------------------------------
  == Examples for writing your own TextExtractors ==
  
  === Add Mime Types ===
- Make sure to extract from jackrabbit-jcr-server-*.jar and add "org\apache\jackrabbit\server\io\mimetypes.properties"
to your web project's classes folder, then add mime types which are defined in your text extractor
classes. 
+ Extract "org\apache\jackrabbit\server\io\mimetypes.properties" from jackrabbit-jcr-server-*.jar
and copy it to the same path under your web project's classes folder, then add the MIME types
declared in your text extractor classes to that file. 
  
  {{{
  ...
@@ -23, +23 @@

  }}}
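To illustrate what such entries are for, here is a minimal JDK-only sketch of an extension-to-MIME-type lookup. It assumes the properties file uses an "extension=mime/type" layout; the class name, the sample entries and the file names are hypothetical and are not taken from Jackrabbit itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

class MimeTypeLookupDemo {

    // Hypothetical entries, written in the assumed "extension=mime/type" layout.
    static final String SAMPLE =
            "ppt=application/vnd.ms-powerpoint\n"
          + "mht=message/rfc822\n";

    /** Parses properties text into a lookup table. */
    static Properties load(String text) {
        Properties types = new Properties();
        try {
            types.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return types;
    }

    /** Returns the MIME type registered for the file name's extension, or null. */
    static String resolve(Properties types, String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension, nothing to look up
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return types.getProperty(extension);
    }

    public static void main(String[] args) {
        Properties types = load(SAMPLE);
        System.out.println(resolve(types, "report.MHT"));
        System.out.println(resolve(types, "slides.ppt"));
        System.out.println(resolve(types, "notes.xyz"));
    }
}
```

A file whose extension has no entry resolves to null here, which is why each type handled by a custom extractor needs its own line in the file.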
  
  
- === Ms Poperpoint ===
+ === Ms Powerpoint ===
  To extract text from MS PowerPoint files, the code below uses Apache POI's HSLF component.
  
  {{{
@@ -81, +81 @@

  }
  }}} 
  
+ === Ms Mhtml ===
+ MHT files are MIME messages and use the "message/rfc822" type, so {{{MsMHTMLTextExtractor}}}
can be written like this:
+ 
+ {{{
+ import java.io.IOException;
+ import java.io.InputStream;
+ import java.io.Reader;
+ import java.io.StringReader;
+ 
+ import javax.mail.Multipart;
+ import javax.mail.Part;
+ import javax.mail.internet.MimeMessage;
+ import javax.xml.transform.Transformer;
+ import javax.xml.transform.TransformerFactory;
+ import javax.xml.transform.sax.SAXResult;
+ import javax.xml.transform.sax.SAXSource;
+ 
+ import org.apache.jackrabbit.extractor.AbstractTextExtractor;
+ // Assumption: HTMLParser is the parser shipped with jackrabbit-text-extractors.
+ import org.apache.jackrabbit.extractor.HTMLParser;
+ import org.xml.sax.InputSource;
+ import org.xml.sax.helpers.DefaultHandler;
+ 
+ public class MsMHTMLTextExtractor extends AbstractTextExtractor {
+ 
+ 	/**
+ 	 * Creates a new <code>MsMHTMLTextExtractor</code> instance.
+ 	 */
+ 	public MsMHTMLTextExtractor() {
+ 		super(new String[] { "message/rfc822" });
+ 	}
+ 
+ 	// -------------------------------------------------------< TextExtractor >
+ 
+ 	/**
+ 	 * {@inheritDoc}
+ 	 */
+ 	@SuppressWarnings("unchecked")
+ 	public Reader extractText(InputStream stream, String type, String encoding)
+ 			throws IOException {
+ 		try {
+ 			MimeMessage mm = new MimeMessage(null, stream);
+ 			StringBuffer sb = new StringBuffer();
+ 
+ 			getMHTMLContent(mm, sb);
+ 
+ 			return new StringReader(sb.toString());
+ 		} catch (Exception e) {
+ 			// A message that cannot be parsed is indexed as empty text.
+ 			return new StringReader("");
+ 		} finally {
+ 			stream.close();
+ 		}
+ 	}
+ 
+ 	/**
+ 	 * Recursively appends the text of the given part and of all its nested parts to sb.
+ 	 */
+ 	public void getMHTMLContent(Part part, StringBuffer sb) throws Exception {
+ 		
+ 		if (part.isMimeType("text/plain")) {
+ 			sb.append((String) part.getContent());
+ 		} else if (part.isMimeType("text/html")) {
+ 
+ 			TransformerFactory factory = TransformerFactory.newInstance();
+ 			Transformer transformer = factory.newTransformer();
+ 			HTMLParser parser = new HTMLParser();
+ 			SAXResult result = new SAXResult(new DefaultHandler());
+ 
+ 			SAXSource source = new SAXSource(parser, new InputSource(part
+ 					.getInputStream()));
+ 			transformer.transform(source, result);
+ 
+ 			sb.append(parser.getContents());
+ 
+ 		} else if (part.isMimeType("multipart/*")) {
+ 			Multipart multipart = (Multipart) part.getContent();
+ 			int counts = multipart.getCount();
+ 			for (int i = 0; i < counts; i++) {
+ 				getMHTMLContent(multipart.getBodyPart(i), sb);
+ 			}
+ 		} else if (part.isMimeType("message/rfc822")) {
+ 			getMHTMLContent((Part) part.getContent(), sb);
+ 		} else {
+ 			// Other content types (e.g. images) carry no extractable text and are skipped.
+ 		}
+ 	}
+ }
+ }}}
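To show the structure that the recursive method above walks, here is a minimal JDK-only sketch that splits a tiny hand-written MHT sample on its MIME boundary and lists the Content-Type of each part. The class name, the sample message and the "demo-boundary" string are made up for illustration. The sketch ignores transfer encodings, folded headers and nested multiparts, so it is no substitute for JavaMail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class MhtStructureDemo {

    // A hand-written, hypothetical MHT file: one HTML part and one image part.
    static final String SAMPLE = String.join("\r\n",
            "MIME-Version: 1.0",
            "Content-Type: multipart/related; boundary=\"demo-boundary\"",
            "",
            "--demo-boundary",
            "Content-Type: text/html",
            "",
            "<html><body>Hello</body></html>",
            "--demo-boundary",
            "Content-Type: image/gif",
            "Content-Transfer-Encoding: base64",
            "",
            "R0lGODlhAQABAAAAACw=",
            "--demo-boundary--",
            "");

    /** Returns the Content-Type header value of every part delimited by the boundary. */
    static List<String> partTypes(String mht, String boundary) {
        List<String> types = new ArrayList<>();
        // Every part starts after a "--boundary" delimiter line.
        String[] chunks = mht.split("--" + Pattern.quote(boundary));
        // Chunk 0 is the top-level header block, so start at 1.
        for (int i = 1; i < chunks.length; i++) {
            for (String line : chunks[i].split("\r?\n")) {
                String prefix = "content-type:";
                if (line.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                    types.add(line.substring(prefix.length()).trim());
                    break; // only the first Content-Type line of a part counts
                }
            }
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(partTypes(SAMPLE, "demo-boundary"));
    }
}
```

Only the text/html part of such a file carries indexable text, which matches the branches of {{{getMHTMLContent}}}: text parts are appended, multiparts are descended into, and everything else is skipped.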
+ 
+ === Ms Outlook ===
+ MSG files are used by MS Outlook to save email messages, and many organizations rely on
them. Apache POI's HSMF component aims to support reading and writing MSG files, but so far
it can only read the plain text through the {{{MAPIMessage}}} utility class. If you need to
extract attachments from MSG files, the POI Filesystem (POIFS) may help. 
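POIFS is relevant because MSG files are stored as OLE2 compound documents, the container format POIFS reads. Here is a minimal JDK-only sketch, with a hypothetical class and method name, that checks whether a byte array starts with the OLE2 signature. An extractor could run such a check before handing data to POI. It does not read any message content.

```java
import java.util.Arrays;

class Ole2SignatureCheck {

    // The eight-byte signature at the start of every OLE2 compound document.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    /** Returns true if the data begins with the OLE2 signature. */
    static boolean looksLikeOle2(byte[] data) {
        if (data == null || data.length < OLE2_MAGIC.length) {
            return false; // too short to hold the signature
        }
        byte[] head = Arrays.copyOf(data, OLE2_MAGIC.length);
        return Arrays.equals(head, OLE2_MAGIC);
    }

    public static void main(String[] args) {
        byte[] fakeMsg = Arrays.copyOf(OLE2_MAGIC, 16); // signature plus zero padding
        System.out.println(looksLikeOle2(fakeMsg));
        System.out.println(looksLikeOle2("plain text".getBytes()));
    }
}
```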
+ 
+ Todo: add an MSG example here.
+ 
