jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "TextExtractorExamples" by Astroknight
Date Fri, 28 Dec 2007 07:52:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The following page has been changed by Astroknight:
http://wiki.apache.org/jackrabbit/TextExtractorExamples

New page:
##language:en
== Examples for writing your own TextExtractors ==

=== Add MIME Types ===
Extract "org\apache\jackrabbit\server\io\mimetypes.properties" from jackrabbit-jcr-server-*.jar and add it
to your web project's classes folder, then add the MIME types that your text extractor classes
declare, for example:

{{{
...
mht=message/rfc822
msg=application/msoutlook
csv=text/plain
}}}

=== Obtain the MIME Type ===
To obtain the MIME type from a file path, use {{{MimeResolver}}} when possible. Keep a single shared
instance, because its constructor reads the mimetypes.properties file.

{{{
public static MimeResolver mimeResolver = new MimeResolver();
...
String contentType = mimeResolver.getMimeType(filePath);
}}}
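
To make the "one instance" advice concrete, here is a minimal sketch of an extension-to-type lookup backed by a properties file. The class {{{SimpleMimeLookup}}} and its default-type parameter are hypothetical and are not Jackrabbit's {{{MimeResolver}}}; the sketch only assumes that the file maps extensions to MIME types as in the entries shown earlier. It needs Java 8 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

/**
 * Hypothetical stand-in, NOT Jackrabbit's MimeResolver: an extension-to-type
 * lookup whose constructor parses the properties text exactly once.
 */
final class SimpleMimeLookup {

    private final Properties types = new Properties();
    private final String defaultType;

    /** Parses "extension=mime/type" lines; this is the one-time cost. */
    SimpleMimeLookup(String propertiesText, String defaultType) {
        try {
            types.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.defaultType = defaultType;
    }

    /** Returns the type mapped to the path's extension, or the default type. */
    String getMimeType(String filePath) {
        int dot = filePath.lastIndexOf('.');
        if (dot < 0 || dot == filePath.length() - 1) {
            return defaultType; // no extension at all
        }
        String extension = filePath.substring(dot + 1).toLowerCase(Locale.ROOT);
        return types.getProperty(extension, defaultType);
    }
}
```

Because the parsing happens in the constructor, each lookup is only a map access, which is why sharing one instance is cheaper than creating a resolver per request.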


=== MS PowerPoint ===
The code below may help with text extraction from MS PowerPoint files; it uses
Apache POI's HSLF component.

{{{
// Imports assume Jackrabbit 1.x text extractors, the Apache POI HSLF scratchpad
// API of that era, and Commons Lang 2.x for StringUtils; adjust to your versions.
import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

import org.apache.commons.lang.StringUtils;
import org.apache.jackrabbit.extractor.AbstractTextExtractor;
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.SlideShow;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;

/**
 * Text extractor for Microsoft PowerPoint presentations.
 */
public class MsPowerPointTextExtractor extends AbstractTextExtractor {

    /**
     * Force loading of dependent class.
     */
    static {
        POIFSReader.class.getName();
    }

    /**
     * Creates a new <code>MsPowerPointTextExtractor</code> instance.
     */
    public MsPowerPointTextExtractor() {
        super(new String[]{"application/vnd.ms-powerpoint",
                           "application/mspowerpoint"});
    }

    //-------------------------------------------------------< TextExtractor >

    /**
     * {@inheritDoc}
     */
    public Reader extractText(InputStream stream,
                              String type,
                              String encoding) throws IOException {
        try {
            CharArrayWriter writer = new CharArrayWriter();
            SlideShow slideShow = new SlideShow(new HSLFSlideShow(stream));
            Slide[] slides = slideShow.getSlides();
            for (int i = 0; i < slides.length; i++) {
                Slide slide = slides[i];
                // Optional: include the slide title in the extracted text.
                if (StringUtils.isNotEmpty(slide.getTitle())) {
                    writer.append(slide.getTitle() + " ");
                }
                // Append the text of every text run on the slide.
                TextRun[] textRuns = slide.getTextRuns();
                for (int j = 0; j < textRuns.length; j++) {
                    writer.append(textRuns[j].getText() + " ");
                }
            }
            return new CharArrayReader(writer.toCharArray());
        } finally {
            // Always close the input stream, even if extraction fails.
            stream.close();
        }
    }
}
}}} 
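
When trying out an extractor like the one above, it can help to turn the returned {{{Reader}}} into a plain string. The following is a minimal sketch of such a helper; {{{ExtractedTextReader}}} and {{{readAll}}} are hypothetical names and not part of Jackrabbit. It needs Java 8 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drains a Reader, such as one returned by extractText. */
final class ExtractedTextReader {

    private ExtractedTextReader() {
    }

    /** Reads all characters from the reader, then closes it. */
    static String readAll(Reader reader) {
        try {
            try {
                StringBuilder text = new StringBuilder();
                char[] buffer = new char[4096];
                int count;
                while ((count = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, count);
                }
                return text.toString();
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Assuming the extractor class shown earlier, a call could look like {{{ExtractedTextReader.readAll(new MsPowerPointTextExtractor().extractText(stream, "application/vnd.ms-powerpoint", null))}}}; the encoding argument is not used by that extractor.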
