jackrabbit-dev mailing list archives

From "Marcel Reutegger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (JCR-281) textfilters module patch: Support for text extraction for HTML,XML and RTF files
Date Wed, 30 Nov 2005 08:43:31 GMT
    [ http://issues.apache.org/jira/browse/JCR-281?page=comments#action_12358893 ] 

Marcel Reutegger commented on JCR-281:
--------------------------------------

Martin, I quickly checked the web, and there are some alternatives that you might want to
consider for parsing HTML:

- javax.swing.text.html.parser.Parser (part of the 1.4 JDK)
- http://www.apache.org/~andyc/neko/doc/html/ (apache license)
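
For illustration only, here is a minimal sketch of how text extraction with the first option
might look. It is not taken from the patch. It assumes the JDK's public entry point
javax.swing.text.html.parser.ParserDelegator (which drives the Parser class internally) together
with an HTMLEditorKit.ParserCallback; the class name HtmlTextSketch and the method extractText
are hypothetical, and the code uses current Java syntax rather than 1.4.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Jackrabbit or the attached patch.
public class HtmlTextSketch {

    // Collects the text chunks reported by the JDK's built-in HTML parser.
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so words from adjacent elements do not run together.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // The third argument tells the parser to ignore any charset declared in the document,
        // since the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

The callback approach avoids an extra dependency, but the Swing parser only understands an
HTML 3.2 DTD, so how well it copes with real-world markup would need to be checked.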


> textfilters module patch: Support for text extraction for HTML,XML and RTF files
> --------------------------------------------------------------------------------
>
>          Key: JCR-281
>          URL: http://issues.apache.org/jira/browse/JCR-281
>      Project: Jackrabbit
>         Type: Improvement
>   Components: query
>     Reporter: Martin Perez
>  Attachments: patch.diff
>
> This patch adds text extraction support for XML, RTF and HTML files.
> The only dependency is the htmlparser library, which handles HTML text extraction.


