jackrabbit-dev mailing list archives

From "Marcel Reutegger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (JCR-2365) HTML Text Extractor does not extract or index numerics
Date Thu, 29 Oct 2009 09:01:00 GMT

    [ https://issues.apache.org/jira/browse/JCR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771340#action_12771340 ]

Marcel Reutegger commented on JCR-2365:

Answering some follow-up questions that I got from Jeremy by email:

> Is my understanding correct in that once upgrading to 1.6.1, the current Text-extractors
> module will become obsolete?

No, 1.6.1 will be just a bug fix release, without changes in module dependencies. It will
contain a fix to the HTML text extractor.

> If so will any changes be required to the workspace.xml for the textFilterClasses parameter
> to enable the use of the Apache Tika extractors?

The Apache Tika-based text extractor is available only in the upcoming 2.0 release, not
in 1.6.x.
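
For reference, the textFilterClasses parameter in a 1.6.x workspace.xml usually looks
something like the sketch below. The class names are assumed from the
jackrabbit-text-extractors module and the list is shortened for illustration, so check
it against your own file rather than copying it verbatim:

```xml
<!-- Sketch of a 1.6.x search index configuration; values are illustrative. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Comma-separated extractor classes; the HTML extractor is the one fixed in 1.6.1. -->
  <param name="textFilterClasses"
         value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor"/>
</SearchIndex>
```

Since 1.6.1 only fixes the existing HTML extractor, an entry like this should not need to
change when upgrading.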

> Is it possible to enable this for JCR 1.6.0 so that HTML files have their numerics extracted
> and indexed?

It's probably easier to patch the 1.6.0 release, build jackrabbit-text-extractors from the
1.6 branch, or wait for the 1.6.1 release.

> HTML Text Extractor does not extract or index numerics
> ------------------------------------------------------
>                 Key: JCR-2365
>                 URL: https://issues.apache.org/jira/browse/JCR-2365
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: indexing, jackrabbit-text-extractors
>    Affects Versions: 1.6.0
>         Environment: Win XP-Pro; Win 2003 Enterprise 32bit
>            Reporter: Jeremy Anderson
>             Fix For: 1.6.1, 2.0.0
> Numerics such as addresses/dates/financial figures are not extracted or indexed by the
> current HTML Extractor.  These values are handled properly and searchable when done via the

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
