jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (JCR-1894) Word doc extraction problem
Date Mon, 29 Nov 2010 14:13:40 GMT

     [ https://issues.apache.org/jira/browse/JCR-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved JCR-1894.

    Resolution: Incomplete

Without an example document there's little we can do about this. See the Tika project (http://tika.apache.org/)
for the text extraction functionality Jackrabbit nowadays uses, and file an issue at https://issues.apache.org/jira/browse/TIKA
if the problem still occurs with Tika.

> Word doc extraction problem
> ---------------------------
>                 Key: JCR-1894
>                 URL: https://issues.apache.org/jira/browse/JCR-1894
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: jackrabbit-text-extractors
>    Affects Versions: core 1.4.3
>         Environment: OS: Windows 2003 sp2 My-eclipse6.0 / tomcat 5.5 and Athelon500+
>            Reporter: Rajesh Upadhyay
> Hi,
> I have a .doc file that contains data inside a table, and I want to parse the table
> to get its values. Normal parsing (I mean using a string tokenizer) does not work for the
> table because it yields unwanted special characters. So I want to convert the .doc to a
> .txt file, since only then is it easy to split the values. But I can't make it work!
> Can anyone please tell me how to parse the values of an MS Word table?
> We need to know how to index a .doc file while excluding the special characters.
> When we show the excerpt, these special characters make it unreadable.
> Thanks in advance.
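
As an illustration of the clean-up step the report asks for, here is a minimal Java sketch that splits already-extracted text into cell values and strips control characters before the text is indexed or shown in an excerpt. It is not the Jackrabbit or Tika implementation. It assumes the unwanted characters are control characters and that U+0007 marks the end of a table cell in the raw text, which the report does not confirm; the class and method names (TableTextCleaner, splitCells, stripControlChars) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TableTextCleaner {

    // Assumption: U+0007 marks the end of a table cell in the raw extracted text.
    private static final char CELL_MARK = '\u0007';

    /** Splits raw text on the assumed cell mark and returns the non-empty, cleaned cells. */
    public static List<String> splitCells(String raw) {
        List<String> cells = new ArrayList<>();
        for (String part : raw.split(String.valueOf(CELL_MARK))) {
            String cell = stripControlChars(part);
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    /**
     * Replaces every control character except tab, newline, and carriage return
     * with a space, then collapses runs of spaces and tabs and trims the result.
     */
    public static String stripControlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = c == '\t' || c == '\n' || c == '\r' || !Character.isISOControl(c);
            sb.append(keep ? c : ' ');
        }
        return sb.toString().replaceAll("[ \\t]+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample input: two rows of two cells each.
        String raw = "Name\u0007Age\u0007\u0007Alice\u0007 30\u0007\u0007";
        System.out.println(splitCells(raw));  // prints [Name, Age, Alice, 30]
        System.out.println(stripControlChars("Total:\u0007 42\u0001 items"));  // prints Total: 42 items
    }
}
```

If the extractor in use emits different markers, only CELL_MARK and the keep condition would need to change. stripControlChars alone is enough when the goal is a readable excerpt rather than individual cell values.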

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.