tika-dev mailing list archives

From "Dave Meikle (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together
Date Sat, 06 Sep 2008 11:24:44 GMT

    [ https://issues.apache.org/jira/browse/TIKA-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628859#action_12628859 ]

Dave Meikle commented on TIKA-114:

OK, processLineSeparator and processWordSeparator are not available in PDFBox-0.7.3, which
is our current dependency. They are, however, available on SVN HEAD of the PDFBox Incubator
project, so if you build and use that, it works fine. I noticed that a lot of people are using
either dev builds or their own compiled versions.
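
To illustrate why missing separator hooks lead to words being stuck together, here is a
minimal, self-contained sketch. It does not use the PDFBox API: SketchStripper,
CollectingStripper and SeparatorDemo are hypothetical stand-ins, and only the two hook names
are taken from the comment above. It assumes the stripper calls one hook between words and
one at the end of each line, and that the subclass must forward whitespace itself.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a text stripper; this is NOT the PDFBox class. */
abstract class SketchStripper {
    /** Hook fired between two words on the same line. */
    protected abstract void processWordSeparator();

    /** Hook fired at the end of every line. */
    protected abstract void processLineSeparator();

    /** Hook fired for each word of text. */
    protected abstract void writeText(String text);

    /** Walks the lines of words and fires the hooks in document order. */
    public final void strip(List<List<String>> lines) {
        for (List<String> line : lines) {
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    processWordSeparator();
                }
                writeText(line.get(i));
            }
            processLineSeparator();
        }
    }
}

/** Collects output; optionally ignores the separator hooks to mimic the bug. */
class CollectingStripper extends SketchStripper {
    private final StringBuilder out = new StringBuilder();
    private final boolean forwardSeparators;

    CollectingStripper(boolean forwardSeparators) {
        this.forwardSeparators = forwardSeparators;
    }

    @Override
    protected void processWordSeparator() {
        if (forwardSeparators) {
            out.append(' ');
        }
    }

    @Override
    protected void processLineSeparator() {
        if (forwardSeparators) {
            out.append('\n');
        }
    }

    @Override
    protected void writeText(String text) {
        out.append(text);
    }

    String result() {
        return out.toString();
    }
}

public class SeparatorDemo {
    /** Runs the sketch stripper over the given lines and returns the collected text. */
    static String extract(List<List<String>> lines, boolean forwardSeparators) {
        CollectingStripper stripper = new CollectingStripper(forwardSeparators);
        stripper.strip(lines);
        return stripper.result();
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("Apache", "Tika"),
                Arrays.asList("Content", "Analysis"));
        System.out.println(extract(lines, false)); // separators dropped
        System.out.print(extract(lines, true));    // separators forwarded
    }
}
```

With the hooks ignored, the words run together much like the extraction result quoted in the
issue; with the hooks forwarded, spaces and line breaks are preserved.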

I see that they are looking to do a first release under the new Apache Incubator project,
but need to resolve PDFBOX-366 (https://issues.apache.org/jira/browse/PDFBOX-366). Jukka,
do you know the status of this?

If we want to release Tika 0.2-incubating before the first PDFBox release, there is a
workaround that I don't particularly like myself, but it would solve the problem when using
PDFBox-0.7.3. I will attach it in a patch.

> PDFParser : Getting content of the document using "writer.ToString ()" , some words are
> stuck together
> ------------------------------------------------------------------------------------------------------
>                 Key: TIKA-114
>                 URL: https://issues.apache.org/jira/browse/TIKA-114
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2-incubating
>            Reporter: Rida Benjelloun
>             Fix For: 0.2-incubating
> PDFParser : Getting the content of the document using "writer.ToString ()" , some words
> are stuck together
> Result of PDF extraction : 
> "Apache Tika - Apache Tikahttp://incubator.apache.org/tika/1 of 115.9.2007 11:02Tika
> - Content Analysis ToolkitApache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser libraries. Apache Tika
> is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the
> Apache Lucene PMC. Incubation is required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making process have stabilized
> in a manner consistent with other successful ASF projects. While incubation status is not
> necessarily a reflection of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.See the Apache Tika Incubation Status
> page for the current incubation status.Latest NewsMarch 22nd, 2007: Apache Tika project startedThe
> Apache Tika project was formally started when the Tika proposal was accepted by the Apache
> Incubator PMC."

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
