lucene-java-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 07:00:19 GMT
The text should come out as a stream of words separated by spaces, but
without any of the formatting in the PDF. Extraction is only good enough
to tell you that a word is somewhere inside a PDF file.  Can you post a
short bit of the text that it extracted?

Also, you should try this test on different PDF files that were made
with different software.
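
One quick way to compare results across several PDF files is to measure
how much whitespace the extracted text contains. The sketch below is only
an illustration: it uses the plain JDK and does not call Tika. The class
name ExtractionCheck, its method names, and the thresholds (5% whitespace,
40-character tokens) are assumptions chosen for the example, not anything
defined by Tika or Lucene. Feed it the string your extraction step
produces.

```java
import java.util.Locale;

// Hypothetical helper: flags extracted text that appears to have lost its spaces.
final class ExtractionCheck {

    /** Fraction of characters that are whitespace (0.0 for empty input). */
    static double whitespaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long spaces = text.chars().filter(Character::isWhitespace).count();
        return (double) spaces / text.length();
    }

    /** Length of the longest run of non-whitespace characters. */
    static int longestToken(String text) {
        int longest = 0;
        for (String token : text.split("\\s+")) {
            longest = Math.max(longest, token.length());
        }
        return longest;
    }

    /** Heuristic only: both thresholds are assumed values, tune them for your data. */
    static boolean looksUnspaced(String text) {
        return whitespaceRatio(text) < 0.05 || longestToken(text) > 40;
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument, or use the built-in sample.
        String sample = args.length > 0
                ? args[0]
                : "Thetextshouldcomeoutasastreamofwordswithspaces";
        System.out.printf(Locale.ROOT, "whitespace ratio: %.3f%n", whitespaceRatio(sample));
        System.out.println("longest token: " + longestToken(sample));
        System.out.println("looks unspaced: " + looksUnspaced(sample));
    }
}
```

If a file made with one PDF tool is flagged and a file made with another
is not, that points at how the PDF was produced rather than at the
indexing step.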

On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
> Hello all,
>
> I know this is not the right group for this question, but I thought some of
> you might have experienced this.
>
> I am a newbie with Tika and am using the latest version, 0.8. I extracted text
> from a PDF document but found the spaces and newlines missing. Indexing the
> data gives wrong results. Could anyone in this group help me? I am using Tika
> directly to extract the contents, which later get indexed.
>
> Regards
> Ganesh
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Lance Norskog
goksron@gmail.com


