lucene-java-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 09:09:28 GMT
Anyway, even if you get the correct whitespace and newlines, this won't affect
indexing.

Best Regards
Alexander Aristov


On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:

> The text should come out as a stream of words with spaces, but without
> any of the formatting in the PDF. Extraction is only good enough to
> tell you that a word is somewhere inside a PDF file.  Can you post a
> short bit of the text that it extracted?
>
> Also, you should try this test on different PDF files that were made
> with different software.
>
> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
> > Hello all,
> >
> > I know this is not the right group for this question, but I thought some of
> > you might have run into this.
> >
> > I am a newbie with Tika, using the latest version, 0.8. I extracted
> > text from a PDF document but found spaces and newlines missing. Indexing the
> > data gives wrong results. Could anyone in this group help me? I am
> > using Tika directly to extract the contents, which later get indexed.
> >
> > Regards
> > Ganesh
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
>
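
To see why extracted text with the spaces between words missing would give
wrong search results, here is a minimal sketch in plain Java. It does not use
Tika or Lucene; the class and method names (MissingSpaceDemo, tokenize,
averageTokenLength) are hypothetical, and the tokenizer is only a rough
stand-in for a whitespace-based analyzer. The average token length is one
crude way to spot extraction output that has lost its spaces before you
index it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika or Lucene.
class MissingSpaceDemo {

    // Split on runs of whitespace and lowercase each piece, roughly what a
    // simple whitespace-based analyzer would do before indexing.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Average token length; an unusually large value hints that the
    // extractor dropped the spaces between words.
    static double averageTokenLength(String text) {
        List<String> tokens = tokenize(text);
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        for (String token : tokens) {
            total += token.length();
        }
        return (double) total / tokens.size();
    }

    public static void main(String[] args) {
        String withSpaces = "Indexing the data gives wrong result";
        String withoutSpaces = "Indexingthedatagiveswrongresult";

        // Six separate terms, so a search for "data" can match.
        System.out.println(tokenize(withSpaces));
        // One long term, so a search for "data" finds nothing.
        System.out.println(tokenize(withoutSpaces));

        System.out.println(averageTokenLength(withSpaces));
        System.out.println(averageTokenLength(withoutSpaces));
    }
}
```

Running a check like this on a short sample of the extracted text, from PDFs
made with different software, should show quickly whether the words really
are being run together or whether only the layout is lost.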
