lucene-java-user mailing list archives

From "Ganesh" <emailg...@yahoo.co.in>
Subject Re: PDF text extracted without spaces
Date Mon, 06 Dec 2010 09:47:01 GMT
Thanks for the link. I downloaded the tool, and pdftotext extracts the text correctly. I also
sent a mail to the Tika user list. It is a regression in the latest release
(https://issues.apache.org/jira/browse/TIKA-548). Hopefully the coming release will have a fix.

Regards
Ganesh
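
A regression like this can be caught before the text reaches Lucene by checking how much whitespace the extractor returned. What follows is only a minimal sketch in plain Java: the names ExtractionCheck, whitespaceRatio and looksUnspaced, and the thresholds in the usage note, are assumptions made for illustration and are not part of Tika or Lucene.

```java
final class ExtractionCheck {

    /** Fraction of UTF-16 code units in the text that are whitespace; 0.0 for empty input. */
    static double whitespaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long whitespace = text.chars().filter(Character::isWhitespace).count();
        return (double) whitespace / text.length();
    }

    /**
     * Heuristic (an assumption, not a Tika feature): text of at least
     * minLength characters whose whitespace ratio is below minRatio
     * probably lost its spaces during extraction.
     */
    static boolean looksUnspaced(String text, int minLength, double minRatio) {
        return text.length() >= minLength && whitespaceRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        // The sample from this thread, followed by the expected text.
        System.out.println(looksUnspaced("Someofthedifferencesare", 20, 0.05));
        System.out.println(looksUnspaced("Some of the differences are", 20, 0.05));
    }
}
```

With a minimum length of 20 and a minimum ratio of 0.05, the sample string from this thread is flagged and the correctly spaced version is not. Suitable thresholds depend on the language of the documents and would need tuning.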

----- Original Message ----- 
From: "Ralph Seward" <rj.seward@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Friday, December 03, 2010 8:21 PM
Subject: Re: PDF text extracted without spaces


pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html .

"Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and should
run on pretty much any system with a decent C++ compiler."

Ralph
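
Calling pdftotext from Java ahead of indexing could look roughly like the sketch below. It assumes the pdftotext binary (from Xpdf or poppler-utils) is on the PATH; PdfToTextRunner and its methods are hypothetical names chosen for this example. The -layout and -enc options, and "-" as the output file (meaning standard output), are regular pdftotext options.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class PdfToTextRunner {

    /** Builds the pdftotext command line; "-" sends the text to standard output. */
    static List<String> buildCommand(Path pdf, boolean keepLayout) {
        List<String> command = new ArrayList<>();
        command.add("pdftotext");
        if (keepLayout) {
            command.add("-layout"); // try to preserve the physical page layout
        }
        command.add("-enc");
        command.add("UTF-8");
        command.add(pdf.toString());
        command.add("-");
        return command;
    }

    /** Runs pdftotext and returns the extracted text (requires Java 9+ for readAllBytes). */
    static String extract(Path pdf, boolean keepLayout) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, keepLayout))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show pdftotext errors on our stderr
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with status " + exitCode);
        }
        return text;
    }
}
```

The returned string can then be handed to whatever code builds the Lucene document, in place of the Tika output.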

On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl <hmerkl@rightonpoint.us> wrote:
>
> pdftotext is much better and faster from my experience.
>
>
> On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fabiano@nunes.me> wrote:
>
> > Have you ever tried an extractor tool other than PDFBox? I used to extract
> > contents with PDFBox: its ability to extract contents wasn't a problem,
> > but its lack of structure information was.
> > You can try poppler-utils (pdftotext) to extract contents with
> > layout structure.
> >
> > Fabiano Nunes
> >
> >
> >
> >
> >
> > On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ian.lea@gmail.com> wrote:
> >
> > > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > > Have you tried asking on the tika mailing list?
> > > http://tika.apache.org/mail-lists.html.
> > >
> > >
> > > --
> > > Ian.
> > >
> > >
> > > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailgane@yahoo.co.in> wrote:
> > > > I first extract the contents from documents using Tika and later
> > > > index them with Lucene. The problem is that the text extracted from
> > > > PDFs using Tika has no whitespace.
> > > >
> > > > Regards
> > > > Ganesh
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
> > > > To: <java-user@lucene.apache.org>
> > > > Sent: Friday, December 03, 2010 4:40 PM
> > > > Subject: RE: PDF text extracted without spaces
> > > >
> > > >
> > > >> Hi Ganesh
> > > >>
> > > >> I encountered this same problem last week. I was wondering whether it
> > > >> would be possible to include at minimum a WhitespaceAnalyzer somewhere
> > > >> within Tika, which would solve the problem. I am not sure how this
> > > >> would be done, as I am not familiar with the Tika codebase.
> > > >>
> > > >> Unfortunately, I don't think that the solution to the first part of
> > > >> this problem lies within the java-user mailing list.
> > > >>
> > > >> When were you sending the extracted contents to Lucene... at what later
> > > >> stage?
> > > >> Thank you
> > > >>
> > > >> Lewis
> > > >>
> > > >> -----Original Message-----
> > > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > > >> Sent: 03 December 2010 10:44
> > > >> To: java-user@lucene.apache.org
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >> The main problem is that I am not getting whitespace and newline
> > > >> characters. This is happening only for PDF documents.
> > > >>
> > > >> Sample output: "Someofthedifferencesare", but it should be "Some of the
> > > >> differences are".
> > > >>
> > > >> Regards
> > > >> Ganesh
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Alexander Aristov" <alexander.aristov@gmail.com>
> > > >> To: <java-user@lucene.apache.org>
> > > >> Sent: Friday, December 03, 2010 2:39 PM
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >>
> > > >>> Anyway, even if you get correct whitespace and newlines, this won't
> > > >>> affect indexing.
> > > >>>
> > > >>> Best Regards
> > > >>> Alexander Aristov
> > > >>>
> > > >>>
> > > >>> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
> > > >>>
> > > >>>> The text should come out as a stream of words with spaces, but without
> > > >>>> any of the formatting in the PDF. Extraction is only good enough to
> > > >>>> tell you that a word is somewhere inside a PDF file.  Can you post a
> > > >>>> short bit of the text that it extracted?
> > > >>>>
> > > >>>> Also, you should try this test on different PDF files that were made
> > > >>>> with different software.
> > > >>>>
> > > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
> > > >>>> > Hello all,
> > > >>>> >
> > > >>>> > I know this is not the right group for this question, but I thought
> > > >>>> > some of you might have experienced it.
> > > >>>> >
> > > >>>> > I am a newbie with Tika. I am using the latest version, 0.8. I
> > > >>>> > extracted text from a PDF document but found spaces and newlines
> > > >>>> > missing. Indexing the data gives wrong results. Could anyone in this
> > > >>>> > group help me? I am using Tika directly to extract the contents,
> > > >>>> > which later get indexed.
> > > >>>> >
> > > >>>> > Regards
> > > >>>> > Ganesh
> > > >>>> > Send free SMS to your Friends on Mobile from your Yahoo!
> > Messenger.
> > > >>>> Download Now! http://messenger.yahoo.com/download.php
> > > >>>> >
> > > >>>> >
> > > ---------------------------------------------------------------------
> > > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> > For additional commands, e-mail:
java-user-help@lucene.apache.org
> > > >>>> >
> > > >>>> >
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Lance Norskog
> > > >>>> goksron@gmail.com
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
> --
>
> Hans Merkl
> Right On Point, LLC
> 215 Victor Parkway, Suite E
> Annapolis, MD 21403
>
> Phone: (443) 951-4324
> E-mail: hmerkl@rightonpoint.us
