lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hans Merkl <hme...@rightonpoint.us>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 14:35:20 GMT
pdftotext is much better and faster from my experience.


On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fabiano@nunes.me> wrote:

> Have you ever tried other extractor tool than PDFBox? I used to extract
> contents with pdfbox: its capability of extract contents wasn't a problem,
> but its lack of structure information was.
> You can try poppler-utils (pdftotext) to extract contents with
> layout structure.
>
> Fabiano Nunes
>
>
>
>
>
> On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ian.lea@gmail.com> wrote:
>
> > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > Have you tried asking on the tika mailing list?
> > http://tika.apache.org/mail-lists.html.
> >
> >
> > --
> > Ian.
> >
> >
> > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailgane@yahoo.co.in> wrote:
> > > I first extract the contents from documents using tika and latter index
> > it with Lucene. The problem is the extracted text from PDF using tika has
> no
> > whitespaces.
> > >
> > > Regards
> > > Ganesh
> > >
> > >
> > > ----- Original Message -----
> > > From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
> > > To: <java-user@lucene.apache.org>
> > > Sent: Friday, December 03, 2010 4:40 PM
> > > Subject: RE: PDF text extracted without spaces
> > >
> > >
> > >> Hi Ganesh
> > >>
> > >> I encountered this same problem last week. I was thinking if it was
> > possible to include at minimum a WhitespaceAnalyzer somewhere within Tika
> > which would solve the problem. I am not sure of how this would be done as
> I
> > am not familiar with Tika codebase.
> > >>
> > >> Unfortunately I don't think that the solution to the first part of
> this
> > problem lies within the java-user mailing list.
> > >>
> > >> When were you sending extracted contents to Lucene... at what later
> > stage?
> > >>
> > >> Thank you
> > >>
> > >> Lewis
> > >>
> > >> -----Original Message-----
> > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > >> Sent: 03 December 2010 10:44
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: PDF text extracted without spaces
> > >>
> > >> The main problem is i am not getting whitespace and newline char. This
> > is happening only for PDF documents.
> > >>
> > >> Sample outoput: Someofthedifferencesare but it should be Some of the
> > differences are
> > >>
> > >> Regards
> > >> Ganesh
> > >>
> > >> ----- Original Message -----
> > >> From: "Alexander Aristov" <alexander.aristov@gmail.com>
> > >> To: <java-user@lucene.apache.org>
> > >> Sent: Friday, December 03, 2010 2:39 PM
> > >> Subject: Re: PDF text extracted without spaces
> > >>
> > >>
> > >>> anyway even if you get correct whitespaces and new lines this won't
> > affect
> > >>> indexing.
> > >>>
> > >>> Best Regards
> > >>> Alexander Aristov
> > >>>
> > >>>
> > >>> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
> > >>>
> > >>>> The text should come out as a stream of words with space, but
> without
> > >>>> any of the formatting in the PDF. Extraction is only good enough
to
> > >>>> tell you that a word is somewhere inside a PDF file.  Can you post
a
> > >>>> short bit of the text that it extracted?
> > >>>>
> > >>>> Also, you should try this test on different PDF files that were
made
> > >>>> with different software.
> > >>>>
> > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in>
> wrote:
> > >>>> > Hello all,
> > >>>> >
> > >>>> > I know, this is not the right group to ask this question,
thought
> > some of
> > >>>> you guys might have experienced.
> > >>>> >
> > >>>> > I newbie with Tika. I am using latest version 0.8 version.
I
> > extracted
> > >>>> text from PDF document but found spaces and new line missing.
> Indexing
> > the
> > >>>> data gives wrong result. Could any one in this group could help
me?
> I
> > am
> > >>>> using tika directly to extract the contents, which later gets
> indexed.
> > >>>> >
> > >>>> > Regards
> > >>>> > Ganesh
> > >>>> > Send free SMS to your Friends on Mobile from your Yahoo!
> Messenger.
> > >>>> Download Now! http://messenger.yahoo.com/download.php
> > >>>> >
> > >>>> >
> > ---------------------------------------------------------------------
> > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>> >
> > >>>> >
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Lance Norskog
> > >>>> goksron@gmail.com
> > >>>>
> > >>>>
> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>>
> > >>>>
> > >>>
> > >> Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > Download Now! http://messenger.yahoo.com/download.php
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >> Email has been scanned for viruses by Altman Technologies' email
> > management service - www.altman.co.uk/emailsystems
> > >>
> > >> Glasgow Caledonian University is a registered Scottish charity, number
> > SC021474
> > >>
> > >> Winner: Times Higher Education’s Widening Participation Initiative of
> > the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> > >>
> >
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> > >>
> > > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > Download Now! http://messenger.yahoo.com/download.php
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>



-- 

Hans Merkl
Right On Point, LLC
215 Victor Parkway, Suite E
Annapolis, MD 21403

Phone: (443) 951-4324
E-mail: hmerkl@rightonpoint.us

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message