lucene-java-user mailing list archives

From Fabiano Nunes <fabi...@nunes.me>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 13:52:07 GMT
Have you tried an extraction tool other than PDFBox? I used to extract
content with PDFBox: its ability to extract content wasn't a problem,
but its lack of structural information was.
You can try poppler-utils (pdftotext) to extract content with the
layout structure preserved.
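
A minimal sketch of calling pdftotext from Java, in case it helps. It
assumes the pdftotext binary from poppler-utils is installed and on the
PATH; the class and method names are made up for illustration. The
-layout flag asks pdftotext to keep the physical layout, and "-" as the
output file sends the text to stdout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch: shell out to poppler's pdftotext and capture layout-preserving text. */
class PdfToTextSketch {

    /** Builds the command line; "-" as the output file sends the text to stdout. */
    static List<String> command(String pdfPath) {
        return List.of("pdftotext", "-layout", "-enc", "UTF-8", pdfPath, "-");
    }

    /** Runs pdftotext and returns its output; assumes pdftotext is on the PATH. */
    static String extract(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        // Read all of stdout before waiting, so the child never blocks on a full pipe.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }
}
```

Calling PdfToTextSketch.extract("some.pdf") would then give a String
that can be passed on to the indexing step.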

Fabiano Nunes





On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ian.lea@gmail.com> wrote:

> Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> Have you tried asking on the tika mailing list?
> http://tika.apache.org/mail-lists.html.
>
>
> --
> Ian.
>
>
> On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailgane@yahoo.co.in> wrote:
> > I first extract the contents from documents using Tika and later index
> > them with Lucene. The problem is that the text extracted from PDFs
> > using Tika has no whitespace.
> >
> > Regards
> > Ganesh
> >
> >
> > ----- Original Message -----
> > From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
> > To: <java-user@lucene.apache.org>
> > Sent: Friday, December 03, 2010 4:40 PM
> > Subject: RE: PDF text extracted without spaces
> >
> >
> >> Hi Ganesh
> >>
> >> I encountered this same problem last week. I was wondering whether it
> >> would be possible to include, at minimum, a WhitespaceAnalyzer somewhere
> >> within Tika, which would solve the problem. I am not sure how this
> >> would be done, as I am not familiar with the Tika codebase.
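
One caveat, sketched in plain Java rather than with Lucene itself: an
analyzer that splits on whitespace can only split where whitespace
already exists, so it cannot restore spaces the extractor dropped. The
helper below is a hypothetical stand-in that only approximates
whitespace-based tokenization; it is not Lucene's WhitespaceAnalyzer.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: approximate whitespace tokenization to show its limits. */
class WhitespaceSplitSketch {

    /** Splits on runs of whitespace; returns an empty list for blank input. */
    static List<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}
```

With this, tokens("Some of the differences are") yields five separate
terms, while tokens("Someofthedifferencesare") yields a single term, so
a search for "differences" would not match the unspaced text.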
> >>
> >> Unfortunately I don't think that the solution to the first part of
> >> this problem lies within the java-user mailing list.
> >>
> >> When were you sending extracted contents to Lucene... at what later
> >> stage?
> >>
> >> Thank you
> >>
> >> Lewis
> >>
> >> -----Original Message-----
> >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> >> Sent: 03 December 2010 10:44
> >> To: java-user@lucene.apache.org
> >> Subject: Re: PDF text extracted without spaces
> >>
> >> The main problem is that I am not getting whitespace or newline
> >> characters. This happens only for PDF documents.
> >>
> >> Sample output: Someofthedifferencesare
> >> but it should be: Some of the differences are
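
A rough way to flag such output before it reaches the index, sketched
under assumptions: measure the longest run of non-whitespace characters
and compare it with a threshold. The class name and the threshold of 20
are arbitrary choices for illustration, and long URLs or identifiers
would trigger false positives.

```java
/** Sketch: heuristic check for extracted text that has lost its spaces. */
class SpacingCheckSketch {

    /** Returns the length of the longest run of non-whitespace characters. */
    static int longestRun(String text) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                current = 0; // a space or newline ends the current run
            } else {
                current++;
                longest = Math.max(longest, current);
            }
        }
        return longest;
    }

    /** True when some run exceeds maxWordLength (an assumed, tunable threshold). */
    static boolean looksUnspaced(String text, int maxWordLength) {
        return longestRun(text) > maxWordLength;
    }
}
```

For the sample above, the unspaced string has a longest run of 23
characters and the correct text has 11 ("differences"), so a threshold
of 20 separates the two.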
> >>
> >> Regards
> >> Ganesh
> >>
> >> ----- Original Message -----
> >> From: "Alexander Aristov" <alexander.aristov@gmail.com>
> >> To: <java-user@lucene.apache.org>
> >> Sent: Friday, December 03, 2010 2:39 PM
> >> Subject: Re: PDF text extracted without spaces
> >>
> >>
> >>> Anyway, even if you get correct whitespace and newlines, this won't
> >>> affect indexing.
> >>>
> >>> Best Regards
> >>> Alexander Aristov
> >>>
> >>>
> >>> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
> >>>
> >>>> The text should come out as a stream of words with spaces, but without
> >>>> any of the formatting in the PDF. Extraction is only good enough to
> >>>> tell you that a word is somewhere inside a PDF file.  Can you post a
> >>>> short bit of the text that it extracted?
> >>>>
> >>>> Also, you should try this test on different PDF files that were made
> >>>> with different software.
> >>>>
> >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
> >>>> > Hello all,
> >>>> >
> >>>> > I know this is not the right group for this question, but I thought
> >>>> > some of you might have run into it.
> >>>> >
> >>>> > I am a newbie with Tika, using the latest version, 0.8. I extracted
> >>>> > text from a PDF document but found the spaces and newlines missing.
> >>>> > Indexing the data gives wrong results. Could anyone in this group
> >>>> > help me? I am using Tika directly to extract the contents, which
> >>>> > later get indexed.
> >>>> >
> >>>> > Regards
> >>>> > Ganesh
> >>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> >>>> Download Now! http://messenger.yahoo.com/download.php
> >>>> >
> >>>> >
> ---------------------------------------------------------------------
> >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>> >
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goksron@gmail.com
> >>>>
> >>>>
> >>>>
> >>>
> >> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
> >>
> >> Winner: Times Higher Education’s Widening Participation Initiative of
> the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> >>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> >>
