Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of rj.seward@gmail.com
 designates 209.85.213.176 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=OuhHl/ZVEW+aI6gXNrOEQIpwH0NFv2lQVpHZNomvfsj/RAFe5GZPpsWXzG67/vvs6N
         sWhb74kectMDaCgTl8a/IC2sTbVqKlBWqqCIxRtPU+H4FD+ycOPwsvJDutbmPyV1KuQr
         5QFWVzWnyHHuZgmdqkiiQN+2WAIsvjzcEpNpI=
MIME-Version: 1.0
In-Reply-To: <AANLkTimJpLaBweFqkuXP=qRPXks6rtubRE5e8g-TfbHA@mail.gmail.com>
References: <52AECCE6F5764C2D903CD6D9956B7831@sv.us.sonicwall.com>
	<AANLkTikCZsgvg4Ty86P0aAQvOarJhxfcy000iEEqwTrY@mail.gmail.com>
	<AANLkTinPoiUTnJ_0=s--oejjfbBLe4P97M8bWYrfZiuB@mail.gmail.com>
	<996005B2AD9E4482809CEF8E765197B7@sv.us.sonicwall.com>
	<90362ECB07906C40A0297F9D5211A24323C3D81F16@ITSEMBXCLUS.enterprise.gcal.ac.uk>
	<BA003BD932EF44819FC1BC036586FBB7@sv.us.sonicwall.com>
	<AANLkTi=xqjCuc-YWp=qyq98+iivoRnME1+kNgqK5Y=q6@mail.gmail.com>
	<AANLkTi=80xxOPLAGGZfC0voJQjDG1a1YKPHf+ztZTiYr@mail.gmail.com>
	<AANLkTimJpLaBweFqkuXP=qRPXks6rtubRE5e8g-TfbHA@mail.gmail.com>
Date: Fri, 3 Dec 2010 09:51:39 -0500
Message-ID: <AANLkTimt=xDC9v=ss-4pYXALzV=S-ai5J=ZfWJbzg3FS@mail.gmail.com>
Subject: Re: PDF text extracted without spaces
From: Ralph Seward <rj.seward@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=20cf3054ace509c1e8049682ad87

--20cf3054ace509c1e8049682ad87
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html .

"Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and should
run on pretty much any system with a decent C++ compiler."
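
For anyone who wants to try this from Java before indexing, here is a minimal sketch of calling pdftotext as an external process. It assumes the pdftotext binary (from Xpdf or poppler-utils) is installed and on the PATH. The class name PdfToTextSketch and its method names are made up for illustration and are not part of Lucene, Tika, or Xpdf. The -layout and -enc options and the "-" output argument (write to stdout) are documented pdftotext options, but please check the man page for your version.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Lucene, Tika, or Xpdf.
final class PdfToTextSketch {

    // Builds the command line: pdftotext -layout -enc UTF-8 <pdf> -
    // "-" as the output file asks pdftotext to write the text to stdout.
    static List<String> buildCommand(String pdfPath) {
        return List.of("pdftotext", "-layout", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns the extracted text.
    // Assumption: the pdftotext binary is on the PATH.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show tool errors on our stderr
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // The returned string could then be added to a Lucene Document field.
        System.out.println(extractText(args[0]));
    }
}
```

Whether the spacing comes out right probably still depends on how the PDF was produced, so it is worth running this on a few different files before relying on it.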

Ralph

On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl <hmerkl@rightonpoint.us> wrote:
>
> pdftotext is much better and faster, in my experience.
>
>
> On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fabiano@nunes.me> wrote:
>
> > Have you tried an extraction tool other than PDFBox? I used to extract
> > content with PDFBox: its ability to extract content wasn't a problem,
> > but its lack of structure information was.
> > You can try poppler-utils (pdftotext) to extract content with
> > layout structure.
> >
> > Fabiano Nunes
> >
> >
> >
> >
> >
> > On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ian.lea@gmail.com> wrote:
> >
> > > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > > Have you tried asking on the tika mailing list?
> > > http://tika.apache.org/mail-lists.html.
> > >
> > >
> > > --
> > > Ian.
> > >
> > >
> > > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailgane@yahoo.co.in> wrote:
> > > > I first extract the contents from documents using Tika and later
> > > > index them with Lucene. The problem is that the text Tika extracts
> > > > from PDFs has no whitespace.
> > > >
> > > > Regards
> > > > Ganesh
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
> > > > To: <java-user@lucene.apache.org>
> > > > Sent: Friday, December 03, 2010 4:40 PM
> > > > Subject: RE: PDF text extracted without spaces
> > > >
> > > >
> > > >> Hi Ganesh
> > > >>
> > > >> I encountered this same problem last week. I was wondering whether it
> > > >> would be possible to include at least a WhitespaceAnalyzer somewhere
> > > >> within Tika, which would solve the problem. I am not sure how this
> > > >> would be done, as I am not familiar with the Tika codebase.
> > > >>
> > > >> Unfortunately I don't think that the solution to the first part of
> > > >> this problem lies within the java-user mailing list.
> > > >>
> > > >> When were you sending the extracted contents to Lucene... at what
> > > >> later stage?
> > > >>
> > > >> Thank you
> > > >>
> > > >> Lewis
> > > >>
> > > >> -----Original Message-----
> > > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > > >> Sent: 03 December 2010 10:44
> > > >> To: java-user@lucene.apache.org
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >> The main problem is that I am not getting whitespace or newline
> > > >> characters. This happens only for PDF documents.
> > > >>
> > > >> Sample output: "Someofthedifferencesare", but it should be "Some of
> > > >> the differences are".
> > > >>
> > > >> Regards
> > > >> Ganesh
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Alexander Aristov" <alexander.aristov@gmail.com>
> > > >> To: <java-user@lucene.apache.org>
> > > >> Sent: Friday, December 03, 2010 2:39 PM
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >>
> > > >>> Anyway, even if you get correct whitespace and newlines, this won't
> > > >>> affect indexing.
> > > >>>
> > > >>> Best Regards
> > > >>> Alexander Aristov
> > > >>>
> > > >>>
> > > >>> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
> > > >>>
> > > >>>> The text should come out as a stream of words with spaces, but
> > > >>>> without any of the formatting in the PDF. Extraction is only good
> > > >>>> enough to tell you that a word is somewhere inside a PDF file. Can
> > > >>>> you post a short bit of the text that it extracted?
> > > >>>>
> > > >>>> Also, you should try this test on different PDF files that were
> > > >>>> made with different software.
> > > >>>>
> > > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
> > > >>>> > Hello all,
> > > >>>> >
> > > >>>> > I know this is not the right group for this question, but I
> > > >>>> > thought some of you might have run into the same problem.
> > > >>>> >
> > > >>>> > I am a newbie with Tika, using the latest version, 0.8. I
> > > >>>> > extracted text from a PDF document but found the spaces and
> > > >>>> > newlines missing. Indexing the data gives wrong results. Could
> > > >>>> > anyone in this group help me? I am using Tika directly to
> > > >>>> > extract the contents, which later get indexed.
> > > >>>> >
> > > >>>> > Regards
> > > >>>> > Ganesh
> > > >>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > > >>>> > Download Now! http://messenger.yahoo.com/download.php
> > > >>>> >
> > > >>>> >
> > > >>>> > ---------------------------------------------------------------------
> > > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>>> >
> > > >>>> >
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Lance Norskog
> > > >>>> goksron@gmail.com
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
> --
>
> Hans Merkl
> Right On Point, LLC
> 215 Victor Parkway, Suite E
> Annapolis, MD 21403
>
> Phone: (443) 951-4324
> E-mail: hmerkl@rightonpoint.us

--20cf3054ace509c1e8049682ad87--