From: Ralph Seward
To: java-user@lucene.apache.org
Date: Fri, 3 Dec 2010 09:51:39 -0500
Subject: Re: PDF text extracted without spaces
References: <52AECCE6F5764C2D903CD6D9956B7831@sv.us.sonicwall.com>
 <996005B2AD9E4482809CEF8E765197B7@sv.us.sonicwall.com>
 <90362ECB07906C40A0297F9D5211A24323C3D81F16@ITSEMBXCLUS.enterprise.gcal.ac.uk>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf3054ace509c1e8049682ad87

--20cf3054ace509c1e8049682ad87
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html.

"Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and should
run on pretty much any system with a decent C++ compiler."

Ralph

On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl wrote:

> pdftotext is much better and faster in my experience.
>
> On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes wrote:
>
> > Have you ever tried an extractor tool other than PDFBox? I used to
> > extract contents with PDFBox: its ability to extract contents wasn't a
> > problem, but its lack of structure information was.
> > You can try poppler-utils (pdftotext) to extract contents with layout
> > structure.
> >
> > Fabiano Nunes
> >
> > On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea wrote:
> >
> > > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > > Have you tried asking on the tika mailing list?
> > > http://tika.apache.org/mail-lists.html.
> > >
> > > --
> > > Ian.
> > >
> > > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
> > > > I first extract the contents from documents using Tika and later
> > > > index them with Lucene. The problem is that the text extracted from
> > > > PDFs using Tika has no whitespace.
> > > >
> > > > Regards
> > > > Ganesh
> > > >
> > > > ----- Original Message -----
> > > > From: "McGibbney, Lewis John"
> > > > To:
> > > > Sent: Friday, December 03, 2010 4:40 PM
> > > > Subject: RE: PDF text extracted without spaces
> > > >
> > > >> Hi Ganesh
> > > >>
> > > >> I encountered this same problem last week. I was wondering whether it
> > > >> would be possible to include at minimum a WhitespaceAnalyzer somewhere
> > > >> within Tika, which would solve the problem. I am not sure how this
> > > >> would be done as I am not familiar with the Tika codebase.
> > > >>
> > > >> Unfortunately I don't think that the solution to the first part of
> > > >> this problem lies within the java-user mailing list.
> > > >>
> > > >> When were you sending extracted contents to Lucene... at what later
> > > >> stage?
> > > >>
> > > >> Thank you
> > > >>
> > > >> Lewis
> > > >>
> > > >> -----Original Message-----
> > > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > > >> Sent: 03 December 2010 10:44
> > > >> To: java-user@lucene.apache.org
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >> The main problem is that I am not getting whitespace and newline
> > > >> characters. This is happening only for PDF documents.
> > > >>
> > > >> Sample output: "Someofthedifferencesare", but it should be
> > > >> "Some of the differences are".
> > > >>
> > > >> Regards
> > > >> Ganesh
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Alexander Aristov"
> > > >> To:
> > > >> Sent: Friday, December 03, 2010 2:39 PM
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >>> Anyway, even if you get correct whitespace and newlines, this won't
> > > >>> affect indexing.
> > > >>>
> > > >>> Best Regards
> > > >>> Alexander Aristov
> > > >>>
> > > >>> On 3 December 2010 10:00, Lance Norskog wrote:
> > > >>>
> > > >>>> The text should come out as a stream of words with spaces, but
> > > >>>> without any of the formatting in the PDF. Extraction is only good
> > > >>>> enough to tell you that a word is somewhere inside a PDF file. Can
> > > >>>> you post a short bit of the text that it extracted?
> > > >>>>
> > > >>>> Also, you should try this test on different PDF files that were made
> > > >>>> with different software.
> > > >>>>
> > > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
> > > >>>> > Hello all,
> > > >>>> >
> > > >>>> > I know this is not the right group for this question, but I
> > > >>>> > thought some of you might have experienced this.
> > > >>>> >
> > > >>>> > I am a newbie with Tika, using the latest version, 0.8. I
> > > >>>> > extracted text from a PDF document but found spaces and newlines
> > > >>>> > missing. Indexing the data gives wrong results. Could anyone in
> > > >>>> > this group help me? I am using Tika directly to extract the
> > > >>>> > contents, which later get indexed.
> > > >>>> >
> > > >>>> > Regards
> > > >>>> > Ganesh
> > > >>>> >
> > > >>>> > ---------------------------------------------------------------------
> > > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>>>
> > > >>>> --
> > > >>>> Lance Norskog
> > > >>>> goksron@gmail.com
>
> --
> Hans Merkl
> Right On Point, LLC
> 215 Victor Parkway, Suite E
> Annapolis, MD 21403
>
> Phone: (443) 951-4324
> E-mail: hmerkl@rightonpoint.us

--20cf3054ace509c1e8049682ad87--