From: "Ganesh"
Subject: Re: PDF text extracted without spaces
Date: Fri, 3 Dec 2010 16:13:55 +0530

The main problem is I am not
getting whitespace and newline characters. This is happening only for PDF documents.

Sample output:
Someofthedifferencesare

but it should be:
Some of the differences are

Regards
Ganesh

----- Original Message -----
From: "Alexander Aristov"
Sent: Friday, December 03, 2010 2:39 PM
Subject: Re: PDF text extracted without spaces

> Anyway, even if you get correct whitespace and newlines, this won't affect
> indexing.
>
> Best Regards
> Alexander Aristov
>
>
> On 3 December 2010 10:00, Lance Norskog wrote:
>
>> The text should come out as a stream of words with spaces, but without
>> any of the formatting in the PDF. Extraction is only good enough to
>> tell you that a word is somewhere inside a PDF file. Can you post a
>> short bit of the text that it extracted?
>>
>> Also, you should try this test on different PDF files that were made
>> with different software.
>>
>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
>> > Hello all,
>> >
>> > I know this is not the right group to ask this question, but I thought
>> > some of you might have experienced this.
>> >
>> > I am a newbie with Tika. I am using the latest version, 0.8. I extracted
>> > text from a PDF document but found spaces and newlines missing. Indexing
>> > the data gives wrong results. Could anyone in this group help me? I am
>> > using Tika directly to extract the contents, which later get indexed.
>> >
>> > Regards
>> > Ganesh
>>
>> --
>> Lance Norskog
>> goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
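
Following the suggestion in the thread to try the extraction on several PDF files, a small check like the one below might help flag which extracted texts came out without spaces. It is only a rough sketch in plain Python, independent of Tika: the function names and the thresholds (`min_ratio`, `max_run`) are assumptions made for illustration, not part of any library, and they would likely need tuning for real documents.

```python
import re


def whitespace_stats(text):
    """Return (whitespace_ratio, longest_run) for a piece of extracted text.

    whitespace_ratio: share of characters that are whitespace.
    longest_run: length of the longest run of non-whitespace characters.
    """
    if not text:
        return 0.0, 0
    whitespace_count = sum(1 for ch in text if ch.isspace())
    runs = re.findall(r"\S+", text)
    longest_run = max((len(run) for run in runs), default=0)
    return whitespace_count / len(text), longest_run


def looks_unspaced(text, min_ratio=0.05, max_run=40):
    """Heuristic: True if the text seems to have lost its word separators.

    The thresholds are illustrative assumptions, not measured values.
    """
    if not text.strip():
        return False  # nothing extracted; a different problem
    ratio, longest_run = whitespace_stats(text)
    return ratio < min_ratio or longest_run > max_run


if __name__ == "__main__":
    # The two samples quoted in the message above.
    for sample in ("Someofthedifferencesare", "Some of the differences are"):
        print(repr(sample), looks_unspaced(sample))
```

Running this over the text extracted from PDFs produced by different software would show whether the problem is tied to particular files or affects every PDF.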