From: "Ganesh"
Reply-To: java-user@lucene.apache.org
Subject: Re: PDF text extracted without spaces
Date: Fri, 3 Dec 2010 17:25:52 +0530

I first extract the contents from documents using Tika and later index them with Lucene. The problem is that the text Tika extracts from PDF files has no whitespace.

Regards
Ganesh

----- Original Message -----
From: "McGibbney, Lewis John"
To:
Sent: Friday, December 03, 2010 4:40 PM
Subject: RE: PDF text extracted without spaces

> Hi Ganesh
>
> I encountered this same problem last week. I was wondering whether it would be possible to include at minimum a WhitespaceAnalyzer somewhere within Tika, which would solve the problem. I am not sure how this would be done, as I am not familiar with the Tika codebase.
>
> Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.
>
> When were you sending extracted contents to Lucene... at what later stage?
>
> Thank you
>
> Lewis
>
> -----Original Message-----
> From: Ganesh [mailto:emailgane@yahoo.co.in]
> Sent: 03 December 2010 10:44
> To: java-user@lucene.apache.org
> Subject: Re: PDF text extracted without spaces
>
> The main problem is that I am not getting whitespace and newline characters. This is happening only for PDF documents.
>
> Sample output: Someofthedifferencesare but it should be Some of the differences are
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Alexander Aristov"
> To:
> Sent: Friday, December 03, 2010 2:39 PM
> Subject: Re: PDF text extracted without spaces
>
>> Anyway, even if you get correct whitespace and newlines, this won't affect
>> indexing.
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>> On 3 December 2010 10:00, Lance Norskog wrote:
>>
>>> The text should come out as a stream of words with spaces, but without
>>> any of the formatting in the PDF. Extraction is only good enough to
>>> tell you that a word is somewhere inside a PDF file. Can you post a
>>> short bit of the text that it extracted?
>>>
>>> Also, you should try this test on different PDF files that were made
>>> with different software.
>>>
>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
>>> > Hello all,
>>> >
>>> > I know this is not the right group to ask this question, but I thought some of
>>> > you might have experienced this.
>>> >
>>> > I am a newbie with Tika. I am using the latest version, 0.8. I extracted
>>> > text from a PDF document but found spaces and newlines missing. Indexing the
>>> > data gives wrong results. Could anyone in this group help me? I am
>>> > using Tika directly to extract the contents, which later get indexed.
>>> >
>>> > Regards
>>> > Ganesh
>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>>> > Download Now! http://messenger.yahoo.com/download.php
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>
>
> Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems
>
> Glasgow Caledonian University is a registered Scottish charity, number SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
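
The symptom reported in this thread can be checked before the extracted text reaches the index. The following is a minimal sketch in plain Java and is not part of Tika or Lucene: the class `WhitespaceCheck`, its method names, and the thresholds passed to `looksRunTogether` are hypothetical, and splitting on whitespace only approximates what a whitespace-based analyzer would do. It uses the sample strings quoted above to show that the run-together text yields a single token, while the expected text yields five.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not a Tika or Lucene class.
final class WhitespaceCheck {

    // Split on runs of whitespace, roughly what a whitespace-based tokenizer does.
    static List<String> whitespaceTokens(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Fraction of characters that are whitespace; 0.0 for empty input.
    static double whitespaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long whitespace = text.chars().filter(Character::isWhitespace).count();
        return (double) whitespace / text.length();
    }

    // Assumed heuristic: text that is long enough yet has almost no whitespace
    // was probably extracted without word breaks.
    static boolean looksRunTogether(String text, int minLength, double minRatio) {
        return text.length() >= minLength && whitespaceRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        String broken = "Someofthedifferencesare";       // sample output from the thread
        String expected = "Some of the differences are"; // what it should have been

        System.out.println(whitespaceTokens(broken).size());      // prints 1
        System.out.println(whitespaceTokens(expected).size());    // prints 5
        System.out.println(looksRunTogether(broken, 20, 0.05));   // prints true
        System.out.println(looksRunTogether(expected, 20, 0.05)); // prints false
    }
}
```

A check like this only flags suspicious output so that the affected PDF files can be inspected; it does not restore the missing word breaks, which have to come from the extraction step itself.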