Message-ID: <359a92830707171107s125cdebaqcbb4e210a0c7ad95@mail.gmail.com>
Date: Tue, 17 Jul 2007 14:07:52 -0400
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: getting problem while indexing pdf files with pdfbox
In-Reply-To: <11653883.post@talk.nabble.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_60549_29576718.1184695672101"
References: <11647342.post@talk.nabble.com> <359a92830707170652o6de17af5u38bb4f7bbf6a565d@mail.gmail.com> <11653883.post@talk.nabble.com>

------=_Part_60549_29576718.1184695672101
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

You have NOT supplied an example of the text you extracted from the
document. But let's assume that the interesting string is exactly what
you expect.

Have you looked at your index with Luke to see if the data is there?
I *strongly* suggest you get a copy of Luke (google lucene luke) to
examine indexes with.

The existence of the write.lock file suggests that you haven't closed
your index prior to searching it, although flushing it would probably
work too. Be aware that you cannot see changes to an index if the reader
you use was opened before the indexing operation. Also, there is some
period of time when the indexed data is buffered by the writer, and I
doubt (but am not sure) that it's available until it's been flushed.

I suspect that your problem is not related to PDF, but rather to whether
you've properly indexed the data and closed your index prior to
searching it.

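To make the visibility point concrete, here is a toy model in plain Java.
It is a sketch, not Lucene code: ToyIndex, ToyWriter and ToyReader are
made-up names, and the real IndexWriter buffers and flushes in more
involved ways. It only assumes the two behaviours described above:
documents sit in the writer's buffer until the writer is closed, and a
reader only sees what was committed at the moment it was opened.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Lucene): why a search can find 0 hits if the writer was never closed. */
final class IndexVisibilityDemo {

    /** Stands in for the on-disk index; only committed documents live here. */
    static final class ToyIndex {
        private final List<String> committed = new ArrayList<>();

        ToyWriter openWriter() {
            return new ToyWriter(this);
        }

        /** A reader is a snapshot of whatever was committed when it was opened. */
        ToyReader openReader() {
            return new ToyReader(new ArrayList<>(committed));
        }
    }

    /** Hypothetical writer: added documents stay in memory until close(). */
    static final class ToyWriter {
        private final ToyIndex index;
        private final List<String> buffer = new ArrayList<>();

        ToyWriter(ToyIndex index) {
            this.index = index;
        }

        void addDocument(String text) {
            buffer.add(text);
        }

        /** Moves the buffered documents into the index, making them searchable. */
        void close() {
            index.committed.addAll(buffer);
            buffer.clear();
        }
    }

    /** Hypothetical reader: searches only its own snapshot. */
    static final class ToyReader {
        private final List<String> snapshot;

        ToyReader(List<String> snapshot) {
            this.snapshot = snapshot;
        }

        int countMatches(String term) {
            int hits = 0;
            for (String doc : snapshot) {
                if (doc.contains(term)) {
                    hits++;
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        ToyWriter writer = index.openWriter();
        writer.addDocument("PAYOUT : RA0083");

        // Opened while the document is still only in the writer's buffer.
        ToyReader early = index.openReader();
        System.out.println("before close: " + early.countMatches("RA0083")); // 0

        writer.close();
        System.out.println("stale reader: " + early.countMatches("RA0083")); // 0
        System.out.println("fresh reader: " + index.openReader().countMatches("RA0083")); // 1
    }
}
```

In the indexer from the original post, the matching step would be a call
to close() on f_writer after addDocument(doc), with the IndexSearcher
opened only after that.
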
The other possibility is that your analyzer is parsing things
"interestingly". StandardAnalyzer does some interesting things when
tokenizing, including lowercasing the input stream, although that
shouldn't be a problem since you use the same analyzer for indexing
and searching.

Also, try query.toString() to see what is actually searched; that often
gives insights. The aforementioned Luke will let you submit queries to
the index and will explain what the actual query produced is.

What are the file sizes of your index files?

Best
Erick

On 7/17/07, neetika wrote:
>
> Hi Erick,
>
> Before indexing I printed the doc, and I have given the output as well.
> It is printing fine.
> Kindly check my post again, as follows...
>
> " System.out.println(doc);
> //Following code is for making index"
>
> and the corresponding output is...
>
> Document RA0083
> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
> 0000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>
> which is as expected... but my problem is that the index file is not
> getting generated.
>
> Please help
>
> Erick Erickson wrote:
> >
> > Offhand I'd assume that your problem is using PDFbox. Have you
> > tried printing out the docText string you get back from
> >
> > docText = stripper.getText(new PDDocument(cosDoc))?
> >
> > I'd recommend you assure yourself that you get valid text back from
> > the PDF document before worrying about indexing it.
> >
> > Best
> > Erick
> >
> > On 7/17/07, neetika wrote:
> >>
> >> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf
> >>
> >> hi all,
> >>
> >> i am able to convert a pdf in to a text file using pdfbox.
> >> and this is the code that I used, but I am not able to index it
> >>
> >> // code for parsing and making index
> >>
> >> public Document getDocument(InputStream is)
> >> {
> >>     COSDocument cosDoc = null;
> >>     try {
> >>         PDFParser parser = new PDFParser(is);
> >>         parser.parse();
> >>         cosDoc = parser.getDocument();
> >>     }
> >>     catch (IOException e) {
> >>         e.printStackTrace();
> >>     }
> >>     String docText = null;
> >>     try {
> >>         PDFTextStripper stripper = new PDFTextStripper();
> >>         docText = stripper.getText(new PDDocument(cosDoc));
> >>     }
> >>     catch (IOException e) {
> >>         e.printStackTrace();
> >>     }
> >>     Document doc = new Document();
> >>     if (docText != null) {
> >>         doc.add(new Field("body", docText, Field.Store.YES,
> >>                 Field.Index.TOKENIZED));
> >>     }
> >>     return doc;
> >> }
> >>
> >> public static void main(String[] args) throws Exception {
> >>     TestPDFParser handler = new TestPDFParser();
> >>
> >>     Document doc = handler.getDocument(new FileInputStream(
> >>             new File("D:\\lucenePdf\\DRra0026.pdf")));
> >>
> >>     System.out.println(doc);
> >>
> >>     //Following code is for making index
> >>
> >>     IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
> >>             new StandardAnalyzer(), true);
> >>
> >>     f_writer.addDocument(doc);
> >>
> >> }
> >> }
> >> //code for searching a particular string..
> >>
> >> public static void main(String[] args) throws Exception {
> >>     String indexDir = "D:\\lucenePdf";
> >>     String q = "RA0083";
> >>
> >>     Directory fsDir = FSDirectory.getDirectory(indexDir);
> >>     IndexSearcher is = new IndexSearcher(fsDir);
> >>
> >>     Query query = new QueryParser("body",
> >>             new StandardAnalyzer()).parse(q);
> >>
> >>     Hits hits = is.search(query);
> >>     System.out.println("Found " + hits.length()
> >>             + " documents that matched query '" + q + "':");
> >>     for (int i = 0; i < hits.length(); i++) {
> >>         Document doc = hits.doc(i);
> >>
> >>     }
> >> }
> >>
> >>
> >> When I run the above code, I get the following output as a result
> >> of running the indexer class:
> >>
> >> Document >> : RA0083
> >> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
> >> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
> >> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
> >> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083
> >>
> >> and the following files are generated in the specified path:
> >>
> >> segments.gen
> >> write.lock
> >> segments_4
> >>
> >>
> >> but when I run the search class it gives the result as:
> >>
> >> Found 0 documents that matched query 'RA0083':
> >>
> >> I am also attaching the corresponding pdf file for reference.
> >> It seems the index is not getting created.
> >>
> >> Please help me with some of your inputs; it will be very helpful
> >> for me.
> >> --
> >> View this message in context:
> >> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >
> --
> View this message in context:
> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11653883
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>

------=_Part_60549_29576718.1184695672101--