From: "Kalani Ruwanpathirana"
To: java-user@lucene.apache.org
Date: Thu, 4 Dec 2008 19:15:39 +0530
Subject: Re: Pdf in Lucene?
Message-ID: <5816fbdd0812040545x31b0233fjc4ccce3c329c8881@mail.gmail.com>

Hi Tiziano,

What error did you get? I think you can extract the text easily with the
code below.

    // Parse the PDF with PDFBox
    FileInputStream fi = new FileInputStream(new File("sample.pdf"));
    PDFParser parser = new PDFParser(fi);
    parser.parse();
    COSDocument cd = parser.getDocument();

    // Extract the plain text, then release the document
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(new PDDocument(cd));
    cd.close();

Once you have the value of text, you can simply create the Lucene document.

    Document doc = new Document();
    doc.add(new Field("id", "2", Field.Store.YES, Field.Index.TOKENIZED));
    // Index the extracted text without storing it
    doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));

On Thu, Dec 4, 2008 at 6:20 PM, tiziano bernardi wrote:

> Thanks, very kind...
> But I've tried that code and it does not work for me...
> Could you send me a simple working class that uses it, please?
> Thanks
>
> > Date: Thu, 4 Dec 2008 15:19:26 +0530
> > From: kalanir@gmail.com
> > To: java-user@lucene.apache.org
> > Subject: Re: Pdf in Lucene?
> >
> > Hi,
> >
> > In my case I used PDFBox just to extract the text from the PDF document,
> > and then I created the Lucene document from the extracted text. (I didn't
> > use PDFBox's built-in Lucene search engine.) So I didn't run into any
> > incompatibility problems.
> >
> > This blog post shows the way:
> >
> > http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
> >
> > It worked perfectly for me.
> >
> > Thanks.

--
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa
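
P.S. Since the actual error isn't known yet, it may be worth ruling out the input file before looking at the PDFBox or Lucene code. The sketch below only checks whether a file starts with the "%PDF-" signature that a well-formed PDF begins with. It uses plain JDK classes only. PdfHeaderCheck and its method names are made up for illustration and are not part of PDFBox or Lucene, and it is only a guess that the file could be the problem.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox or Lucene.
class PdfHeaderCheck {
    // A well-formed PDF starts with the ASCII bytes "%PDF-" followed by a version.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes start with the PDF signature.
    static boolean hasPdfHeader(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    // Reads the first bytes of the file and checks them.
    // Returns false if the file is missing, unreadable, or too short.
    static boolean looksLikePdf(String path) {
        byte[] head = new byte[MAGIC.length];
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // end of file reached before the signature was complete
                }
                total += n;
            }
            return total == head.length && hasPdfHeader(head);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "sample.pdf" matches the file name used in the snippet above.
        String path = args.length > 0 ? args[0] : "sample.pdf";
        System.out.println(path + (looksLikePdf(path)
                ? " starts with a PDF signature"
                : " is missing, unreadable, or does not start with a PDF signature"));
    }
}
```

If this reports a problem for sample.pdf, the path or the file itself is the first thing to look at. If it passes, the exception message from parser.parse() or stripper.getText() would be the next thing to post to the list.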