Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <5816fbdd0812040545x31b0233fjc4ccce3c329c8881@mail.gmail.com>
Date: Thu, 4 Dec 2008 19:15:39 +0530
From: "Kalani Ruwanpathirana" <kalanir@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Pdf in Lucene?
In-Reply-To: <COL113-W32CCABA46AA29E6A0A03BDBE020@phx.gbl>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_4892_10446687.1228398339982"
References: <COL113-W31062E878D138C1C72FB3EBE010@phx.gbl>
	 <8c4e68610812010343n62d7380bv52a224d7fefe6a2a@mail.gmail.com>
	 <COL113-W45437A1268795288A8839CBE010@phx.gbl>
	 <EB20F04E-B324-45C9-A7D8-70B4A6205052@apache.org>
	 <COL113-W455B5282CF986CD2D0A97EBE010@phx.gbl>
	 <81F5AE5D-760A-40C9-A4B8-2889C9E9CACD@apache.org>
	 <COL113-W8284C0CC1716F89F669205BE000@phx.gbl>
	 <5816fbdd0812040149x6702c556tf3d6c97e93d1d272@mail.gmail.com>
	 <COL113-W32CCABA46AA29E6A0A03BDBE020@phx.gbl>

------=_Part_4892_10446687.1228398339982
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hi Tiziano,

What error did you get? I think you can extract the text easily with the
code shown below.


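// Imports assume PDFBox 0.7.x; Apache releases use org.apache.pdfbox.* instead.
import java.io.File;
import java.io.FileInputStream;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

FileInputStream fi = new FileInputStream(new File("sample.pdf"));
PDFParser parser = new PDFParser(fi);
parser.parse();                          // parse the PDF into its COS object model
COSDocument cd = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));  // extract the plain text
cd.close();                              // release the parsed document
fi.close();                              // release the file handle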

After getting the value for text you can simply create the Lucene document.

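Document doc = new Document();
// "id" is stored so it can be read back from search hits.
doc.add(new Field("id", "2", Field.Store.YES, Field.Index.TOKENIZED));
// "content" holds the extracted PDF text (the variable text from above); indexed, not stored.
doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));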
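
If it helps, here is a minimal sketch of a complete class that puts the two
snippets together. It assumes PDFBox 0.7.x (org.pdfbox.* packages) and a
Lucene 2.x jar on the classpath. The class name PdfIndexer, the method names
and the "index" directory are only illustrative, and the IndexWriter
constructor used is the 2.x-era one, so adjust it for your version.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Hypothetical class name; it is not part of PDFBox or Lucene.
public class PdfIndexer {

    // Extracts the plain text of a PDF with the same PDFBox calls as above.
    static String extractText(File pdf) throws IOException {
        FileInputStream in = new FileInputStream(pdf);
        COSDocument cos = null;
        try {
            PDFParser parser = new PDFParser(in);
            parser.parse();
            cos = parser.getDocument();
            return new PDFTextStripper().getText(new PDDocument(cos));
        } finally {
            // Always release the parsed document and the file handle.
            if (cos != null) {
                cos.close();
            }
            in.close();
        }
    }

    // Builds the Lucene document: "id" is stored, "content" is indexed only.
    static Document toLuceneDocument(String id, String text) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }

    public static void main(String[] args) throws IOException {
        String text = extractText(new File("sample.pdf"));
        // Assumption: the index is created in a local directory named "index".
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        try {
            writer.addDocument(toLuceneDocument("2", text));
        } finally {
            writer.close();
        }
    }
}
```

Run it from the directory that contains sample.pdf. If it still fails,
please post the full stack trace so we can see which step breaks.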


On Thu, Dec 4, 2008 at 6:20 PM, tiziano bernardi <dk1982@hotmail.it> wrote:

> Thanks, very kind ...
> I've tried that code, but it does not work for me ...
> Could you please send me a simple working class that uses it?
> Thanks
>
> > Date: Thu, 4 Dec 2008 15:19:26 +0530
> > From: kalanir@gmail.com
> > To: java-user@lucene.apache.org
> > Subject: Re: Pdf in Lucene?
> >
> > Hi,
> >
> > In my case I used PDFBox, just to extract the text from PDF document and
> > then I created the Lucene document giving the extracted text. (I didn't use
> > the PDFBox built in Lucene search engine). So I didn't get any
> > incompatibility problems.
> >
> > This blog post shows the way.
> > http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
> >
> > It worked perfect for me.
> >
> > Thanks.


-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa

------=_Part_4892_10446687.1228398339982--