lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: getting problem while indexing pdf files with pdfbox
Date Tue, 17 Jul 2007 13:52:12 GMT
Offhand I'd assume that your problem is using PDFbox. Have you
tried printing out the docText string you get  back from

docText = stripper.getText(new PDDocument(cosDoc))?

I'd recommend you assure yourself that you get valid text back from
the PDF document before worrying about indexing it.

Best
Erick

On 7/17/07, neetika <neetika.markanday@iflexsolutions.com> wrote:
>
>
> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf
>
> hi all,
>
> i am able to convert a pdf in to a text file using pdfbox.
> and this is the code that I used, but I am not able to index it
>
> // code for parsing and making index
>
>                 public Document getDocument(InputStream is)
>                 {
>                 COSDocument cosDoc = null;
>                 try {
>                         PDFParser parser = new PDFParser(is);
>                         parser.parse();
>                 cosDoc = parser.getDocument();
>                 }
>                 catch (IOException e) {
>                 e.printStackTrace();
>                                 }
>                 String docText = null;
>                 try {
>                 PDFTextStripper stripper = new PDFTextStripper();
>                 docText = stripper.getText(new PDDocument(cosDoc));
>                 }
>                 catch (IOException e) {
>                         e.printStackTrace();
>                 }
>                 Document doc = new Document();
>                 if (docText != null) {
>                 doc.add(new Field("body", docText, Field.Store.YES,
>                         Field.Index.TOKENIZED));
>                 }
>                 return doc;
>                 }
>
>                 public static void main(String[] args) throws Exception {
>                         TestPDFParser handler = new TestPDFParser();
>
>                 Document doc = handler.getDocument(new FileInputStream(new
> File("D:\\lucenePdf\\DRra0026.pdf")));
>
>                 System.out.println(doc);
>
>                 //Following code is for making index
>
>                 IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
> new
> StandardAnalyzer(), true);
>
>                 f_writer.addDocument(doc);
>
>                 }
>                 }
> //code for searching a particular string..
>
> public static void main(String[] args) throws Exception {
>         String indexDir = "D:\\lucenePdf";
>         String q = "RA0083";
>
>
>         Directory fsDir = FSDirectory.getDirectory(indexDir);
>         IndexSearcher is = new IndexSearcher(fsDir);
>
>         Query query = new QueryParser("body", new
> StandardAnalyzer()).parse(q);
>
>         Hits hits = is.search(query);
>         System.out.println("Found " + hits.length() + " documents that
> matched query '" + q + "':");
>         for (int i = 0; i < hits.length(); i++) {
>             Document doc = hits.doc(i);
>
>         }
>     }
>
>
> When I run the above code...I get folowing output as a result of running
> indexer class
>
>
> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT
> : RA0083
> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>
> and following  files are generated in the specified path..
>
> segments.gen
> write.lock
> segments_4
>
>
> but when I run the search class it gives the result as:
>
> Found 0 documents that matched query 'RA0083':
>
> I am also attaching the corresponding pdf file for reference.
> It seems as the index is not getting created..
>
> Please help me with some of your inputs,it will be very helpfull for me.
> --
> View this message in context:
> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message