lucene-java-user mailing list archives

From neetika <neetika.markan...@iflexsolutions.com>
Subject Re: getting problem while indexing pdf files with pdfbox
Date Wed, 18 Jul 2007 04:29:36 GMT

Hi Erick,

It works now and I get the expected results.
The problem was that I forgot to close the writer, so the index file (.cfs)
was never generated.
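
For anyone finding this thread later: in the indexer quoted below, the missing step was a call to f_writer.close() after f_writer.addDocument(doc). The same effect can be reproduced without Lucene. The sketch below is only an analogy: it uses java.io.BufferedWriter as a stand-in for a buffering writer, and the class name BufferedCloseDemo and the helper sizesBeforeAndAfterClose are made up for illustration. None of it is Lucene's API. It writes a short string and compares the file size while the writer is still open with the size after close().

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Stand-in demo: BufferedWriter plays the role of "a writer that buffers".
// It is NOT Lucene's IndexWriter; it only illustrates the buffering effect.
class BufferedCloseDemo {

    // Returns { file size while the writer is still open, file size after close() }.
    static long[] sizesBeforeAndAfterClose(String text) {
        try {
            Path file = Files.createTempFile("close-demo", ".txt");
            try {
                long before;
                // try-with-resources calls close() even if write() throws.
                try (BufferedWriter writer =
                         Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(text);
                    // The text is still in the writer's in-memory buffer here.
                    before = Files.size(file);
                }
                // close() has flushed the buffer to disk by this point.
                long after = Files.size(file);
                return new long[] { before, after };
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = sizesBeforeAndAfterClose("PAYOUT : RA0083");
        System.out.println("before close: " + sizes[0] + " bytes");
        System.out.println("after close:  " + sizes[1] + " bytes");
    }
}
```

The try-with-resources block guarantees close() even when write() throws. For the IndexWriter in the quoted code, putting f_writer.close() in a finally block is one way to get the same guarantee.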

Thanks a lot for the timely help.

Regards,
Neetika


Erick Erickson wrote:
> 
>  You have NOT supplied an example of the text you extracted
> from the document. But let's assume that the interesting
> string is exactly what you expect.
> 
> Have you looked at your index with Luke to see if the data is there?
> I *strongly* suggest you get a copy of Luke (google lucene luke) to
> examine indexes with.
> 
> The existence of the write.lock file suggests that you haven't closed your
> index prior to searching it. Although flushing it would probably work. Be
> aware that you cannot see changes to an index if the reader you use
> has been opened before the indexing operation. Also, there is some period
> of time when the indexed data is buffered by the writer, and I'm unsure
> (but doubt) it's available until it's been flushed.
> 
> I suspect that your problem is not related to PDF, but rather to whether
> you've properly indexed data and closed your index prior to searching
> it.....
> 
> The other possibility is that your analyzer is parsing things
> "interestingly". StandardAnalyzer() does some interesting things when
> tokenizing, including lowercasing the input stream. Although that shouldn't
> have been a problem since you use the same analyzer for indexing and
> searching.
> 
> Also, try query.toString() to see what is actually searched; that often
> gives insights. The aforementioned Luke will allow you to submit queries to
> the index, and will show you the actual query that was produced.
> What are the file sizes of your index files?
> 
> 
> Best
> Erick
> 
> On 7/17/07, neetika <neetika.markanday@iflexsolutions.com> wrote:
>>
>>
>> Hi Erick,
>>
>> Before indexing I printed the doc, and I have given the output as well.
>> It prints fine.
>> Kindly check my post again, in particular the following...
>>
>> " System.out.println(doc);
>>                  //Following code is for making index"
>>
>> and the corresponding output is...
>>
>>
>> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
>> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
>> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
>> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
>> 0000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>>
>> which is as expected... but my problem is that the index file is not getting
>> generated.
>>
>> Please help
>>
>>
>>
>> Erick Erickson wrote:
>> >
>> > Offhand I'd assume that your problem is using PDFbox. Have you
>> > tried printing out the docText string you get  back from
>> >
>> > docText = stripper.getText(new PDDocument(cosDoc))?
>> >
>> > I'd recommend you assure yourself that you get valid text back from
>> > the PDF document before worrying about indexing it.
>> >
>> > Best
>> > Erick
>> >
>> > On 7/17/07, neetika <neetika.markanday@iflexsolutions.com> wrote:
>> >>
>> >>
>> >> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf
>> >>
>> >> hi all,
>> >>
>> >> I am able to convert a PDF into a text file using PDFBox.
>> >> This is the code that I used, but I am not able to index it.
>> >>
>> >> // code for parsing and making index
>> >>
>> >> public Document getDocument(InputStream is)
>> >> {
>> >>     COSDocument cosDoc = null;
>> >>     try {
>> >>         PDFParser parser = new PDFParser(is);
>> >>         parser.parse();
>> >>         cosDoc = parser.getDocument();
>> >>     }
>> >>     catch (IOException e) {
>> >>         e.printStackTrace();
>> >>     }
>> >>     String docText = null;
>> >>     try {
>> >>         PDFTextStripper stripper = new PDFTextStripper();
>> >>         docText = stripper.getText(new PDDocument(cosDoc));
>> >>     }
>> >>     catch (IOException e) {
>> >>         e.printStackTrace();
>> >>     }
>> >>     Document doc = new Document();
>> >>     if (docText != null) {
>> >>         doc.add(new Field("body", docText, Field.Store.YES,
>> >>                 Field.Index.TOKENIZED));
>> >>     }
>> >>     return doc;
>> >> }
>> >>
>> >> public static void main(String[] args) throws Exception {
>> >>     TestPDFParser handler = new TestPDFParser();
>> >>
>> >>     Document doc = handler.getDocument(new FileInputStream(
>> >>             new File("D:\\lucenePdf\\DRra0026.pdf")));
>> >>
>> >>     System.out.println(doc);
>> >>
>> >>     //Following code is for making index
>> >>
>> >>     IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
>> >>             new StandardAnalyzer(), true);
>> >>
>> >>     f_writer.addDocument(doc);
>> >>
>> >> }
>> >> }
>> >> //code for searching a particular string..
>> >>
>> >> public static void main(String[] args) throws Exception {
>> >>         String indexDir = "D:\\lucenePdf";
>> >>         String q = "RA0083";
>> >>
>> >>
>> >>         Directory fsDir = FSDirectory.getDirectory(indexDir);
>> >>         IndexSearcher is = new IndexSearcher(fsDir);
>> >>
>> >>         Query query = new QueryParser("body",
>> >>                 new StandardAnalyzer()).parse(q);
>> >>
>> >>         Hits hits = is.search(query);
>> >>         System.out.println("Found " + hits.length()
>> >>                 + " documents that matched query '" + q + "':");
>> >>         for (int i = 0; i < hits.length(); i++) {
>> >>             Document doc = hits.doc(i);
>> >>
>> >>         }
>> >>     }
>> >>
>> >>
>> >> When I run the above code, I get the following output from running
>> >> the indexer class:
>> >>
>> >>
>> >>
>> >> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
>> >> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
>> >> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
>> >> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
>> >> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>> >>
>> >> and the following files are generated in the specified path:
>> >>
>> >> segments.gen
>> >> write.lock
>> >> segments_4
>> >>
>> >>
>> >> but when I run the search class it gives the result as:
>> >>
>> >> Found 0 documents that matched query 'RA0083':
>> >>
>> >> I am also attaching the corresponding PDF file for reference.
>> >> It seems that the index is not getting created.
>> >>
>> >> Please help me with your inputs; it will be very helpful for me.
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
>> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11653883
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11662355
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



