lucene-dev mailing list archives

From IvanDrago <idrag...@gmail.com>
Subject search through single pdf document - return page number
Date Thu, 15 Oct 2009 09:06:39 GMT

Hi,

I need to search a single PDF document for a given string and, if the
string is found, return the number of the page where it occurs.
The search string can be any text in the PDF document.

It is a big document (about 5000 pages), so I'm asking whether this is
possible with Lucene.

I'm using PDFBox and I found a way to do it (extracting the text page by
page and searching each page with indexOf), but it is too slow:
        
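        // f is the PDF file, searchtext is the string to look for
        PDDocument pddDocument = PDDocument.load(f);
        int returnpage = -1; // stays -1 if the string is not found
        try {
            PDFTextStripper textStripper = new PDFTextStripper();
            // the stripper's own getEndPage() defaults to Integer.MAX_VALUE,
            // so take the page count from the document instead
            int lastpage = pddDocument.getNumberOfPages();

            // pages are 1-based; use <= so the last page is searched too
            for (int i = 1; i <= lastpage; i++) {
                textStripper.setStartPage(i);
                textStripper.setEndPage(i);

                String page = textStripper.getText(pddDocument);

                // indexOf returns 0 for a match at the very start of the page
                if (page.indexOf(searchtext) >= 0) {
                    returnpage = i;
                    break;
                }
            }
        } finally {
            pddDocument.close();
        }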
----------------

Is there a way to speed up the search with lucene? Can I use indexing to
solve this problem? thanks.
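
To make the indexing idea concrete, here is a minimal sketch of what I have
in mind. It is plain Java, not Lucene: a small in-memory inverted index that
maps each word to the pages it appears on, so the page text is extracted
once and later searches are only map lookups. The class name PageIndex and
its methods are made up for illustration. It matches pages that contain all
words of the query, not an exact phrase, and the tokenizing is deliberately
naive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene or PDFBox.
final class PageIndex {
    // Splits on anything that is not a letter or a digit.
    private static final String SEPARATORS = "[^\\p{L}\\p{N}]+";

    // word -> sorted set of 1-based page numbers containing that word
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Adds the extracted text of one page to the index. */
    void addPage(int pageNumber, String pageText) {
        for (String token : pageText.toLowerCase(Locale.ROOT).split(SEPARATORS)) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    /**
     * Returns the first page that contains every word of the query
     * (case-insensitive, any order), or -1 if there is no such page
     * or the query contains no words.
     */
    int firstPageContainingAll(String query) {
        SortedSet<Integer> candidates = null;
        for (String token : query.toLowerCase(Locale.ROOT).split(SEPARATORS)) {
            if (token.isEmpty()) {
                continue;
            }
            SortedSet<Integer> pages = postings.get(token);
            if (pages == null) {
                return -1; // this word occurs on no page at all
            }
            if (candidates == null) {
                candidates = new TreeSet<>(pages); // copy, so the index is not modified
            } else {
                candidates.retainAll(pages); // keep only pages that have all words so far
            }
        }
        return (candidates == null || candidates.isEmpty()) ? -1 : candidates.first();
    }
}
```

The page-by-page loop above would call addPage(i, page) once per page
instead of indexOf, and every search after that would go through
firstPageContainingAll. My guess is that the Lucene equivalent would be one
Lucene document per PDF page with the page number kept in a stored field,
but that part is an assumption on my side.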

-- 
View this message in context: http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



