lucene-dev mailing list archives

From IvanDrago
Subject search through single pdf document - return page number
Date Thu, 15 Oct 2009 09:06:39 GMT


I have to search a single PDF document for a requested string and, if that
string is found, return the number of the page where it was found.
The requested string can be anything in the PDF document.

It is a big document (about 5000 pages), so I'm asking whether that is
possible with Lucene.

I'm using PDFBox and I found a way to do it (a substring search with
indexOf, page by page), but it is too slow:
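        PDDocument pddDocument = PDDocument.load(f);
        PDFTextStripper textStripper = new PDFTextStripper();
        // getEndPage() defaults to Integer.MAX_VALUE, so ask the document
        int lastpage = pddDocument.getNumberOfPages();
        int returnpage = -1;    // -1 means "not found"
        for (int i = 1; i <= lastpage; i++) {
            // restrict extraction to page i only (pages are 1-based);
            // without this, getText() extracts the whole document each time
            textStripper.setStartPage(i);
            textStripper.setEndPage(i);
            String page = textStripper.getText(pddDocument);
            // indexOf returns 0 for a match at the very start, so test >= 0
            if (page.indexOf(searchtext) >= 0) {
                returnpage = i;
                break;
            }
        }
        pddDocument.close();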

Is there a way to speed up the search with Lucene? Can I use indexing to
solve this problem? Thanks.
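
To make the indexing idea concrete, here is a minimal sketch in plain Java.
It is not Lucene's API, only a toy stand-in for the kind of word-to-page
index Lucene would maintain. It assumes the text of every page has already
been extracted once (for example with PDFTextStripper) into a list, and the
names PageIndex and firstPage are hypothetical. Unlike indexOf, it matches
whole words and ignores case, which is typical of a word-based index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: lower-cased word -> 1-based pages containing it. */
final class PageIndex {
    // Each page stored as " word word word " so phrases can be matched on word boundaries.
    private final List<String> pages = new ArrayList<>();
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** pageTexts.get(0) is assumed to hold the extracted text of page 1, and so on. */
    PageIndex(List<String> pageTexts) {
        for (int i = 0; i < pageTexts.size(); i++) {
            List<String> words = tokenize(pageTexts.get(i));
            pages.add(" " + String.join(" ", words) + " ");
            for (String word : words) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(i + 1);
            }
        }
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Returns the first page containing the query as a whole-word phrase, or -1. */
    int firstPage(String query) {
        List<String> words = tokenize(query);
        if (words.isEmpty()) {
            return -1;
        }
        // Step 1: keep only pages that contain every query word.
        TreeSet<Integer> candidates = null;
        for (String w : words) {
            TreeSet<Integer> hits = postings.get(w);
            if (hits == null) {
                return -1;              // a word that occurs nowhere rules out every page
            }
            if (candidates == null) {
                candidates = new TreeSet<>(hits);
            } else {
                candidates.retainAll(hits);
            }
        }
        // Step 2: confirm the words appear consecutively, in ascending page order.
        String phrase = " " + String.join(" ", words) + " ";
        for (int page : candidates) {
            if (pages.get(page - 1).contains(phrase)) {
                return page;
            }
        }
        return -1;
    }
}
```

The index is built once, so the cost of extracting and scanning 5000 pages
is paid a single time; each later query only looks at the pages that contain
all of its words.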
