lucene-dev mailing list archives

From Erick Erickson <>
Subject Re: search trough single pdf document - return page number
Date Thu, 15 Oct 2009 14:07:41 GMT
It depends (tm). Do you want to permanently index this content and search it
multiple times, or is each search a one-off? If the latter, I'd look for
packages specific to handling PDF files, although since Reader takes forever
to search a document, I suspect there's not much joy there.
If you want to parse the file once and search it many times, then yes,
Lucene can help a lot. You could conceivably do this in a memory index if
you didn't want a permanent copy. In this scheme, you'd index the file
before the first search, then use the in-memory index until you were done
searching (assuming you wanted to search for different terms multiple
times). You'd have to do some record-keeping to remember the start and
end offsets of each page so you could deal with the case where a phrase
you search for starts on one page and ends on another.
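
To make the record-keeping concrete, here is a rough sketch in plain Java. It is not Lucene code: it stands in a simple String.indexOf for the index lookup, and the class and method names (PageOffsetIndex, firstPageOf, pageAt) are made up for illustration. The assumption is that you extract the text of each page once (for example with PDFTextStripper, as in the quoted code), join the pages into one string, and remember the offset at which each page starts. A match offset can then be mapped back to a page number, even when the phrase runs across a page boundary.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: joins per-page text and records where each page starts. */
final class PageOffsetIndex {
    private final String text;       // all pages joined, one space between pages
    private final int[] pageStarts;  // pageStarts[i] = offset in text where page i+1 begins

    PageOffsetIndex(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        pageStarts = new int[pages.size()];
        for (int i = 0; i < pages.size(); i++) {
            if (i > 0) {
                sb.append(' ');      // separator, so a phrase can match across a page break
            }
            pageStarts[i] = sb.length();
            sb.append(pages.get(i));
        }
        text = sb.toString();
    }

    /** Returns the 1-based page on which the first match starts, or -1 if there is none. */
    int firstPageOf(String query) {
        int offset = text.indexOf(query);
        return offset < 0 ? -1 : pageAt(offset);
    }

    /** Maps a character offset in the joined text to a 1-based page number. */
    int pageAt(int offset) {
        int pos = Arrays.binarySearch(pageStarts, offset);
        // Exact hit: pos is the page index. Otherwise pos == -(insertionPoint) - 1,
        // and the page containing the offset is the one just before the insertion point.
        int pageIndex = pos >= 0 ? pos : -pos - 2;
        return pageIndex + 1;
    }
}
```

With pages "the quick brown", "fox jumps" and "over the lazy dog", firstPageOf("brown fox") reports page 1, where the phrase starts. Calling pageAt on the offset of the last matched character would give the page where it ends. With a real Lucene index you would keep the same table of page start offsets and map the offsets of the matching terms through it.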

If this is off base, perhaps you could provide more details...


On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <> wrote:

> Hi,
> I have to search a single pdf document for a requested string and if that
> string is found, I need to return the page number where that string was
> found.
> The requested string can be anything in the pdf document.
> It is a big document (about 5000 pages) so I'm asking if that is possible
> with lucene.
> I'm using the pdfbox class and I found a way to do it (searching with indexOf
> page by page) but it is too slow:
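>        PDDocument pddDocument = PDDocument.load(f);
>        PDFTextStripper textStripper = new PDFTextStripper();
>        // getEndPage() returns the stripper's configured end page, not the
>        // document's page count, so ask the document itself
>        int lastpage = pddDocument.getNumberOfPages();
>        int returnpage = -1;  // stays -1 if the string is not found
>        for (int i = 1; i <= lastpage; i++) {  // pages are 1-based; include the last one
>            textStripper.setStartPage(i);
>            textStripper.setEndPage(i);
>            String page = textStripper.getText(pddDocument);
>            // indexOf returns 0 for a match at the very start of the page
>            // and -1 for no match, so test with >= 0
>            if (page.indexOf(searchtext) >= 0) { returnpage = i; break; }
>        }
>        pddDocument.close();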
> ----------------
> Is there a way to speed up the search with lucene? Can I use indexing to
> solve this problem? thanks.