lucene-dev mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: search trough single pdf document - return page number
Date Thu, 15 Oct 2009 14:07:41 GMT
It depends (tm). Do you want to permanently index this content and search it
multiple times, or is each search a one-off? If the latter, I'd look for
packages specific to handling PDF files, although since Reader takes forever
to search a document, I suspect there's not much joy there.

If you want to parse the file once and search it many times, then yes,
Lucene can help a lot. You could conceivably do this in a memory index if
you didn't want a permanent copy. In this scheme, you'd index the file
before the first search, then use the in-memory index until you were done
searching (assuming you wanted to search for different terms multiple
times). You'd have to do some record-keeping to remember what the start and
end offsets of each page were so you could deal with the case where a phrase
you search for started on one page and ended on another.
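
To make the record-keeping idea concrete, here is a minimal sketch in plain
Java. It does not use Lucene at all; it only shows the bookkeeping: extract each
page's text once, append it to one buffer, and remember the offset where each
page starts. The class name PageOffsets and its methods are hypothetical, made
up for illustration. It also assumes the page texts can simply be concatenated;
with real PDF output you would probably need to normalize whitespace at page
breaks before a cross-page phrase would match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: joins per-page text and remembers where each page starts. */
final class PageOffsets {
    private final StringBuilder text = new StringBuilder();
    // pageStarts.get(i) is the offset in 'text' where page i+1 begins.
    private final List<Integer> pageStarts = new ArrayList<>();

    /** Appends the text of the next page (pages are numbered from 1). */
    void addPage(String pageText) {
        pageStarts.add(text.length());
        text.append(pageText);
    }

    /** Returns the 1-based page that contains the given character offset. */
    int pageOf(int offset) {
        // Binary search for the last page whose start offset is <= offset.
        int lo = 0, hi = pageStarts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (pageStarts.get(mid) <= offset) lo = mid; else hi = mid - 1;
        }
        return lo + 1;
    }

    /** Returns {firstPage, lastPage} of the first occurrence, or null if absent. */
    int[] find(String phrase) {
        int start = text.indexOf(phrase);
        if (start < 0 || phrase.isEmpty()) return null;
        int end = start + phrase.length() - 1;
        return new int[] { pageOf(start), pageOf(end) };
    }
}
```

The expensive part (text extraction) then happens once per document, and each
later search is a scan of an in-memory string plus a binary search over the page
starts. With a Lucene index the same page-start table could presumably be
combined with stored term offsets, but that part is not shown here.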

If this is off base, perhaps you could provide more details...

Erick

On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idraganj@gmail.com> wrote:

>
> Hi,
>
> I have to search a single PDF document for a requested string, and if that
> string is found, I need to return the page number where it was found.
> The requested string can be anything in the PDF document.
>
> It is a big document (about 5000 pages), so I'm asking whether that is
> possible with Lucene.
>
> I'm using the PDFBox classes and I found a way to do it (searching with
> indexOf page by page), but it is too slow:
>
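>        // Load the PDF; f is the java.io.File to search.
>        PDDocument pddDocument = PDDocument.load(f);
>
>        PDFTextStripper textStripper = new PDFTextStripper();
>        // getEndPage() defaults to Integer.MAX_VALUE, so use the real page count.
>        int lastpage = pddDocument.getNumberOfPages();
>        String page = null;
>        int found = -1;
>
>        // Pages are 1-based; use <= so the last page is searched too.
>        for (int i = 1; i <= lastpage; i++) {
>            textStripper.setStartPage(i);
>            textStripper.setEndPage(i);
>
>            page = textStripper.getText(pddDocument);
>
>            found = page.indexOf(searchtext);
>
>            // indexOf returns -1 when absent; 0 is a valid match position.
>            if (found >= 0) { returnpage = i; break; }
>        }
>        pddDocument.close();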
> ----------------
>
> Is there a way to speed up the search with Lucene? Can I use indexing to
> solve this problem? Thanks.
>
