From: Erick Erickson
To: java-dev@lucene.apache.org
Date: Thu, 15 Oct 2009 10:07:41 -0400
Subject: Re: search trough single pdf document - return page number
Message-ID: <359a92830910150707g21ddf4ffpf4684de93a967b60@mail.gmail.com>
In-Reply-To: <25905217.post@talk.nabble.com>

It depends (tm). Do you want to permanently index this content and search it
multiple times, or is each search a one-off? If the latter, I'd look for
packages specific to handling PDF files. Then again, Reader takes forever to
search a document, so I suspect there's not much joy there.

If you want to parse the file once and search it many times, then yes, Lucene
can help a lot. You could conceivably do this in a memory index if you didn't
want a permanent copy. In this scheme, you'd index the file before the first
search, then use the in-memory index until you were done searching (assuming
you wanted to search for different terms multiple times). You'd have to do
some record-keeping to remember the start and end offset of each page so you
could deal with the case where a phrase you search for starts on one page and
ends on another.

If this is off base, perhaps you could provide more details...

Erick

On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idraganj@gmail.com> wrote:

> Hi,
>
> I have to search a single pdf document for a requested string and, if that
> string is found, I need to return the page number where that string was
> found. The requested string can be anything in the pdf document.
>
> It is a big document (about 5000 pages), so I'm asking if that is possible
> with lucene.
>
> I'm using the pdfbox classes and I found a way to do it (searching with
> instring page by page), but it is too slow:
>
>        PDDocument pddDocument = PDDocument.load(f);
>
>        PDFTextStripper textStripper = new PDFTextStripper();
>        // getEndPage() only reports the stripper's configured end page,
>        // so take the real page count from the document instead.
>        int lastpage = pddDocument.getNumberOfPages();
>        String page = null;
>        int found = 0;
>
>        for (int i = 1; i <= lastpage; i++) {  // <= so the last page is searched too
>            textStripper.setStartPage(i);
>            textStripper.setEndPage(i);
>
>            page = textStripper.getText(pddDocument);
>
>            found = page.indexOf(searchtext);
>
>            // indexOf returns -1 when there is no match; 0 is a valid hit
>            if (found >= 0) { returnpage = i; break; }
>        }
> ----------------
>
> Is there a way to speed up the search with lucene? Can I use indexing to
> solve this problem? thanks.
>
> --
> View this message in context:
> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
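
The record-keeping described in the reply (remember where each page starts so that a hit, including a phrase that runs from one page onto the next, can be mapped back to a page number) might look roughly like the sketch below. This is a plain-Java stand-in, not Lucene: the class `PageOffsetIndex` and its method `findPage` are hypothetical names invented for illustration, and the page texts are assumed to have been extracted once up front (for example with PDFBox) and passed in as a list. With a real Lucene index the search step would be different, but the offset-to-page lookup would presumably stay the same.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene or PDFBox. */
class PageOffsetIndex {
    private final String text;       // all page texts joined, built once
    private final int[] pageStarts;  // pageStarts[i] = offset where page i+1 begins

    PageOffsetIndex(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        pageStarts = new int[pages.size()];
        for (int i = 0; i < pages.size(); i++) {
            pageStarts[i] = sb.length();
            // Assumption: a single space separates pages, so words on adjacent
            // pages do not fuse and a phrase can still match across the break.
            sb.append(pages.get(i)).append(' ');
        }
        text = sb.toString();
    }

    /** Returns the 1-based page on which the first match starts, or -1 if there is none. */
    int findPage(String query) {
        if (query.isEmpty()) {
            return -1;
        }
        int hit = text.indexOf(query);
        return hit < 0 ? -1 : pageOf(hit);
    }

    /** Maps a character offset in the joined text to a 1-based page number. */
    private int pageOf(int offset) {
        int pos = Arrays.binarySearch(pageStarts, offset);
        // A miss returns -(insertionPoint) - 1; the containing page is the
        // one just before the insertion point.
        int index = pos >= 0 ? pos : -pos - 2;
        return index + 1;
    }
}
```

The extraction cost is paid once in the constructor; each later call to `findPage` scans only the in-memory text. A phrase that starts on one page and ends on the next is reported on the page where it starts.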