Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <25905217.post@talk.nabble.com>
References: <25905217.post@talk.nabble.com>
Date: Thu, 15 Oct 2009 10:07:41 -0400
Message-ID: <359a92830910150707g21ddf4ffpf4684de93a967b60@mail.gmail.com>
Subject: Re: search trough single pdf document - return page number
From: Erick Erickson <erickerickson@gmail.com>
To: java-dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e64ea9287df77e0475f9cdab

--0016e64ea9287df77e0475f9cdab
Content-Type: text/plain; charset=ISO-8859-1

It depends (tm). Do you want to permanently index this content and search it
multiple times, or is each search a one-off? If the latter, I'd look for
packages specific to handling PDF files, although since Reader takes forever
to search a document, I suspect there's not much joy there.

If you want to parse the file once and search it many times, then yes,
Lucene can help a lot. You could conceivably do this in a memory index if
you didn't want a permanent copy. In this scheme, you'd index the file
before the first search, then use the in-memory index until you were done
searching (assuming you wanted to search for different terms multiple
times). You'd have to do some record-keeping to remember what the start and
end offsets of each page were so you could deal with the case where a phrase
you search for starts on one page and ends on another.
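
A rough sketch of that record-keeping, in plain Java. It assumes the text of
each page has already been extracted into a list of strings (for instance
with PDFBox's PDFTextStripper, one call per page, done once). PageOffsets,
pageAt and find are made-up names, and String.indexOf stands in for the
Lucene index here, so this only illustrates the offset-to-page mapping, not
the indexing itself.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: joins the per-page text into one string and remembers
// where each page starts, so a match offset can be mapped back to a page.
class PageOffsets {
    private final String text;   // all pages, joined by single spaces
    private final int[] starts;  // starts[i] = offset of page i+1 in text

    PageOffsets(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        starts = new int[pages.size()];
        for (int i = 0; i < pages.size(); i++) {
            starts[i] = sb.length();
            // The space lets a phrase match across a page break. Real
            // extracted text would need more whitespace normalization.
            sb.append(pages.get(i)).append(' ');
        }
        text = sb.toString();
    }

    // 1-based number of the page that contains the given character offset.
    int pageAt(int offset) {
        int pos = Arrays.binarySearch(starts, offset);
        // On a miss, binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the 1-based number of the enclosing page.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Returns {firstPage, lastPage} for the first occurrence of phrase, or
    // null if it is absent. firstPage != lastPage means the phrase started
    // on one page and ended on another.
    int[] find(String phrase) {
        if (phrase.isEmpty()) return null;
        int hit = text.indexOf(phrase);
        if (hit < 0) return null;
        return new int[] { pageAt(hit), pageAt(hit + phrase.length() - 1) };
    }

    public static void main(String[] args) {
        PageOffsets doc = new PageOffsets(Arrays.asList(
                "Lucene in a nutshell", "an index maps terms", "to documents"));
        System.out.println(Arrays.toString(doc.find("nutshell"))); // [1, 1]
        System.out.println(Arrays.toString(doc.find("terms to"))); // [2, 3]
    }
}
```

If Lucene does the matching instead, a table like starts would presumably
still be the piece that turns a match position into a page number.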

If this is off base, perhaps you could provide more details...

Erick

On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idraganj@gmail.com> wrote:

>
> Hi,
>
> I have to search a single PDF document for a requested string, and if that
> string is found, I need to return the page number where it was found.
> The requested string can be anything in the PDF document.
>
> It is a big document (about 5000 pages), so I'm asking if that is possible
> with Lucene.
>
> I'm using the PDFBox classes and I found a way to do it (searching with
> indexOf page by page), but it is too slow:
>
>        PDDocument pddDocument=PDDocument.load(f);
>
>        PDFTextStripper textStripper=new PDFTextStripper();
>        // getEndPage() is only the stripper's setting, not the page count
>        int lastpage = pddDocument.getNumberOfPages();
>        String page= null;
>        int found= 0;
>
>        for(int i=1; i<=lastpage ; i++){   // <= so the last page is searched
>            textStripper.setStartPage(i);
>            textStripper.setEndPage(i);
>
>            page = textStripper.getText(pddDocument);
>
>            found = page.indexOf(searchtext);
>
>            // indexOf returns -1 when absent; 0 is a match at the page start
>            if (found>=0) {returnpage= i; break;}
>        }
> ----------------
>
> Is there a way to speed up the search with lucene? Can I use indexing to
> solve this problem? thanks.
>

--0016e64ea9287df77e0475f9cdab--