lucene-dev mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: search trough single pdf document - return page number
Date Fri, 16 Oct 2009 12:33:16 GMT
Well, you have to add another field to each document identifying the PDF it
came from. From there, restricting the search to that PDF just becomes a
matter of adding an AND clause to the query. Of course, how you specify
these is "an exercise left to the reader" <G>.
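
To make the extra-field idea concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class PdfPageIndex and its methods addPage and search are hypothetical names invented for illustration, and the tokenizer is a crude stand-in for an analyzer. Each page is registered under the PDF it came from, and a search returns only pages that satisfy both conditions (term present AND PDF matches). In Lucene itself, the same restriction would typically be expressed as a BooleanQuery in which both clauses are required.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical toy index: one entry per (PDF, page), like one Lucene document per page. */
class PdfPageIndex {
    // term -> (PDF name -> page numbers on which the term occurs)
    private final Map<String, Map<String, TreeSet<Integer>>> postings = new HashMap<>();

    /** Registers the text of one page under the PDF it came from. */
    void addPage(String pdfName, int pageNumber, String text) {
        // Crude tokenizer: lower-case, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(pdfName, p -> new TreeSet<>())
                    .add(pageNumber);
        }
    }

    /** Pages that contain the term AND belong to the given PDF, in ascending order. */
    List<Integer> search(String pdfName, String term) {
        Map<String, TreeSet<Integer>> byPdf = postings.get(term.toLowerCase(Locale.ROOT));
        if (byPdf == null || !byPdf.containsKey(pdfName)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(byPdf.get(pdfName));
    }
}
```

With this layout a single index serves any number of PDFs, so there is no need to build one index per document.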

Erick

On Fri, Oct 16, 2009 at 8:01 AM, IvanDrago <idraganj@gmail.com> wrote:

>
> Hey! I did it! Erick and Robert, you helped a lot. Thanks!
>
> I didn't use LucenePDFDocument. I created a new document for every page in
> a PDF document and added page number info for every page.
>
>        PDDocument pddDocument = PDDocument.load(f);
>        PDFTextStripper textStripper = new PDFTextStripper();
>
>        IndexWriter iwriter = new IndexWriter(index_dir,
>                new StandardAnalyzer(), true);
>
>        long start = new Date().getTime();
>
>        // 350 pages just for test (page numbers start at 1, so use <=)
>        for (int i = 1; i <= 350; i++) {
>            textStripper.setStartPage(i);
>            textStripper.setEndPage(i);
>
>            // fetch one page
>            pagecontent = textStripper.getText(pddDocument);
>            System.out.println("pagecontent: " + pagecontent);
>
>            if (pagecontent != null) {
>                System.out.println("i= " + i);
>                Document doc = new Document();
>
>                // Add the page number; it is stored so it can be read
>                // back from a hit
>                doc.add(new Field("pagenumber", Integer.toString(i),
>                        Field.Store.YES, Field.Index.ANALYZED));
>                // Add the page text; it is indexed for search but not stored
>                doc.add(new Field("content", pagecontent,
>                        Field.Store.NO, Field.Index.ANALYZED));
>
>                iwriter.addDocument(doc);
>            }
>        }
>
>        // Optimize and close the writer to finish building the index
>        iwriter.optimize();
>        iwriter.close();
>
>
>        long end = new Date().getTime();
>
>        System.out.println("Indexing files took "
>        + (end - start) + " milliseconds");
>
>        //just for test I searched for a string cryptography
>        String q = "cryptography";
>
>        Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>        IndexSearcher ind_searcher = new IndexSearcher(fsDir);
>
>        // Build a Query object
>        QueryParser parser = new QueryParser("content",
>                new StandardAnalyzer());
>        Query query = parser.parse(q);
>
>        // Search for the query
>        Hits hits = ind_searcher.search(query);
>
>        // Examine the Hits object to see if there were any matches
>        int hitCount = hits.length();
>        if (hitCount == 0) {
>            System.out.println(
>                "No matches were found for \"" + q + "\"");
>        }
>        else {
>            System.out.println("Hits for \"" +
>                q + "\" were found in pages:");
>
>            // Iterate over the Documents in the Hits object
>            for (int i = 0; i < hitCount; i++) {
>                Document doc = hits.doc(i);
>
>                // Print the value that we stored in the "pagenumber" field.
>                // Unlike the "content" field, it was stored verbatim and
>                // can be retrieved from the hit.
>                System.out.println("  " + (i + 1) + ". " +
>                        doc.get("pagenumber"));
>            }
>        }
>        ind_searcher.close();
>
> --------------------
> I'm using Lucene version 2.9.0.
> You said that Hits is deprecated. Should I use HitCollector instead?
>
> Another question came to mind: what if I want to add another PDF
> document to the search pool? Before searching, I would like to specify
> which PDF document to search and then get back the page number for the
> searched string. I could create a separate index for every document that
> I add to the search pool, but that doesn't sound good to me. Can you
> think of a better way to do that?
>
>
> Erick Erickson wrote:
> >
> > Your search would be on the "contents" field if you use
> > LucenePDFDocument.
> >
> > But on a quick look, LucenePDFDocument doesn't give you any page
> > information. So you'd have to collect that somehow, but I don't see a
> > clear way to.
> >
> > Doing it manually, you could do something like:
> >
> > Document doc = new Document();
> > for (each page in the document) {
> >   doc.add("contents", <text for page>);
> >   // record the offset of the last term in the page you just indexed
> > }
> > doc.add("metadata", <string representation of the page offsets>);
> > iw.addDocument(doc);
> >
> > Now, when you search you can get the offsets of the matching term,
> > then look in your metadata field for the page number.
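
The lookup step described above can be sketched in plain Java. This is only an illustration under stated assumptions: PageOffsets, addPage, allText and pageOf are hypothetical names, the end offsets are kept in a list rather than serialized into a "metadata" field, and String.indexOf stands in for whatever match offset the search would report. The sketch records the end offset of each page while concatenating the text, then binary-searches those offsets to turn a match offset back into a page number.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps a character offset in the concatenated text back to a page. */
class PageOffsets {
    // endOffsets.get(k) is the offset just past the last character of page k + 1.
    private final List<Integer> endOffsets = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    /** Appends one page of text and records where that page ends. */
    void addPage(String pageText) {
        text.append(pageText);
        endOffsets.add(text.length());
    }

    /** The concatenated text of all pages added so far. */
    String allText() {
        return text.toString();
    }

    /** Returns the 1-based page containing the offset, or -1 if the offset is out of range. */
    int pageOf(int offset) {
        if (offset < 0 || offset >= text.length()) {
            return -1;
        }
        // Binary search for the first page whose end offset lies beyond the match offset.
        int lo = 0;
        int hi = endOffsets.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (offset < endOffsets.get(mid)) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo + 1;
    }
}
```

For example, after adding every page, `pageOf(allText().indexOf("cryptography"))` gives the page of the first occurrence, or -1 when the word is absent.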
> >
> > Perhaps you could use the LucenePDFDocument in conjunction with this
> > somehow, but I confess that I've never used it, so it's not clear to me
> > how you'd do this.
> >
> > Incidentally, the Hits object is deprecated. What version of Lucene are
> > you intending to use?
> >
> > Best
> > Erick
> >
> > On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <idraganj@gmail.com> wrote:
> >
> >>
> >> Thanks for the reply, Erick.
> >>
> >> I would like to index this content permanently and search it multiple
> >> times for different terms, so I do want a permanent copy.
> >>
> >> My problem is that I don't know how to retrieve the page number where
> >> the searched string was found, so if you could help with that issue,
> >> that would be great.
> >>
> >> // I would start like this:
> >> // This part of the code would create the index, right?
> >> Document luceneDocument = LucenePDFDocument.getDocument( f );
> >> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(),
> >>         true);
> >> iwriter.addDocument(luceneDocument);
> >> iwriter.close();
> >>
> >> // and now for the search:
> >> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
> >> IndexSearcher ind_search = new IndexSearcher(fsDir);
> >>
> >> // I'm not sure whether "fieldname" should be the string I'm searching for?
> >> QueryParser parser = new QueryParser("fieldname",
> >>         new StandardAnalyzer());
> >> Query query = parser.parse(q);
> >>
> >> Hits hits = ind_search.search(query);
> >>
> >> // and I'm stuck here. I don't know how to retrieve the page number.
> >>
> >> Erick Erickson wrote:
> >> >
> >> > It depends (tm). Do you want to permanently index this content and
> >> > search it multiple times, or is each search a one-off? If the latter,
> >> > I'd look for packages specific to handling PDF files, although since
> >> > Reader takes forever to search a document, I suspect there's not much
> >> > joy there.
> >> >
> >> > If you want to parse the file once and search it many times, then yes,
> >> > Lucene can help a lot. You could conceivably do this in an in-memory
> >> > index if you didn't want a permanent copy. In this scheme, you'd index
> >> > the file before the first search, then use the in-memory index until
> >> > you were done searching (assuming you wanted to search for different
> >> > terms multiple times). You'd have to do some record-keeping to remember
> >> > what the start and end offsets of each page were, so you could deal
> >> > with the case where a phrase you searched for started on one page and
> >> > ended on another.
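
The record-keeping for phrases that cross a page break could look roughly like the plain-Java sketch below. It makes several assumptions that are not from the thread: PageSpans, addPage and findPhrase are hypothetical names, pages are joined with a single space, and a simple substring search stands in for a real phrase query. Each page's start offset is stored in a sorted map, and the first and last character of a match are looked up separately, so a match can report a different start page and end page.

```java
import java.util.TreeMap;

/** Hypothetical record-keeping for page boundaries inside one concatenated text. */
class PageSpans {
    // Start offset of each page -> 1-based page number.
    private final TreeMap<Integer, Integer> pageByStart = new TreeMap<>();
    private final StringBuilder text = new StringBuilder();

    /** Appends one page and remembers the offset at which it starts. */
    void addPage(String pageText) {
        pageByStart.put(text.length(), pageByStart.size() + 1);
        // A single space joins consecutive pages so a phrase can match across the break.
        text.append(pageText).append(' ');
    }

    /** Returns {firstPage, lastPage} for the first occurrence of the phrase, or null if absent. */
    int[] findPhrase(String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        int start = text.indexOf(phrase);
        if (start < 0) {
            return null;
        }
        int end = start + phrase.length() - 1; // offset of the last matched character
        return new int[] {
            pageByStart.floorEntry(start).getValue(),
            pageByStart.floorEntry(end).getValue()
        };
    }
}
```

A result such as {1, 2} means the phrase begins on page 1 and finishes on page 2; the caller can then decide which of the two pages to report.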
> >> >
> >> > If this is off base, perhaps you could provide more details...
> >> >
> >> > Erick
> >> >
> >> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idraganj@gmail.com> wrote:
> >> >
> >> >>
> >> >> Hi,
> >> >>
> >> >> I have to search a single PDF document for a requested string, and
> >> >> if that string is found, I need to return the page number where it
> >> >> was found.
> >> >> The requested string can be anything in the PDF document.
> >> >>
> >> >> It is a big document (about 5000 pages), so I'm asking whether that
> >> >> is possible with Lucene.
> >> >>
> >> >> I'm using the PDFBox library and I found a way to do it (searching
> >> >> with indexOf page by page), but it is too slow:
> >> >>
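> >> >>        PDDocument pddDocument = PDDocument.load(f);
> >> >>
> >> >>        PDFTextStripper textStripper = new PDFTextStripper();
> >> >>        // getEndPage() only reports the stripper's configured end page,
> >> >>        // so ask the document itself for its page count.
> >> >>        int lastpage = pddDocument.getNumberOfPages();
> >> >>        String page = null;
> >> >>        int found = 0;
> >> >>
> >> >>        // Page numbers start at 1, so include the last page.
> >> >>        for (int i = 1; i <= lastpage; i++) {
> >> >>            textStripper.setStartPage(i);
> >> >>            textStripper.setEndPage(i);
> >> >>
> >> >>            page = textStripper.getText(pddDocument);
> >> >>
> >> >>            found = page.indexOf(searchtext);
> >> >>
> >> >>            // indexOf returns -1 when absent; 0 is a match at the start.
> >> >>            if (found >= 0) { returnpage = i; break; }
> >> >>        }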
> >> >> ----------------
> >> >>
> >> >> Is there a way to speed up the search with Lucene? Can I use
> >> >> indexing to solve this problem? Thanks.
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
> >> >> Sent from the Lucene - Java Developer mailing list archive at
> >> >> Nabble.com.
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>
> >> >>
> >> >
> >> >
> >>
> >
> >
>
