lucene-dev mailing list archives

From IvanDrago <idrag...@gmail.com>
Subject Re: search trough single pdf document - return page number
Date Fri, 16 Oct 2009 12:27:34 GMT

Proximity queries that span pages are not a concern in my case.

I asked another question at the bottom of my last post. Could you comment on
that if you have any ideas?


Erick Erickson wrote:
> 
> Glad things are progressing. The only problem here will be proximity
> queries that span pages. Say the last word on page 10 is "salmon" and
> the first word on page 11 is "fishing". Structuring your index this way
> means a proximity search for "salmon fishing" won't find that match.
>
> If that's not a concern, then there's no reason to complexify the
> situation.
> 
> FWIW
> Erick
> 
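To make the page-boundary case concrete, here is a minimal sketch that uses plain strings instead of Lucene. The class name `CrossPagePhrase`, the method `pagesContaining`, and the sample page text are all made up for illustration; the substring check only stands in for a phrase query and is not how Lucene evaluates proximity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one-document-per-page indexing (no Lucene involved).
class CrossPagePhrase {

    // Returns the 1-based numbers of the pages whose own text contains the phrase.
    static List<Integer> pagesContaining(List<String> pages, String phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (pages.get(i).contains(phrase)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample text: "salmon" ends one page, "fishing" starts the next.
        List<String> pages = Arrays.asList(
                "we spent the week catching salmon",
                "fishing trips start at dawn");

        // Searching each page on its own misses the phrase that straddles the break.
        System.out.println(pagesContaining(pages, "salmon fishing")); // prints []

        // The same text treated as a single unit does contain the phrase.
        String whole = String.join(" ", pages);
        System.out.println(whole.contains("salmon fishing")); // prints true
    }
}
```

Single-word searches are unaffected, which is why the per-page layout is fine when phrases across page breaks do not matter.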
> On Fri, Oct 16, 2009 at 8:01 AM, IvanDrago <idraganj@gmail.com> wrote:
> 
>>
>> Hey! I did it! Erick and Robert, you helped a lot. Thanks!
>>
>> I didn't use LucenePDFDocument. I created a new document for every page
>> in a PDF document and added page number info for every page.
>>
>>        PDDocument pddDocument = PDDocument.load(f);
>>        PDFTextStripper textStripper = new PDFTextStripper();
>>
>>        IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
>>
>>        long start = new Date().getTime();
>>
>>        // 350 pages just for test
>>        for (int i = 1; i <= 350; i++) {
>>            textStripper.setStartPage(i);
>>            textStripper.setEndPage(i);
>>
>>            // fetch one page
>>            String pagecontent = textStripper.getText(pddDocument);
>>            System.out.println("pagecontent: " + pagecontent);
>>
>>            if (pagecontent != null) {
>>                System.out.println("i= " + i);
>>                Document doc = new Document();
>>
>>                // Add the page number; it is stored so it can be read back from a hit
>>                doc.add(new Field("pagenumber", Integer.toString(i),
>>                        Field.Store.YES, Field.Index.ANALYZED));
>>                // The page text is indexed for searching but not stored
>>                doc.add(new Field("content", pagecontent,
>>                        Field.Store.NO, Field.Index.ANALYZED));
>>
>>                iwriter.addDocument(doc);
>>            }
>>        }
>>
>>        // Release the PDF once all pages have been read
>>        pddDocument.close();
>>
>>        // Optimize and close the writer to finish building the index
>>        iwriter.optimize();
>>        iwriter.close();
>>
>>        long end = new Date().getTime();
>>
>>        System.out.println("Indexing files took "
>>                + (end - start) + " milliseconds");
>>
>>        // just for test I searched for the string "cryptography"
>>        String q = "cryptography";
>>
>>        Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>>        IndexSearcher ind_searcher = new IndexSearcher(fsDir);
>>
>>        // Build a Query object
>>        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
>>        Query query = parser.parse(q);
>>
>>        // Search for the query
>>        Hits hits = ind_searcher.search(query);
>>
>>        // Examine the Hits object to see if there were any matches
>>        int hitCount = hits.length();
>>        if (hitCount == 0) {
>>            System.out.println(
>>                "No matches were found for \"" + q + "\"");
>>        }
>>        else {
>>            System.out.println("Hits for \"" +
>>                q + "\" were found in pages:");
>>
>>            // Iterate over the Documents in the Hits object
>>            for (int i = 0; i < hitCount; i++) {
>>                Document doc = hits.doc(i);
>>
>>                // Print the value stored in the "pagenumber" field. Unlike
>>                // the "content" field, it was stored verbatim and can be
>>                // retrieved.
>>                System.out.println("  " + (i + 1) + ". " + doc.get("pagenumber"));
>>            }
>>        }
>>        ind_searcher.close();
>>
>> --------------------
>> I'm using Lucene version 2.9.0.
>> You said that Hits is deprecated. Should I use HitCollector instead?
>>
>> Another question came to mind: what if I want to add another PDF
>> document to the search pool? Before searching, I would like to specify
>> which PDF document to search and then get back the page number for the
>> searched string. I could create a separate index for every document that
>> I add to the search pool, but that doesn't sound good to me. Can you
>> think of a better way to do that?
>>
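One possible direction for the several-PDFs question, sketched without Lucene: keep a single shared index and tag every page entry with the file it came from, then filter on that tag at query time. Everything here (`SharedPageIndex`, `Posting`, `addPage`, `pagesFor`, the file names) is hypothetical. In Lucene terms this would presumably correspond to adding one more field per page document, such as a file name, and restricting the query to it, but that mapping is an assumption and not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index shared by several PDFs; each posting remembers its file.
class SharedPageIndex {

    // One occurrence record: which file and which page a term appeared on.
    static final class Posting {
        final String file;
        final int page;

        Posting(String file, int page) {
            this.file = file;
            this.page = page;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Indexes one page; terms are lowercased and split on non-word characters.
    void addPage(String file, int page, String text) {
        Set<String> seen = new HashSet<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            // Record each distinct term once per page.
            if (!term.isEmpty() && seen.add(term)) {
                postings.computeIfAbsent(term, t -> new ArrayList<>())
                        .add(new Posting(file, page));
            }
        }
    }

    // Pages of the chosen file that contain the term, in the order they were added.
    List<Integer> pagesFor(String file, String term) {
        List<Integer> pages = new ArrayList<>();
        for (Posting p : postings.getOrDefault(term.toLowerCase(),
                Collections.emptyList())) {
            if (p.file.equals(file)) {
                pages.add(p.page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        SharedPageIndex index = new SharedPageIndex();
        index.addPage("a.pdf", 1, "Cryptography basics");
        index.addPage("a.pdf", 2, "Key exchange");
        index.addPage("b.pdf", 1, "More cryptography");

        System.out.println(index.pagesFor("a.pdf", "cryptography")); // prints [1]
        System.out.println(index.pagesFor("b.pdf", "exchange"));     // prints []
    }
}
```

The point of the design is that adding a PDF only adds entries to the existing structure, so there is one index to maintain rather than one per document.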
>>
>> Erick Erickson wrote:
>> >
>> > Your search would be on the "contents" field if you use
>> > LucenePDFDocument.
>> >
>> > But on a quick look, LucenePDFDocument doesn't give you any page
>> > information. So, you'd have to collect that somehow, but I don't see a
>> > clear way to.
>> >
>> > Doing it manually, you could do something like:
>> >
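>> > Document doc = new Document();
>> > for (each page in the document) {
>> >   doc.add(new Field("contents", <text for page>,
>> >                     Field.Store.NO, Field.Index.ANALYZED));
>> >   // record the offset of the last term in the page you just indexed
>> > }
>> > // stored but not indexed: only read back to translate offsets to pages
>> > doc.add(new Field("metadata", <string representation of the page offsets>,
>> >                   Field.Store.YES, Field.Index.NO));
>> > iw.addDocument(doc);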
>> >
>> > Now, when you search you can get the offsets of the matching term,
>> > then look in your metadata field for the page number.
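The lookup step described above (match offset in, page number out) could look roughly like the following. This is only a sketch: `PageOffsets` and `pageForOffset` are invented names, the offsets are invented numbers, and it assumes the metadata has already been parsed into an ascending `int[]` holding the offset of the last term on each page, with every page containing at least one term.

```java
import java.util.Arrays;

// Hypothetical helper: maps a term offset to a page using stored page-end offsets.
class PageOffsets {

    // lastOffsets[k] is the offset of the last term on page k + 1 (strictly ascending).
    // Returns the 1-based page containing termOffset, or -1 if it is out of range.
    static int pageForOffset(int[] lastOffsets, int termOffset) {
        if (termOffset < 0) {
            return -1;
        }
        int idx = Arrays.binarySearch(lastOffsets, termOffset);
        if (idx < 0) {
            // Not an exact page end: convert to the insertion point, which is the
            // first page whose last offset is greater than termOffset.
            idx = -idx - 1;
        }
        return idx < lastOffsets.length ? idx + 1 : -1;
    }

    public static void main(String[] args) {
        // Assumed layout: page 1 holds offsets 0-99, page 2 100-249, page 3 250-400.
        int[] lastOffsets = {99, 249, 400};

        System.out.println(pageForOffset(lastOffsets, 0));   // prints 1
        System.out.println(pageForOffset(lastOffsets, 99));  // prints 1
        System.out.println(pageForOffset(lastOffsets, 100)); // prints 2
        System.out.println(pageForOffset(lastOffsets, 400)); // prints 3
        System.out.println(pageForOffset(lastOffsets, 401)); // prints -1
    }
}
```

A binary search keeps the lookup cheap even for a 5000-page document, since the offsets array has only one entry per page.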
>> >
>> > Perhaps you could use the LucenePDFDocument in conjunction with this
>> > somehow, but I confess that I've never used it so it's not clear to me
>> > how you'd do this.
>> >
>> > Incidentally, the Hits object is deprecated. What version of Lucene are
>> > you intending to use?
>> >
>> > Best
>> > Erick
>> >
>> > On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <idraganj@gmail.com> wrote:
>> >
>> >>
>> >> Thanks for the reply Erick.
>> >>
>> >> I would like to permanently index this content and search it
>> >> multiple times for different terms, so I would like a permanent copy.
>> >>
>> >> My problem is that I don't know how to retrieve the page number where
>> >> the searched string was found, so if you could help with that issue,
>> >> that would be great.
>> >>
>> >> // I would start like this:
>> >> // This part of code would create the index, right?
>> >> Document luceneDocument = LucenePDFDocument.getDocument(f);
>> >> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
>> >> iwriter.addDocument(luceneDocument);
>> >> iwriter.close();
>> >>
>> >> // and now for the search:
>> >> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>> >> IndexSearcher ind_search = new IndexSearcher(fsDir);
>> >>
>> >> // I'm not sure what "fieldname" should be. Is it the string that I'm searching for?
>> >> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
>> >> Query query = parser.parse(q);
>> >>
>> >> Hits hits = ind_search.search(query);
>> >>
>> >> // and I'm stuck here. I don't know how to retrieve the page number.
>> >>
>> >> Erick Erickson wrote:
>> >> >
>> >> > It depends (tm). Do you want to permanently index this content and
>> >> > search it multiple times, or is each search a one-off? If the latter,
>> >> > I'd look for packages specific to handling PDF files, although since
>> >> > Reader takes forever to search a document, I suspect there's not much
>> >> > joy there.
>> >> >
>> >> > If you want to parse the file once and search it many times, then yes,
>> >> > Lucene can help a lot. You could conceivably do this in a memory index
>> >> > if you didn't want a permanent copy. In this scheme, you'd index the
>> >> > file before the first search, then use the in-memory index until you
>> >> > were done searching (assuming you wanted to search for different terms
>> >> > multiple times). You'd have to do some record-keeping to remember what
>> >> > the start and end offset of each page was, so you could deal with the
>> >> > case where a phrase you search for started on one page and ended on
>> >> > another.
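The "parse once, search many times" idea can be illustrated with a toy in-memory inverted index built from plain Java collections. This is a sketch of the general technique only, not of Lucene's in-memory index; the class `InMemoryPageIndex`, its `pagesFor` method, and the sample page text are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy inverted index: built once, then queried as often as needed.
class InMemoryPageIndex {

    // term -> sorted 1-based page numbers on which the term occurs
    private final Map<String, SortedSet<Integer>> termToPages = new HashMap<>();

    // pages[0] is the text of page 1, pages[1] the text of page 2, and so on.
    InMemoryPageIndex(String[] pages) {
        for (int i = 0; i < pages.length; i++) {
            // Lowercase and split on non-word characters as a crude analyzer.
            for (String term : pages[i].toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    termToPages.computeIfAbsent(term, t -> new TreeSet<>()).add(i + 1);
                }
            }
        }
    }

    // A single map lookup per query instead of rescanning every page.
    SortedSet<Integer> pagesFor(String term) {
        return termToPages.getOrDefault(term.toLowerCase(),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InMemoryPageIndex index = new InMemoryPageIndex(new String[] {
                "Intro to Lucene",
                "Parsing PDF files",
                "Lucene and PDF together"});

        System.out.println(index.pagesFor("lucene"));  // prints [1, 3]
        System.out.println(index.pagesFor("PDF"));     // prints [2, 3]
        System.out.println(index.pagesFor("missing")); // prints []
    }
}
```

The cost of reading every page is paid once when the index is built; each later search is a hash lookup, which is where the speedup over a page-by-page scan comes from.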
>> >> >
>> >> > If this is off base, perhaps you could provide more details...
>> >> >
>> >> > Erick
>> >> >
>> >> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idraganj@gmail.com> wrote:
>> >> >
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I have to search a single PDF document for a requested string, and if
>> >> >> that string is found, I need to return the page number where it was
>> >> >> found. The requested string can be anything in the PDF document.
>> >> >>
>> >> >> It is a big document (about 5000 pages), so I'm asking whether that is
>> >> >> possible with Lucene.
>> >> >>
>> >> >> I'm using the pdfbox library, and I found a way to do it (a substring
>> >> >> search with indexOf, page by page), but it is too slow:
>> >> >>
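>> >> >>        PDDocument pddDocument = PDDocument.load(f);
>> >> >>
>> >> >>        PDFTextStripper textStripper = new PDFTextStripper();
>> >> >>        // getEndPage() is the stripper's configured end page, not the
>> >> >>        // document length, so ask the document for its page count
>> >> >>        int lastpage = pddDocument.getNumberOfPages();
>> >> >>        String page = null;
>> >> >>        int found = 0;
>> >> >>
>> >> >>        // pages are 1-based, so include the last page
>> >> >>        for (int i = 1; i <= lastpage; i++) {
>> >> >>            textStripper.setStartPage(i);
>> >> >>            textStripper.setEndPage(i);
>> >> >>
>> >> >>            page = textStripper.getText(pddDocument);
>> >> >>
>> >> >>            found = page.indexOf(searchtext);
>> >> >>
>> >> >>            // indexOf returns -1 when absent; 0 is a match at the page start
>> >> >>            if (found >= 0) { returnpage = i; break; }
>> >> >>        }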
>> >> >> ----------------
>> >> >>
>> >> >> Is there a way to speed up the search with Lucene? Can I use indexing
>> >> >> to solve this problem? Thanks.
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
>> >> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 


