lucene-dev mailing list archives

From IvanDrago <idrag...@gmail.com>
Subject Re: search trough single pdf document - return page number
Date Thu, 15 Oct 2009 14:43:22 GMT

Thanks for the reply Erick.

I would like to index this content permanently and search it multiple
times for different terms, so I do want a permanent copy.

My problem is that I don't know how to retrieve the page number where the
searched string was found, so if you could help with that issue, that
would be great.

// I would start like this:
// This part of code would create the index, right?
Document luceneDocument = LucenePDFDocument.getDocument( f );
IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
iwriter.addDocument(luceneDocument);
iwriter.close();

//and now for the search:
Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_search = new IndexSearcher(fsDir);

// I'm not sure whether "fieldname" should be the string that I'm searching for?
QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
Query query = parser.parse(q);

Hits hits = ind_search.search(query);

// and I'm stuck here. I don't know how to retrieve the page number.
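
A search hit can only report a page number if the page number was recorded at indexing time. One possible approach, sketched here under assumptions, is to index each page as its own unit and keep the page number next to its text. In Lucene that would presumably mean one Document per page with a stored page field; the sketch below does not use Lucene at all and swaps in a plain in-memory map from word to page numbers, only to illustrate the idea. The class PageIndex and its methods addPage and pagesFor are hypothetical names, not part of Lucene or PDFBox.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper (not a Lucene or PDFBox class):
// maps each word to the set of pages it occurs on.
class PageIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index the text of one page under its 1-based page number.
    void addPage(int pageNumber, String pageText) {
        // Lowercase and split on anything that is not a letter or digit.
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
        }
    }

    // Pages containing the word, in ascending order; empty if the word was never indexed.
    List<Integer> pagesFor(String word) {
        TreeSet<Integer> pages = postings.get(word.toLowerCase(Locale.ROOT));
        if (pages == null) {
            return Collections.<Integer>emptyList();
        }
        return new ArrayList<Integer>(pages);
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage(1, "Introduction to full-text search");
        index.addPage(2, "Lucene builds an inverted index");
        index.addPage(3, "Search the index with a query");
        System.out.println(index.pagesFor("index"));  // pages 2 and 3
        System.out.println(index.pagesFor("search")); // pages 1 and 3
    }
}
```

The page texts would come from extracting the PDF one page at a time, once, at indexing time. This sketch only handles single words; a phrase that crosses a page break needs extra bookkeeping.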

Erick Erickson wrote:
> 
> It depends (tm). Do you want to permanently index this content and search
> it
> multiple times or is each search a one-off? If the latter, I'd look for
> packages specific to handling PDF files, although Reader takes forever
> to search a document, so I suspect there's not much joy there.
> If you want to parse the file once and search it many times, then yes,
> Lucene can help a lot. You could conceivably do this in a memory index if
> you didn't want a permanent copy. In this scheme, you'd index the file
> before the first search, then use the in-memory index until you were done
> searching (assuming you wanted to search for different terms multiple
> times). You'd have to do some record-keeping to remember what the start
> and end offsets of each page were so you could deal with the case where a
> phrase you searched for started on one page and ended on another.
> 
> If this is off base, perhaps you could provide more details...
> 
> Erick
> 
> On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idraganj@gmail.com> wrote:
> 
>>
>> Hi,
>>
>> I have to search a single pdf document for requested string and if that
>> string is found, I need to return a page number where that string was
>> found.
>> Requested string can be anything in a pdf document.
>>
>> It is a big document (about 5000 pages), so I'm asking if that is possible
>> with Lucene.
>>
>> I'm using PDFBox and I found a way to do it (searching with indexOf
>> page by page), but it is too slow:
>>
>>        PDDocument pddDocument=PDDocument.load(f);
>>
>>        PDFTextStripper textStripper=new PDFTextStripper();
>>        int lastpage = textStripper.getEndPage();
>>        String page= null;
>>        int found= 0;
>>
>>        for(int i=1; i<lastpage ; i++){
>>            textStripper.setStartPage(i);
>>            textStripper.setEndPage(i);
>>
>>            page = textStripper.getText(pddDocument);
>>
>>            found = page.indexOf(searchtext);
>>
>>            // indexOf returns -1 when there is no match; 0 is a valid hit
>>            if (found >= 0) { returnpage = i; break; }
>>        }
>> ----------------
>>
>> Is there a way to speed up the search with lucene? Can I use indexing to
>> solve this problem? thanks.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
> 
> 
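
The record-keeping described in the quoted reply (remembering where each page starts and ends so that a phrase running across a page break can still be located) might look roughly like the following. This is a minimal sketch under stated assumptions: the page texts are already extracted, pages are joined with a single space, and matching is a plain case-sensitive substring search rather than a Lucene query. PageOffsets, pageAt and find are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: joins page texts into one string and remembers where each page starts.
class PageOffsets {
    private final String text;
    private final int[] pageStarts; // pageStarts[i] = offset in text where page i+1 begins

    PageOffsets(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        pageStarts = new int[pages.size()];
        for (int i = 0; i < pages.size(); i++) {
            if (i > 0) {
                sb.append(' '); // assumed separator so words on adjacent pages do not fuse
            }
            pageStarts[i] = sb.length();
            sb.append(pages.get(i));
        }
        text = sb.toString();
    }

    // 1-based number of the page that contains the given character offset.
    int pageAt(int offset) {
        int pos = Arrays.binarySearch(pageStarts, offset);
        // Exact hit: the offset is the first character of page pos+1.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the 1-based number of the enclosing page.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Returns {firstPage, lastPage} for the first occurrence of phrase, or null if absent.
    int[] find(String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        int start = text.indexOf(phrase);
        if (start < 0) {
            return null;
        }
        int end = start + phrase.length() - 1; // offset of the last matched character
        return new int[] { pageAt(start), pageAt(end) };
    }

    public static void main(String[] args) {
        PageOffsets doc = new PageOffsets(Arrays.asList("alpha beta", "gamma delta", "epsilon"));
        int[] hit = doc.find("beta gamma");
        System.out.println(hit[0] + " to " + hit[1]); // starts on page 1, ends on page 2
    }
}
```

The same offset table could be used with any search mechanism that reports character positions of a match; only the lookup from offset to page is shown here.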

-- 
View this message in context: http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25909908.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



