From: maxmil
To: java-user@lucene.apache.org
Subject: Beginner: Best way to index and display orginal text of pdfs in search results
Date: Fri, 12 Dec 2008 00:34:28 -0800 (PST)
Message-ID: <20971377.post@talk.nabble.com>

Hi,

This is the first time I am using Lucene.
I need to index PDFs with very few fields (title, date and body, the last being a long field) for a web-based search. The results have to show not only the documents found but also, for each document, a snippet of the text where the search term was found. This is analogous to the way Google displays search results. That is to say:

... some words and first instance of search term and some more words ... some more words second instance of search term and some more words ... etc.

To do this I would need the original text of the document for each hit. As far as I understand, Lucene does not save the original text of the document in the index. I am not using a database and would prefer not to have to store the original document text elsewhere.

One way I could do this would be to take the hits from Lucene and reopen each PDF to extract the original text at run time. However, I fear that with many results this would be very slow.

What would you recommend?

Thanks,
max
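
To make the desired display concrete, here is a minimal sketch in plain Java of the kind of excerpt described above. It is not Lucene code and makes no claim about how Lucene handles this. The class name `SnippetSketch`, the methods `fragments` and `snippet`, and the `context` and `maxFragments` parameters are all hypothetical names chosen for illustration, and the sketch assumes the body text of the document is already available as a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnippetSketch {

    // Hypothetical helper: collects up to maxFragments excerpts of text,
    // each showing one occurrence of term with up to `context` characters
    // of surrounding text on either side. Matching is case-insensitive.
    public static List<String> fragments(String text, String term,
                                         int context, int maxFragments) {
        List<String> out = new ArrayList<>();
        if (term.isEmpty()) {
            return out;
        }
        // Pattern.quote treats the term as a literal, not as a regex.
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
                           .matcher(text);
        int from = 0;
        while (out.size() < maxFragments && m.find(from)) {
            int start = Math.max(0, m.start() - context);
            int end = Math.min(text.length(), m.end() + context);
            out.add(text.substring(start, end).trim());
            // Continue after this excerpt so that a hit already shown
            // inside it does not produce a second, overlapping excerpt.
            from = end;
        }
        return out;
    }

    // Joins the excerpts in the "... words term words ... words term words ..." style.
    public static String snippet(String text, String term,
                                 int context, int maxFragments) {
        List<String> parts = fragments(text, term, context, maxFragments);
        return parts.isEmpty() ? "" : "... " + String.join(" ... ", parts) + " ...";
    }

    public static void main(String[] args) {
        String body = "Lucene is a search library. Many sites use Lucene to index documents.";
        System.out.println(snippet(body, "lucene", 10, 2));
    }
}
```

The sketch cuts excerpts at fixed character offsets, so words at the edges may be truncated. It only illustrates the output format; it still needs the original text from somewhere, which is the part I am unsure how best to provide.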