From: Ulrich Schinz
To: java-user@lucene.apache.org
Subject: Re: getting text-snippets
Date: Thu, 23 Jun 2005 16:17:11 +0200

> Field.Text(String, Reader) is not a stored field. This is why
> doc.get("contents") is empty.

OK, I read that in the Lucene javadoc. But I don't understand what Field.Text(String, Reader, boolean) does. If I set the boolean to true, what is the storeTermVector?

> You have some options... change to using a stored field by reading
> the file contents into a String and using Field.Text(String,
> String) instead. Or, when rendering the results, go directly to
> the file pointed to by doc.get("filename") and read its contents
> then. There are pros/cons to both of these approaches.

OK, I started to try this. But I am also trying to index PDF files, so I get an InputStream from pdftotext. Converting that to a String takes a really long time, and we have a lot of data to index. I tried different ways to get that done:

1.
    String ret = "";
    String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
    byte[] buffer = new byte[80];
    Process child = Runtime.getRuntime().exec(cmd);
    InputStream is = child.getInputStream();
    BufferedInputStream bis = new BufferedInputStream(is, 80);
    int next;
    // read up to 80 bytes at a time and append each chunk to the string
    while ((next = bis.read(buffer, 0, 80)) != -1) {
        ret += new String(buffer, 0, next);
    }

Not exactly that way, but conceptually (the real code compiles :-) ).

2.
    String ret = "";
    String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
    Process child = Runtime.getRuntime().exec(cmd);
    InputStream is = child.getInputStream();
    int next;
    // read one byte at a time and append it to the string
    while ((next = is.read()) != -1) {
        ret += (char) next;
    }

But both versions are really slow: it takes at least 20 minutes to read a PDF file of about 900 KB. Is there a way to make that faster?

Regards,
uli
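
For reference, a minimal sketch of the usual way to speed up loops like the two above. It assumes the slowdown comes from `ret += ...`, which copies the whole string on every pass, so the total work grows quadratically with the file size. Appending to a single `StringBuilder` with a larger read buffer avoids that. The class name `StreamText`, the helper `readAll`, the 8192-char chunk size and the UTF-8 charset are all assumptions for illustration; the charset in particular should match whatever pdftotext actually emits on your system.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class StreamText {

    // Hypothetical helper: drains a stream into a String through one growable buffer.
    // Assumes the bytes are UTF-8 text; change the charset if pdftotext emits another one.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int n;
            // read() returns the number of chars read, or -1 at end of stream
            while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
                sb.append(chunk, 0, n); // amortized append, no full copy per pass
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for child.getInputStream(), so the sketch runs without pdftotext.
        InputStream in = new ByteArrayInputStream(
                "text extracted from a pdf".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in));
    }
}
```

With the process from the message, the call would be `readAll(child.getInputStream())`, and the result could then be passed to Field.Text(String, String) so that the contents are stored.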