Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
Mime-Version: 1.0 (Apple Message framework v730)
In-Reply-To: <41F0759F-D92E-4DC4-A4DB-7C2FBB9E682E@ehatchersolutions.com>
References: <89FA3553-6FBC-41B3-A289-C3C6253AA775@schinz.de>
 <41F0759F-D92E-4DC4-A4DB-7C2FBB9E682E@ehatchersolutions.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-Id: <0932BE11-F10F-4106-9D18-7DE4C8EC95C1@schinz.de>
Content-Transfer-Encoding: 7bit
From: Ulrich Schinz <uli@schinz.de>
Subject: Re: getting text-snippets
Date: Thu, 23 Jun 2005 16:17:11 +0200
To: java-user@lucene.apache.org

> Field.Text(String, Reader) is not a stored field.  This is why  
> doc.get("contents") is empty.
>

OK, I read that in the Lucene javadoc... I don't understand what
Field.Text(String, Reader, boolean) does... if I set the boolean to true,
what is the storeTermVector??

>
> You have some options... change to using a stored field by reading  
> the file contents into a String and using Field.Text(String,  
> String) instead.  Or, when rendering the results, go directly to  
> the file pointed to by doc.get("filename") and read its contents  
> then.  There are pros/cons to both of these approaches.
>

OK, I started to try this... but I am also trying to index PDF files, so I
get an InputStream from pdftotext. If I try to convert that to a
String it takes a really long time,
and we have a lot of data to index....
I tried different ways to get that done:
1.
String ret = "";
String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
byte[] buffer = new byte[80];
// run pdftotext and read its standard output ("-" writes the text to stdout)
Process child = Runtime.getRuntime().exec(cmd);
InputStream is = child.getInputStream();
BufferedInputStream bis = new BufferedInputStream(is, 80);
int next;
// read up to 80 bytes per pass until end of stream (-1)
while ((next = bis.read(buffer, 0, 80)) != -1) {
    String input = new String(buffer, 0, next);
    ret += input;  // append this chunk to the result
}

Not exactly like that, but conceptually (the real code compiles :-) )
2.
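String ret = "";
String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
// run pdftotext and read its standard output ("-" writes the text to stdout)
Process child = Runtime.getRuntime().exec(cmd);
InputStream is = child.getInputStream();
int next;
// read one byte per pass until end of stream (-1)
while ((next = is.read()) != -1) {
    ret += (char) next;  // append this single character to the result
}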

But both versions are really slow... it takes me more than 20
minutes (at minimum) to read the text of a PDF file of about 900 KB...
Is there a way to make that faster???
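
One thing that might be worth trying is to collect the raw bytes in a
growing buffer and build the String only once at the end, instead of
appending to a String inside the loop. Below is a minimal, untested
sketch of that idea. The class name StreamText, the helper names
readAll and pdfToText, the 8 KB buffer size and the charset parameter
are all my own assumptions for illustration; which charset pdftotext
actually writes depends on the installed version and its settings.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Lucene or pdftotext.
final class StreamText {

    // Reads the stream to its end and decodes the bytes once at the end.
    // Decoding once also avoids cutting a multi-byte character in half
    // at a chunk boundary.
    static String readAll(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];  // assumed chunk size
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);     // appends only the bytes just read
        }
        return out.toString(charsetName);
    }

    // Runs pdftotext on one file and returns the text it writes to stdout.
    // The path /usr/bin/pdftotext is taken from the snippets above.
    static String pdfToText(String pdfPath, String charsetName)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("/usr/bin/pdftotext", pdfPath, "-");
        // Send the child's error output to our own stderr so it cannot
        // fill up an unread pipe and block the child process.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process child = pb.start();
        try (InputStream in = child.getInputStream()) {
            String text = readAll(in, charsetName);
            child.waitFor();
            return text;
        }
    }
}
```

With something like this, the indexing code could call
StreamText.pdfToText("test.pdf", "ISO-8859-1") and pass the result to
Field.Text(String, String), assuming that charset matches the output.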


regards,
uli

