lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: getting text-snippets
Date Thu, 23 Jun 2005 18:05:27 GMT
On Jun 23, 2005, at 10:17 AM, Ulrich Schinz wrote:
>> Field.Text(String, Reader) is not a stored field.  This is why  
>> doc.get("contents") is empty.
>>
>>
>
> ok, i read that in javadoc of lucene... in dont understand what  
> Field.Text(String,Reader,boolean) does... if i set boolean to true,
> what is the stortermvector??

Term vectors is additional index storage allowing you to find out how  
many of each term occurred in a field.  For your purposes, you don't  
need to use that feature.

>> You have some options... change to using a stored field by reading  
>> the file contents into a String and using Field.Text(String,  
>> String) instead.  Or, when rendering the results, go directly to  
>> the file pointed to by doc.get("filename") and read its contents  
>> then.  There are pros/cons to both of these approaches.
>>
>>
>
> ok, i started to try this... but i also try to index pdf-files.. so  
> i get an InputStream from pdftotext. if i try to convert that to a  
> String it takes really long time,
> and we have a lot of data to index....
> i tried different ways to get that done:
> 1.
> String ret = "";
> InputStream is=null;
> String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
> byte[] buffer = new byte[80];
> child = Runtime.getRuntime().exec(cmd);
> is = child.getInputStream();
> BufferedInputStream bis = new BufferedInputStream(is,80);
> while(next != -1){
> ++t;
> next = bis.read(buffer,bis.pos, 80);
> String input = new String(buffer,0,next);
> ret += input;
> }
>
> not really that way, but conceptual (in real it compiles :-) )
> 2.
> String ret = "";
> InputStream is=null;
> String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
> byte[] buffer = new byte[80];
> child = Runtime.getRuntime().exec(cmd);
> is = child.getInputStream();
> while(next != -1){
> ++t;
> next = is.read();
> ret += String(next);
> }
>
> but those versions are both really slow... it takes me more than 20  
> minutes (minimum) to get a pdf file of size 900 k...
> is there a way to get that faster???

You should consider using PDFBox for reading PDF files.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message