lucene-java-user mailing list archives

From blaplante <blapla...@netwebapps.com>
Subject Re: Re: Corrupt index problem
Date Tue, 06 May 2003 16:24:34 GMT
Yes, I am using the org.pdfbox.searchengines.LucenePDFDocument class. Like
the others, I put in fork logic that calls either HTMLParser or
LucenePDFDocument to index the InputStream. The strange thing is that there
are about 8 PDFs in the site, and they seem to be the only things that get
indexed. I am using Eclipse and watching the debugger to make sure the
correct parser is called. I also watch the index grow during the indexing
stage and shrink when optimize is called. Yet when I enter search criteria
I get no results unless I enter a single alphabetic character [a-zA-Z], and
then I get the results I spoke of earlier. Once I get this working I would
be happy to share my code that crawls a site using https; it looks like
that one is missing from the sandbox.
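
For anyone following the thread, here is a minimal sketch of the kind of
fork logic being discussed, written as a small registry keyed on MIME type
rather than a chain of if statements. It is not code from Lucene or PDFBox:
ContentTypeDispatcher and Handler are hypothetical names, and the handlers
return a String only as a stand-in for building a Lucene Document with
LucenePDFDocument or HTMLParser. It also assumes a Java version with
lambdas, which is newer than what this thread would have used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical dispatcher (not part of Lucene or PDFBox): maps a MIME type
// to the parser that should handle a stream of that type.
final class ContentTypeDispatcher {

    // Hypothetical callback; a real handler would build a Lucene Document,
    // e.g. via LucenePDFDocument for PDFs or HTMLParser for HTML.
    interface Handler {
        String handle(InputStream in) throws IOException;
    }

    private final Map<String, Handler> handlers = new HashMap<>();

    void register(String mimeType, Handler handler) {
        handlers.put(normalize(mimeType), handler);
    }

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Routes the stream to the registered handler, or fails loudly so that
    // an unknown type is never fed to the wrong parser.
    String dispatch(String contentType, InputStream in) {
        Handler handler = handlers.get(normalize(contentType));
        if (handler == null) {
            throw new IllegalArgumentException(
                    "No parser registered for: " + contentType);
        }
        try {
            return handler.handle(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an https crawler, the content type string could come from
URLConnection.getContentType() on the open connection, with one handler
registered for "application/pdf" and another for "text/html". Failing on an
unregistered type is a deliberate choice here, so that a PDF stream is never
silently handed to the HTML parser.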

>On Tue, 6 May 2003 11:58:07 -0400 (EDT) Ben Litchfield <ben@csh.rit.edu> wrote:
>Are you using the org.pdfbox.searchengines.LucenePDFDocument class?  It
>looks like you are using the HTMLParser to parse a PDF document, which will
>not work.  So in your indexer class you need to have something to this
>effect:
>
>if( PDF Document )
>{
>    use LucenePDFDocument
>}
>else if( HTML Document )
>{
>    use HTMLParser
>}
>
>You can see a working example of this in the
>org.pdfbox.searchengine.lucene.IndexFiles source file.
>
>It would be nice if Lucene had some sort of Front class that implemented a
>command pattern, so you could feed it a URL or File, it would determine
>the MIME type, and it would call a parser configured for that MIME type.
>It seems silly that everybody has all of these if statements in all of
>their indexers to switch on the contentType.  Maybe I will work on that
>one of these days if there is some interest.
>
>Ben
>

>On Tue, 6 May 2003, blaplante wrote:
>
>> I am not sure if I might be doing something wrong, but searching the
>> index produces a summary field that looks like the following.
>>
>> F-1.2 ÏÓ 10 0 obj << /Length 11 0 R /Filter /FlateDecode >> stream

>> H?µWÛnÜHýÿC=&?X®«.oÛñeá

>>
¾?ÝN0?_äî²­?ºå?Ôvü÷KÖM¥NË3?Ì"Ð??Å"yxxøy¾'?HDJT?Èüh?üÓ>?½?Jò$?ðß÷{4¡?sø¹

>> ûø[2øýB>Ü\ÍæÇGäúâdþ}vuLð?

>>

>> I am using:
>> lucene-1.2.jar
>> PDFBox-0.6.2.jar
>>
>> I wrote a class that formulates a dynamic URL and uses https to connect
>> via the web and retrieve an InputStream using a GET_METHOD. I am passing
>> the InputStream to the HTMLParser(InputStream stream) constructor and
>> calling parser.getReader() to store the content in the index and
>> parser.getSummary() to store the summary in the index. When I used the
>> demo that came with Lucene and did a local index on my own machine, the
>> HTMLParser(File file) constructor worked just fine. If anyone has ideas
>> as to what I might be doing wrong, I would appreciate the heads up.
>>
>> Thanks
>>
>> Bryan LaPlante
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>


