lucene-java-user mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
Date Tue, 25 Mar 2003 14:46:02 GMT
On Monday 24 March 2003 18:03, Michael Wechner wrote:
> John Bresnik wrote:
> >anyone know of a quick and easy way to get this demo
> >[org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
> >crawler to create a local [static] version of the site, i.e. they are no
> >longer "JSP" files, just the HTML output from the original JSP files, but
> > in the interest of keeping the URLs intact, I need to parse the JSP
> > extensions. The short question is, does anyone know of a way to *not*
> > ignore the *.jsp files?
>
> just modify IndexHTML: there is one line in there which decides which
> extensions it will index.
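
For illustration, the kind of extension check Michael describes might look
roughly like the sketch below. This is not the demo's actual code: the class
name ExtensionFilter, the method shouldIndex, and the extension list are all
hypothetical, and the case-insensitive comparison is my own choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExtensionFilter {
    // Hypothetical list of extensions to index; ".jsp" is the addition
    // needed for crawled copies that kept their original names.
    private static final List<String> INDEXED_EXTENSIONS =
            Arrays.asList(".html", ".htm", ".txt", ".jsp");

    /** Returns true if the path ends with one of the listed extensions. */
    public static boolean shouldIndex(String path) {
        // Compare in lower case so "INDEX.JSP" is accepted as well.
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : INDEXED_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("site/index.jsp"));
        System.out.println(shouldIndex("site/logo.gif"));
    }
}
```

Since the crawled files already contain plain HTML output, accepting the
.jsp extension should be enough in that case; no JSP processing is needed.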

That raises another question I have been wondering about. JSP is not XML: it
cannot be reliably parsed with an XML or even an HTML parser (or, for that
matter, with even the simplest XML markup tokenizer that ignores nesting), so
it needs a lower-level scanner. Has anyone tried connecting an actual JSP
processor to Lucene? Or writing a simple one meant just for indexing, without
having to execute the embedded code?
[The problem with JSP compared to XML is that it need not nest properly with
the surrounding HTML content; one can use JSP inside attribute values, for
example. Thus the JSP first has to be processed into HTML, and then the HTML
needs to be further tokenized.]

Jakarta has to have at least one such processor (I haven't looked at whether
there's a separate component or whether Tomcat just has one embedded). Of
course, parsing JSP is problematic in many ways beyond just stripping out the
JSP tags: dynamic portions probably just have to be ignored, and all other
text included (except for anything inside comments).
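
As a rough sketch of what such an indexing-only scanner could look like: the
class below (JspTextStripper, a hypothetical name) drops JSP comments and
every other <% ... %> construct without executing anything, and keeps the
template text so it can then be handed to an HTML tokenizer. It assumes
classic <% %> syntax only and ignores escape sequences, custom tags, and
expression language, so treat it as a starting point rather than a real JSP
processor.

```java
public class JspTextStripper {

    /**
     * Removes JSP comments and all other <% ... %> constructs
     * (scriptlets, expressions, declarations, directives) from the input,
     * keeping the surrounding template text unchanged.
     */
    public static String strip(String jsp) {
        StringBuilder out = new StringBuilder(jsp.length());
        int pos = 0;
        while (pos < jsp.length()) {
            int open = jsp.indexOf("<%", pos);
            if (open < 0) {
                // No more JSP constructs: keep the rest as template text.
                out.append(jsp, pos, jsp.length());
                break;
            }
            // Keep the template text that precedes the construct.
            out.append(jsp, pos, open);
            boolean comment = jsp.startsWith("<%--", open);
            String closer = comment ? "--%>" : "%>";
            int close = jsp.indexOf(closer, open + (comment ? 4 : 2));
            if (close < 0) {
                // Unterminated construct: drop everything after it.
                break;
            }
            pos = close + closer.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<a href=\"<%= url %>\">Home</a><%-- note --%> text";
        System.out.println(strip(page));
    }
}
```

Note that this handles the attribute-value case mentioned above because it
scans the raw characters before any HTML parsing happens; the dynamic value
is simply lost, leaving an empty attribute.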

-+ Tatu +-

