lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stone, Timothy" <tst...@cityofhbg.com>
Subject RE: Newbie quizzes further...
Date Tue, 03 Sep 2002 12:30:01 GMT
Joshua, et.al,

"You are correct sirs." But... ;)

I'm deploying the demo as I have very little time to come up to speed on the
API, JavaCC and other more powerful features of Lucene. (It's likely all the
reading I did over the weekend may make that moot, but I don't think I'm
going to be writing my own implementation this week... ) ;)

The demo actually does support entities, but I think something is askew (I'm
not sure if it's a bug).*

Specifically, see org.apache.lucene.demo.html.Entities. The class provides
straight forward encoding and decoding of the entities, including the entity
series in question (82xx). The Entities class appears to be directly used by
the HTMLParser.jj, in turn, used by IndexHTML. Since I don't know JavaCC,
I'm hoping that someone much more familiar with the demo and Lucene can
help. I have provided some idea of what is happening below.

Warmest Regards,
Tim


* The details...

My problem manifests itself in the the demo when returning results. I use
the 82xx series of decimal entities extensively throughout my site, in the
title, in the content and more. The returned results thus look like:

Document						Summary
Welcome to HarrisburgCity.com ? ...		To quote the Mayor, ? I have
seen...


Here the entity, "&#8212;", emdash, is used in the title, and the entity
"&#8220;", opening double quote, is used in the summary. Both appearing as
"?".

> -----Original Message-----
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
> Sent: Monday, September 02, 2002 20:36
> To: Lucene Users List
> Subject: Re: Newbie quizzes further...
> 
> 
> On Mon, 2 Sep 2002, Stone, Timothy wrote:
> 
> > I have noted that Lucene fails to interpret numerous HTML entities,
> > specifically entities in the 82xx range, i.e. &#8212; (en-dash) and
> > many others. Now this may not be a Lucene issue, I'm looking at the
> > code as I post, but I'm curious to its origins and why they 
> don't seem
> > to be parsed properly in the index.
> 
> As I see it, there are two answers to this question.
> 
> (1) What gets parsed and indexed is your choice; there are various
> different Analyzers that are included with the Lucene 
> package, which have
> different effects.  You could conceivably construct an 
> Analyzer that would
> parse such entities as you describe.
> 
> (2) Historically punctuation has not been parsed by search 
> engines, for
> the simple reason that it doesn't tend to add much to search 
> precision and
> it complicates the indexing process.  (On the other hand, if you're
> talking about accents and non-English letters, I understand that some
> people have written analyzers that cover these things; check out the
> contrib section on the Lucene website.)
> 
> Joshua O'Madadhain
> 
>  jmadden@ics.uci.edu...Obscurium Per 
> Obscurius...www.ics.uci.edu/~jmadden
>   Joshua O'Madadhain: Information Scientist, Musician, 
> Philosopher-At-Tall
>  It's that moment of dawning comprehension that I live 
> for--Bill Watterson
> My opinions are too rational and insightful to be those of 
> any organization.
> 
> 
> 
> 
> --
> To unsubscribe, e-mail:   
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: 
> <mailto:lucene-user-help@jakarta.apache.org>
> 

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message