lucene-java-user mailing list archives

From Erik Hatcher <li...@ehatchersolutions.com>
Subject Re: Heuristics on searching HTML Documents ?
Date Mon, 30 Dec 2002 14:01:56 GMT
If you have control over the HTML, how about marking the navbar pieces
with a certain CSS class and then filtering those elements out of what
you index?  That seems like a reasonable way to filter it, provided of
course that it's your HTML and not someone else's.
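
As a rough illustration of the filtering step, here is a minimal sketch in Python using only the standard library's html.parser. It collects the text of a page while skipping every element marked with a given CSS class. The class name "navbar" and the names IndexableTextExtractor and extract_indexable_text are assumptions made for this example, and the sketch assumes reasonably well-formed HTML. It does not touch Lucene; the returned string is simply what you would hand to your indexer.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableTextExtractor(HTMLParser):
    """Collects text, ignoring anything inside an element with skip_class."""

    def __init__(self, skip_class="navbar"):  # "navbar" is an assumed class name
        super().__init__()
        self.skip_class = skip_class
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a marked element, and keep counting nested tags
        # so we know when the marked element itself is closed.
        if self.skip_depth or self.skip_class in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_indexable_text(html, skip_class="navbar"):
    """Return the page text with the marked elements filtered out."""
    parser = IndexableTextExtractor(skip_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counter relies on start and end tags being balanced inside the marked element, so badly broken markup may need a more forgiving parser.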

	Erik

On Monday, December 30, 2002, at 05:58  AM, Mailing Lists Account wrote:

> Hi,
>
> We use Lucene to index and search HTML documents.  We extract
> all text content from the HTML documents and index it.
> While searching the documents, we found in several instances that
> the matched search terms are in the navbar section. Since they are in
> the navbar, almost all pages on that site end up in the search results.
>
> I was wondering if there are any documented methods/heuristics to avoid
> searching certain portions of an HTML document, such as navbars and
> footers.
>
> Technically, it is all HTML, so I assume that there is no
> straightforward method to do that.  I observed that search engines
> like Google do not do anything like the above and end up searching
> the navbar and footer portions of the page too.
>
> I also understand that even if there are some heuristics, they are not
> likely to work with all HTML pages.
>
> Since navbar items are typically links, is it feasible to attach some
> weight to different fragments of the text as they are retrieved (e.g.,
> if a text fragment is part of a link, give it lower priority than
> other fragments) and index accordingly?
>
> Any pointers/clues? Has any research been done on this subject?
>
> thanks
> Ramesh
>
>
>
>
> --
> To unsubscribe, e-mail:   
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: 
> <mailto:lucene-user-help@jakarta.apache.org>
>
>

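The weighting idea raised in the quoted message (treat text inside links as less important than other text) could be approached by splitting a page into link text and non-link text before indexing. The sketch below, in Python with only the standard library, shows just that splitting step. The names LinkTextSplitter and split_link_text are hypothetical, and how the two strings are then weighted is left to the indexer; nothing here is Lucene API.

```python
from html.parser import HTMLParser


class LinkTextSplitter(HTMLParser):
    """Separates text found inside <a> elements from all other text."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while we are inside an <a> element
        self.body_text = []
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.link_text if self.link_depth else self.body_text
            target.append(text)


def split_link_text(html):
    """Return (body_text, link_text) as two space-joined strings."""
    parser = LinkTextSplitter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.body_text), " ".join(parser.link_text)
```

One option, under these assumptions, is to index the two strings as separate fields so that matches in link text can count for less at search time.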
