lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: clean up html before indexing or add tags to ignore list
Date Thu, 13 May 2004 17:57:35 GMT
Lynx rulez!

	<http://www.blogscene.org/erik?q=lynx>


On May 13, 2004, at 12:05 PM, Andy Goodell wrote:

> If you are running Linux, I recommend running the program lynx with the
> -dump option before indexing with Lucene. It dumps the formatted text
> without the tags, and it runs very fast in most cases.
>
> - andy g
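
A minimal sketch of the lynx approach described above, driven from Java so the output can be handed to an indexer. It assumes lynx is installed and on the PATH; the class and method names are hypothetical, the -nolist flag (which suppresses the link list at the end of the dump) is an addition not mentioned in the thread, and decoding the output as UTF-8 is an assumption about the local lynx configuration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class LynxDump {
    // Builds the command line. Assumes "lynx" is on the PATH.
    // -dump prints the rendered text to stdout; -nolist omits the link list.
    static List<String> lynxCommand(String htmlPath) {
        return List.of("lynx", "-dump", "-nolist", htmlPath);
    }

    // Runs lynx on a local HTML file and returns the text it prints.
    static String dumpText(String htmlPath) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(lynxCommand(htmlPath))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Assumption: lynx is configured to emit UTF-8.
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("lynx exited with status " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java LynxDump page.html
        System.out.println(dumpText(args[0]));
    }
}
```

Starting one process per document adds overhead, so for a large crawl it may be worth comparing this against an in-process parser.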
>
> On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic
> <otis_gospodnetic@yahoo.com> wrote:
>>
>> Clean up seems cleaner.  Just extract the textual information from HTML
>> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
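
As a rough illustration of the clean-up step, the sketch below pulls the text out of an HTML document before indexing. It does not use any of the libraries named above; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html) so that it runs without extra jars. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlText {
    // Collects the text nodes the parser reports and drops the tags.
    static String extractText(Reader html) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        // The third argument tells the parser to ignore any charset directive.
        new ParserDelegator().parse(html, callback, true);
        // Collapse runs of whitespace so the result is easy to tokenize.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><h1>Hello</h1><p>Lucene <b>rocks</b></p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```

The returned string is what would be passed to the indexer in place of the raw markup. A dedicated parser such as the ones named above is likely to cope better with badly malformed pages.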
>>
>> You can also get fancy and preserve the 'structural' information (e.g.
>> H1 text is more important than H2, which is more important than BODY,
>> which is more important than DIV, etc.) and combine it with field
>> boosting at index time.
>>
>> Otis
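
The structural idea above could be sketched as follows: split the page text into one bucket per structural level, then give each bucket its own weight. This again uses the JDK's built-in HTML parser rather than the libraries named in the thread. The field names, the class name, and the weights in BOOSTS are hypothetical, and the Lucene side (adding each bucket as a separate field and applying the weight as a field boost at index time) is left out because it needs the Lucene jar.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class StructuredHtml {
    // Hypothetical weights; real values would need tuning.
    static final Map<String, Float> BOOSTS = Map.of("h1", 3.0f, "h2", 2.0f, "body", 1.0f);

    // Returns the text of the page split into "h1", "h2", and "body" buckets.
    static Map<String, String> splitByStructure(String html) throws IOException {
        Map<String, StringBuilder> parts = new LinkedHashMap<>();
        for (String field : new String[] {"h1", "h2", "body"}) {
            parts.put(field, new StringBuilder());
        }
        int[] depth = new int[2]; // currently open <h1> and <h2> elements

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t.equals(HTML.Tag.H1)) depth[0]++;
                if (t.equals(HTML.Tag.H2)) depth[1]++;
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t.equals(HTML.Tag.H1) && depth[0] > 0) depth[0]--;
                if (t.equals(HTML.Tag.H2) && depth[1] > 0) depth[1]--;
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text inside a heading goes to that heading's bucket, the rest to "body".
                String field = depth[0] > 0 ? "h1" : depth[1] > 0 ? "h2" : "body";
                parts.get(field).append(data).append(' ');
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);

        Map<String, String> result = new LinkedHashMap<>();
        parts.forEach((k, v) -> result.put(k, v.toString().replaceAll("\\s+", " ").trim()));
        return result;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><h1>Lucene</h1><h2>Indexing</h2><p>Strip tags first.</p></body></html>";
        splitByStructure(page).forEach((field, text) ->
                System.out.println(field + " (boost " + BOOSTS.get(field) + "): " + text));
    }
}
```

Keeping the buckets as separate fields also leaves the option of weighting them at query time instead of at index time.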
>>
>>
>>
>> --- Sebastian Ho <sebastianh@bii.a-star.edu.sg> wrote:
>>> Hi
>>>
>>> This is a typical web crawler, indexing, and search application.
>>> I have written my crawler and am planning to add Lucene next.
>>> One question comes to mind: in terms of performance, should I clean up
>>> the HTML by removing all tags before indexing, or add all the tags to
>>> the ignore list at the indexing/search stage?
>>>
>>> Which is better?
>>>
>>> Thanks
>>>
>>> Sebastian Ho
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>

