lucene-java-user mailing list archives

From Boris Aleksandrovsky
Subject header/footer identification and general scraping tools
Date Mon, 28 Jun 2010 20:06:24 GMT
I was wondering whether any of you know of open-source solutions for the
general problems that arise in web crawling: how do you remove headers,
footers, and JavaScript, and generally clean up the HTML of a web page before
indexing? We have a first-pass solution implemented in custom code, but this
must be a problem that many people face, so I am asking here.

