lucene-solr-user mailing list archives

From György Frivolt <gyorgy.friv...@gmail.com>
Subject Indexing HTML document
Date Tue, 02 Mar 2010 16:07:02 GMT
Hi, how do I properly index HTML documents? All the documents are HTML, and
some contain characters encoded as numeric character references, like
&#x17E;&#xED; ... Is there a character filter for filtering these codes? Is
there a way to strip the HTML tags out? Does Solr weight the terms in a
document based on where they appear? Words in headers (H1, H2, ...) should
describe the document better than words in paragraphs.

Thanks for help,

   Georg
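
To make the three requests above concrete, here is a minimal sketch in plain Python, outside of Solr, of the preprocessing being asked about: decoding character references such as &#x17E;&#xED;, stripping the tags, and keeping header text apart from paragraph text so the two could go into separate fields and be weighted differently. The names TextExtractor and split_html are hypothetical, and this is only one possible approach under those assumptions, not how Solr itself handles HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, keeping header text separate."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes references like &#x17E; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.heading_parts = []
        self.body_parts = []
        self._heading_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading_depth > 0:
            self._heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text inside H1..H6 goes to the heading bucket, the rest to body.
        if self._heading_depth:
            self.heading_parts.append(text)
        else:
            self.body_parts.append(text)


def split_html(html_source):
    """Return (heading_text, body_text) with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.heading_parts), " ".join(parser.body_parts)
```

For example, split_html("<h1>&#x17E;&#xED;t</h1><p>Hello <b>world</b></p>") yields the heading text "žít" and the body text "Hello world". The two strings could then be sent to two different index fields, which is one way to let header words count for more than paragraph words.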
