lucene-java-user mailing list archives

From Paul Libbrecht
Subject Re: Help Needed...
Date Thu, 28 May 2009 10:33:31 GMT

You'll have to build the documents yourself, after parsing the HTML on
your own (e.g. with NekoHTML into a DOM).
As for the weights of tokens, in addition to IDF, you can set them
per field, i.e. when you add a field to the document.
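
To make that concrete, here is a rough sketch of the idea, not a definitive
implementation. It uses only the JDK: the JDK's XML parser stands in for
NekoHTML, so it assumes well-formed XHTML input. The class name
TagFieldSketch, the method names, and the weight values are all hypothetical.
The Lucene part appears only as a comment and assumes the Lucene 2.4 Field
API (the Field constructor and setBoost).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: groups the text of a page by its enclosing tag.
final class TagFieldSketch {

    // Illustrative per-tag weights; the values are assumptions, tune them yourself.
    static float boostFor(String tag) {
        if (tag.equals("title")) return 3.0f;
        if (tag.equals("a")) return 1.5f;
        return 1.0f;
    }

    // Parses well-formed XHTML and returns tag name -> concatenated text.
    // With real-world HTML you would build the DOM with NekoHTML instead.
    static Map<String, String> textByTag(String xhtml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        collect(dom.getDocumentElement(), buffers);

        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString().trim());
        }
        return result;
    }

    // Walks the DOM; each text node is credited to its direct parent element.
    private static void collect(Node node, Map<String, StringBuilder> buffers) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    String tag = node.getNodeName().toLowerCase();
                    buffers.computeIfAbsent(tag, k -> new StringBuilder())
                           .append(text).append(' ');
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, buffers);
            }
        }
    }

    // For each entry you would then create one Lucene field, roughly
    // (assuming the Lucene 2.4 API, not compiled here):
    //   Field f = new Field(tag, text, Field.Store.YES, Field.Index.ANALYZED);
    //   f.setBoost(TagFieldSketch.boostFor(tag));
    //   doc.add(f);
}
```

Each map entry becomes one field named after the tag, and the per-field
weight is applied at the moment the field is added to the document.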


On 28 May 2009, at 12:22, Gaurav Kumar wrote:

> Hi everyone,
> I am doing a project using Lucene where I need to index HTML files. I am
> using Tika to parse the HTML files, but I need to index the files according
> to their tags, which means that the text inside each HTML tag (like <p> or
> <a>) should be stored in a different field. Can I do that? If yes, how?
> Also, can I assign different weights to the tokens in different fields?
> If yes, how?
