lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Split single string into several fields?
Date Wed, 28 Oct 2009 09:28:59 GMT
Robert Muir wrote:
> Will, I think this parsing of documents into different fields, is separate
> and unrelated from lucene's analysis (tokenization)...
> the analysis comes to play once you have a field, and you want to break the
> text into indexable units (words, or entire field as token like your urls).
> i wouldn't suggest make a big complicated analyzer that tries to parse html
> in addition to breaking text into words, I would keep parsing and analysis
> separate.
> then i would handle different fields with different analyzers, i think Erick
> already mentioned PerFieldAnalyzerWrapper, its useful for this.

It's also possible to do the tokenization ahead of time, i.e. before you
pass the document to IndexWriter. You can construct the TokenStream
using your own analysis chain and attach it with Field.setTokenStream()
(or the Field(String, TokenStream) constructor). This way you index
exactly the token stream you want, and you can even create other fields
in the document (or split this token stream into several fields).
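
To make the "split into several fields" idea concrete, here is a minimal sketch in plain Java, with no Lucene dependency. Everything in it is an assumption for illustration: the class name PreTokenizedSplit, the field names "url" and "body", and the routing rule (whitespace-separated pieces that start with http:// or https:// are kept whole as URL tokens, everything else is stripped of punctuation and lowercased). It only shows the pre-tokenization step that would run before the document reaches IndexWriter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PreTokenizedSplit {

    // Hypothetical field names, chosen for illustration only.
    static final String URL_FIELD = "url";
    static final String BODY_FIELD = "body";

    /**
     * Tokenizes the raw string once and routes each token to a field.
     * URLs are kept as single, unmodified tokens; other pieces are
     * stripped of non-alphanumeric characters and lowercased.
     */
    static Map<String, List<String>> splitIntoFields(String raw) {
        Map<String, List<String>> fields = new LinkedHashMap<String, List<String>>();
        fields.put(URL_FIELD, new ArrayList<String>());
        fields.put(BODY_FIELD, new ArrayList<String>());

        for (String piece : raw.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (piece.startsWith("http://") || piece.startsWith("https://")) {
                fields.get(URL_FIELD).add(piece);
            } else {
                String word = piece.replaceAll("[^\\p{L}\\p{N}]", "")
                                   .toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    fields.get(BODY_FIELD).add(word);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields =
                splitIntoFields("See http://lucene.apache.org/ for Docs.");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In a real indexer, each per-field token list would be wrapped in its own TokenStream and handed to the corresponding Field, so the tokens are indexed exactly as produced here rather than being re-analyzed.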

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
