lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: Split single string into several fields?
Date Wed, 28 Oct 2009 02:11:30 GMT
Not sure if it completely applies here, but you might also have a look  
at the TeeSinkTokenFilter in the contrib/analysis package.  It is  
designed to tee/sink tokens off from one main field to other fields.

On Oct 27, 2009, at 9:56 PM, Will Murnane wrote:

> On Tue, Oct 27, 2009 at 21:21, Jake Mannix wrote:
>> On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson wrote:
>>> Could you go into your use case a bit more? Because I'm confused.
>>> Why don't you want your text tokenized? You say you want to search it,
>>> which means you have to analyze it.
>> I think Will is suggesting that he doesn't want to have to analyze it
>> *again* - if he really has different fields for every tag type, it would
>> get prohibitively expensive in terms of Indexing CPU usage to retokenize
>> over and over again.
>> Is that what your concern is, Will?
> More or less.  Different types of tags need different tokenization:
> just as an example, I want to parse an img tag which contains a src
> attribute as a URL, and tokenize the URL as such (i.e., even if there
> are spaces they're treated as a unit), but the contents of a paragraph
> must be tokenized as English text.
> So I think the solution (because there's only one Analyzer per
> IndexWriter, and thus per document) is to do all the
> field-type-specific stuff outside of Lucene, and then use a very
> generic Analyzer, like the "\0"-splitter mentioned above.
> On Tue, Oct 27, 2009 at 21:12, Erick Erickson wrote:
>> If you need different analyzers for each field, see PerFieldAnalyzerWrapper.
> That's very close to what I need, but I don't think it lines up quite
> right.  When I find some tokens inside an h1 tag (assume for
> simplicity that I only need to consider the innermost tag around a
> particular element) they won't be in the category for
> things-inside-h2-tags.  So I think trying to find all the things that
> are in h1 tags in one pass through the DOM tree, then things in h2
> tags in another, and so forth, will be slower than traversing the tree
> once and filing everything in its place myself, then feeding each list
> into Lucene as a field.
> So, in other words, I think using an individual Analyzer for each type
> of tag will be inefficient, so I'll run one big Analyzer, then put its
> results into Lucene.
> Will
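
For illustration only, the single-pass approach Will describes (walk the tree once, file each piece of text under its innermost tag, then hand each list to Lucene as a field) might look roughly like the sketch below. It uses only the JDK's DOM parser and assumes well-formed XHTML input. The class name TagFieldCollector, the "img.src" field name, and the joinForSplitter helper are hypothetical and are not part of Lucene.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Lucene: walks a DOM tree once and files
// each piece of text under the name of its innermost enclosing tag.
class TagFieldCollector {

    // Parses well-formed XHTML and returns field name -> values found.
    static Map<String, List<String>> collect(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            Map<String, List<String>> fields = new LinkedHashMap<>();
            walk(root, fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
    }

    private static void walk(Node node, Map<String, List<String>> fields) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // The parent of a text node is its innermost enclosing tag.
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                add(fields, node.getParentNode().getNodeName(), text);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "img".equals(node.getNodeName())) {
            // Keep the whole URL as one value, spaces included.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                add(fields, "img.src", src);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, fields);
        }
    }

    private static void add(Map<String, List<String>> fields,
                            String field, String value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    // Joins values with NUL so a generic "\0"-splitting analyzer
    // could recover them as separate tokens.
    static String joinForSplitter(List<String> values) {
        return String.join("\0", values);
    }
}
```

Each resulting list could then be added to a Lucene Document as its own field. That step is left out here because the exact calls depend on the Lucene version in use. This sketch also does no tokenization of its own: paragraph text is filed as whole strings, so any English-text analysis would still have to happen before the values are joined.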

Grant Ingersoll

