lucene-java-user mailing list archives

From Will Murnane
Subject Re: Split single string into several fields?
Date Wed, 28 Oct 2009 01:56:39 GMT
On Tue, Oct 27, 2009 at 21:21, Jake Mannix wrote:
> On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson wrote:
>> Could you go into your use case a bit more? Because I'm confused.
>> Why don't you want your text tokenized? You say you want to search it,
>> which means you have to analyze it.
> I think Will is suggesting that he doesn't want to have to analyze it
> *again* - if he really has different fields for every tag type, it would
> get prohibitively expensive in terms of indexing CPU usage to retokenize
> over and over again.
> Is that what your concern is, Will?
More or less.  Different types of tags need different tokenization:
just as an example, I want to treat the src attribute of an img tag as
a URL and tokenize it as such (i.e., even if it contains spaces, it is
kept as a single unit), whereas the contents of a paragraph must be
tokenized as English text.

So I think the solution (because there's only one Analyzer per
IndexWriter, and thus per document) is to do all the
field-type-specific stuff outside of Lucene, and then use a very
generic Analyzer, like the "\0"-splitter mentioned above.
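
As a rough sketch of that idea (not code from this thread): tokens are
produced outside Lucene, joined with a NUL character, and a trivial
splitter recovers them later.  The class and method names are
hypothetical, and the whitespace split only stands in for real English
tokenization.  Plain Java is used so the sketch runs without Lucene; a
real version would do the splitting inside a custom Tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: carries pre-tokenized values through one string field.
class NulDelimitedTokens {
    private static final char SEP = '\0';

    // Join tokens that were already produced by field-specific logic.
    // Empty tokens are skipped; a token containing NUL is rejected.
    static String join(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            if (t.indexOf(SEP) >= 0) {
                throw new IllegalArgumentException("token contains NUL");
            }
            if (!first) sb.append(SEP);
            sb.append(t);
            first = false;
        }
        return sb.toString();
    }

    // The "very generic" step: split on NUL only, never on spaces.
    static List<String> split(String joined) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= joined.length(); i++) {
            if (i == joined.length() || joined.charAt(i) == SEP) {
                if (i > start) out.add(joined.substring(start, i));
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        // A URL stays one token even though it contains a space.
        tokens.add("http://example.com/my picture.png");
        // Stand-in for English tokenization of paragraph text.
        tokens.addAll(Arrays.asList("some english text".split("\\s+")));
        String fieldValue = join(tokens);
        for (String t : split(fieldValue)) {
            System.out.println("[" + t + "]");
        }
    }
}
```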

On Tue, Oct 27, 2009 at 21:12, Erick Erickson wrote:
> If you need different analyzers for each field, see PerFieldAnalyzerWrapper.

That's very close to what I need, but I don't think it lines up quite
right.  When I find some tokens inside an h1 tag (assume for
simplicity that I only need to consider the innermost tag around a
particular element) they won't be in the category for
things-inside-h2-tags.  So I think trying to find all the things that
are in h1 tags in one pass through the DOM tree, then things in h2
tags in another, and so forth, will be slower than traversing the tree
once and filing everything in its place myself, then feeding each list
into Lucene as a field.
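
The single traversal described above might look roughly like this.  It
is only a sketch under some assumptions: the input is well-formed XHTML
that the JDK's DOM parser can read, the class name, method names, and
the "img.src" bucket key are made up for illustration, and each bucket
would then be added to the Lucene document as its own field.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: one pass over the DOM, filing text by innermost tag.
class TagBuckets {

    static Map<String, List<String>> bucketByInnermostTag(String xhtml) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            walk(doc.getDocumentElement(), buckets);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return buckets;
    }

    // Each text node is filed under its direct parent element, which is
    // the innermost tag around it.  Every node is visited exactly once.
    private static void walk(Node node, Map<String, List<String>> buckets) {
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    add(buckets, node.getNodeName(), text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (child.getNodeName().equals("img")) {
                    // Keep the whole URL as one value, spaces included.
                    String src = ((Element) child).getAttribute("src");
                    if (!src.isEmpty()) add(buckets, "img.src", src);
                }
                walk(child, buckets);
            }
        }
    }

    private static void add(Map<String, List<String>> buckets,
                            String key, String value) {
        buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1>"
                + "<p>Some <b>bold</b> text</p>"
                + "<img src=\"http://example.com/a b.png\"/>"
                + "<h2>Sub</h2></body></html>";
        bucketByInnermostTag(page).forEach(
                (tag, texts) -> System.out.println(tag + " -> " + texts));
    }
}
```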

So, in other words, I think using an individual Analyzer for each type
of tag will be inefficient, so I'll run one big analysis pass of my own
over the document, then feed its results into Lucene.

