lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Split single string into several fields?
Date Wed, 28 Oct 2009 02:44:52 GMT
Will, I think this parsing of documents into different fields, is separate
and unrelated from lucene's analysis (tokenization)...
the analysis comes to play once you have a field, and you want to break the
text into indexable units (words, or entire field as token like your urls).

i wouldn't suggest make a big complicated analyzer that tries to parse html
in addition to breaking text into words, I would keep parsing and analysis
separate.
then i would handle different fields with different analyzers, i think Erick
already mentioned PerFieldAnalyzerWrapper, its useful for this.

if there is some performance consideration... it seems you are worried about
this wrt parsing, not actual analysis.... then maybe use the sax api?

On Tue, Oct 27, 2009 at 9:56 PM, Will Murnane <will.murnane@gmail.com>wrote:

> On Tue, Oct 27, 2009 at 21:21, Jake Mannix <jake.mannix@gmail.com> wrote:
> > On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
> >
> >> Could you go into your use case a bit more? Because I'm confused.
> >> Why don't you want your text tokenized? You say you want to search it,
> >> which means you have to analyze it.
> >
> >
> > I think Will is suggesting that he doesn't want to have to analyze it
> > *again* -
> > if he really has different fields for every tag type, it would get
> > prohibitively
> > expensive in terms of Indexing CPU usage to retokenize over and over
> > again.
> >
> > Is that what your concern is, Will?
> More or less.  Different types of tags need different tokenization:
> just as an example, I want to parse an img tag which contains a src
> attribute as a URL, and tokenize the URL as such (i.e., even if there
> are spaces they're treated as a unit), but the contents of a paragraph
> must be tokenized as English text.
>
> So I think the solution (because there's only one Analyzer per
> IndexWriter, and thus per document) is to do all the
> field-type-specific stuff outside of Lucene, and then use a very
> generic Analyzer, like the "\0"-splitter mentioned above.
>
> On Tue, Oct 27, 2009 at 21:12, Erick Erickson <erickerickson@gmail.com>
> wrote:
> > If you need different analyzers for each field, see
> PerFieldAnalyzerWrapper.
>
> That's very close to what I need, but I don't think it lines up quite
> right.  When I find some tokens inside an h1 tag (assume for
> simplicity that I only need to consider the innermost tag around a
> particular element) they won't be in the category for
> things-inside-h2-tags.  So I think trying to find all the things that
> are in h1 tags in one pass through the DOM tree, then things in h2
> tags in another, and so forth, will be slower than traversing the tree
> once and filing everything in its place myself, then feeding each list
> into Lucene as a field.
>
> So, in other words, I think using an individual Analyzer for each type
> of tag will be inefficient, so I'll run one big Analyzer, then put its
> results into Lucene.
>
> Will
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message