lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Split single string into several fields?
Date Wed, 28 Oct 2009 01:12:50 GMT
Could you go into your use case a bit more? Because I'm confused.
Why don't you want your text tokenized? You say you want to search it,
which means you have to analyze it. All I'm suggesting is passing the text
from whatever HTML element into the analyzer, without the surrounding
markup. I'm suggesting that you might be able to use the analyzers
Lucene provides and just pass in text strings, without any need to create
your own analyzer.
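
A minimal sketch of that idea, using only the Java standard library. The class name FieldTextExtractor, its methods, and the toy regex that stands in for a real HTML parser are all assumptions made for illustration; none of this is Lucene API. It only shows the split into one plain string per field, with the markup already removed:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: splits one marked-up string
// into plain-text values keyed by the field they are destined for.
class FieldTextExtractor {

    // Toy pattern for the content of each <h1>...</h1> element.
    // A real HTML parser would be more robust than a regex.
    private static final Pattern H1 =
        Pattern.compile("<h1>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any tag so it can be replaced by a space.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static Map<String, String> extractFields(String html) {
        // Collect the text of every <h1> element, space-separated.
        StringBuilder insideH1 = new StringBuilder();
        Matcher m = H1.matcher(html);
        while (m.find()) {
            if (insideH1.length() > 0) {
                insideH1.append(' ');
            }
            insideH1.append(stripTags(m.group(1)));
        }

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("insideh1", insideH1.toString());
        fields.put("alltext", stripTags(html));
        return fields;
    }

    // Replaces tags with spaces, then collapses runs of whitespace.
    private static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Map<String, String> fields =
            extractFields("<h1>Section title</h1>\nBody content");
        // prints {insideh1=Section title, alltext=Section title Body content}
        System.out.println(fields);
    }
}
```

Each value would then go into its own field on the Lucene document, and whichever stock analyzer the IndexWriter was opened with would tokenize it at index time.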

If you need different analyzers for each field, see PerFieldAnalyzerWrapper.

Best
Erick

On Tue, Oct 27, 2009 at 8:44 PM, Will Murnane <will.murnane@gmail.com> wrote:

> On Tue, Oct 27, 2009 at 19:17, Erick Erickson <erickerickson@gmail.com>
> wrote:
> > Unless I don't understand at all what you're going for, wouldn't
> > it work to just put the HTML through some kind of parser (strict or
> > loose depending on how well-formed your HTML is), then just
> > extract the text from your document and push it into your
> > Lucene document? Various parsers make this more or less
> > simple...
> That's more or less what I was suggesting.  The problem as I see it is
> that Lucene wants to do its own tokenizing step.  I declared my
> IndexWriter like this:
> writer = new IndexWriter(IndexDirectory, new MySpecialAnalyzer(),
> true, MaxFieldLength.UNLIMITED);
> and the code in the MySpecialAnalyzer class is indeed called later on.
>
> So, I think this approach:
> > domObj = parse(htmldocument);
> > Document lucDoc = new Document();
> > lucDoc.add("insideh1", domObj.getText(<dom path to H1>));
> (etc) won't work, because when I put that text in it'll be analyzed again.
>
> Perhaps I'll write a ZeroSplittingAnalyzer or something, do all the
> work before I give anything to Lucene, then '\0'-join my tokens and
> feed them to the simple analyzer.  So something like this:
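> Document doc = new Document();
> // Store.NO and Index.ANALYZED are assumed here; adjust as needed.
> doc.add(new Field("h1", "hello\0world",
>     Field.Store.NO, Field.Index.ANALYZED));
> doc.add(new Field("alltext", "hello\0world\0goodnight\0moon",
>     Field.Store.NO, Field.Index.ANALYZED));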
>
> I think that makes sense.  Comments?
>
> Will
>
> >
> > HTH
> > Erick
> >
> >
> > On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <will.murnane@gmail.com> wrote:
> >
> >> Hello list,
> >>  I have some semi-structured text that has some markup elements, and
> >> I want to put those elements into a separate field so I can search by
> >> them.  For example (using HTML syntax):
> >> ---- 8< ---- document
> >> <h1>Section title</h1>
> >> Body content
> >> ---- >8 ----
> >> I can find that the things inside <h1>s are "Section" and "title", and
> >> "Body" and "content" are outside.  I want to create two fields for
> >> this document:
> >> insideh1 -> "Section", "title"
> >> alltext -> "Section", "title", "Body", "content"
> >>
> >> What's the best way to approach this?  My initial thought is to make
> >> some kind of MultiAnalyzer that consumes the text and produces several
> >> token streams, which are added to the document one at a time.  Is that
> >> a reasonable strategy?
> >>
> >> Thanks!
> >> Will
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
