lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Split single string into several fields?
Date Tue, 27 Oct 2009 23:17:34 GMT
Unless I've misunderstood what you're going for, wouldn't
it work to put the HTML through some kind of parser (strict or
loose, depending on how well-formed your HTML is), then
extract the text from the parsed document and push it into your
Lucene document? Various parsers make this more or less
simple...

Something like, for each document:

// pseudocode: parse() and getText() stand in for whatever your parser provides,
// and lucDoc.add(name, text) is shorthand for adding a Field with that name
domObj = parse(htmldocument);
Document lucDoc = new Document();

lucDoc.add("insideh1", domObj.getText(<dom path to H1>));
lucDoc.add("insideh1", domObj.getText(<dom path to title>));

lucDoc.add("alltext", <like above>);
lucDoc.add("alltext", <like above>);
.
.
.
<add document to lucene index>
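
To make the extraction step concrete, here is a minimal sketch that uses only the JDK. It assumes the loose Swing HTML parser (`ParserDelegator`) as a stand-in for "some kind of parser"; the class name `HtmlFieldExtractor` and the method `extract` are hypothetical names made up for this illustration. It collects the text found inside `<h1>` elements and all text, keyed by the field names from the question. The Lucene side is deliberately left out so the sketch has no dependencies; each collected value would then be added to the Lucene document as a field of that name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene: maps field names to extracted text chunks.
class HtmlFieldExtractor {

    static Map<String, List<String>> extract(String html) {
        final List<String> insideH1 = new ArrayList<>();
        final List<String> allText = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int h1Depth = 0; // greater than zero while inside an <h1>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.H1) {
                    h1Depth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.H1 && h1Depth > 0) {
                    h1Depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (text.isEmpty()) {
                    return; // skip whitespace-only chunks
                }
                allText.add(text);          // every chunk goes into the catch-all field
                if (h1Depth > 0) {
                    insideH1.add(text);     // heading text also goes into its own field
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declaration in the markup
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("insideh1", insideH1);
        fields.put("alltext", allText);
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Section title</h1><p>Body content</p></body></html>";
        for (Map.Entry<String, List<String>> field : extract(html).entrySet()) {
            System.out.println(field.getKey() + " -> " + field.getValue());
        }
    }
}
```

The values are kept as whole text chunks rather than split into words, on the assumption that the analyzer configured for the index does the tokenizing when the fields are added.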

HTH
Erick


On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <will.murnane@gmail.com> wrote:

> Hello list,
>  I have some semi-structured text that has some markup elements, and
> I want to put those elements into a separate field so I can search by
> them.  For example (using HTML syntax):
> ---- 8< ---- document
> <h1>Section title</h1>
> Body content
> ---- >8 ----
> I can find that the things inside <h1>s are "Section" and "title", and
> "Body" and "content" are outside.  I want to create two fields for
> this document:
> insideh1 -> "Section", "title"
> alltext -> "Section", "title", "Body", "content"
>
> What's the best way to approach this?  My initial thought is to make
> some kind of MultiAnalyzer that consumes the text and produces several
> token streams, which are added to the document one at a time.  Is that
> a reasonable strategy?
>
> Thanks!
> Will
