lucene-java-user mailing list archives

From Will Murnane <>
Subject Re: Split single string into several fields?
Date Wed, 28 Oct 2009 00:44:46 GMT
On Tue, Oct 27, 2009 at 19:17, Erick Erickson <> wrote:
> Unless I don't understand at all what you're going for, wouldn't
> it work to just put the HTML through some kind of parser (strict or
> loose depending on how well-formed your HTML is), then just
> extract the text from your document and push them into your
> Lucene document? Various parsers make this more or less
> simple...
That's more or less what I was suggesting.  The problem as I see it is
that Lucene wants to do its own tokenizing step.  I declared my
IndexWriter like this:
    writer = new IndexWriter(IndexDirectory, new MySpecialAnalyzer(),
                             true, MaxFieldLength.UNLIMITED);
and the code in the MySpecialAnalyzer class is indeed called later on.

So, I think this approach:
> domObj = parse(htmldocument);
> Document lucDoc = new Document();
> lucDoc.add("insideh1", domObj.getText(<dom path to H1>));
(etc.) won't work, because the text I put in will be analyzed again by
the IndexWriter's analyzer.

Perhaps I'll write a ZeroSplittingAnalyzer or something, do all the
work before I give anything to Lucene, then '\0'-join my tokens and
feed them to the simple analyzer.  So something like this:
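    Document doc = new Document();
    // Store/Index flags are only illustrative here; the point is the
    // '\0'-joined field values.
    doc.add(new Field("h1", "hello\0world",
                      Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("alltext", "hello\0world\0goodnight\0moon",
                      Field.Store.NO, Field.Index.ANALYZED));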

I think that makes sense.  Comments?
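
To make the pre-tokenizing half of that idea concrete, here is a rough
sketch in plain Java with no Lucene calls in it. Everything in it is made
up for illustration: the class name NulJoinedFields, the regex-based
insideH1/allText helpers (a crude stand-in for a real HTML parser), and
the whitespace tokenizer. It only shows building the two field values
from the example document, '\0'-joining them, and splitting them back
the way a ZeroSplittingAnalyzer would have to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class NulJoinedFields {
    static final String SEP = "\0";

    // Assumption: <h1> elements carry no attributes and are not nested.
    static final Pattern H1 =
        Pattern.compile("<h1>(.*?)</h1>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Stand-in tokenizer: split on runs of whitespace, drop empty tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Tokens that appear inside any <h1>...</h1> element.
    static List<String> insideH1(String doc) {
        List<String> out = new ArrayList<>();
        Matcher m = H1.matcher(doc);
        while (m.find()) out.addAll(tokenize(m.group(1)));
        return out;
    }

    // All tokens, with tags replaced by spaces first.
    static List<String> allText(String doc) {
        return tokenize(doc.replaceAll("<[^>]+>", " "));
    }

    // Join pre-split tokens into one field value.
    static String join(List<String> tokens) {
        return String.join(SEP, tokens);
    }

    // What the zero-splitting step would do: cut only on '\0'.
    static List<String> split(String joined) {
        if (joined.isEmpty()) return new ArrayList<>();
        return Arrays.asList(joined.split(Pattern.quote(SEP)));
    }

    public static void main(String[] args) {
        String doc = "<h1>Section title</h1>\nBody content";
        String h1Value = join(insideH1(doc));
        String allValue = join(allText(doc));
        System.out.println(split(h1Value));
        System.out.println(split(allValue));
    }
}
```

The joined strings are what would go into the "h1" and "alltext" fields.
The catch is that '\0' must never occur inside a token, so the
pre-tokenizer has to strip or reject it.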


> Erick
> On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <> wrote:
>> Hello list,
>>  I have some semi-structured text that has some markup elements, and
>> I want to put those elements into a separate field so I can search by
>> them.  For example (using HTML syntax):
>> ---- 8< ---- document
>> <h1>Section title</h1>
>> Body content
>> ---- >8 ----
>> I can find that the things inside <h1>s are "Section" and "title", and
>> "Body" and "content" are outside.  I want to create two fields for
>> this document:
>> insideh1 -> "Section", "title"
>> alltext -> "Section", "title", "Body", "content"
>> What's the best way to approach this?  My initial thought is to make
>> some kind of MultiAnalyzer that consumes the text and produces several
>> token streams, which are added to the document one at a time.  Is that
>> a reasonable strategy?
>> Thanks!
>> Will
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
