lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Indexing the same field multiple times in a doc.
Date Tue, 15 Aug 2006 21:10:28 GMT
That's waaaaay cool. I'm not entirely clear on this point, but my test sure
worked out well. So I thought I'd see if the behavior I'm seeing is
expected. If so, way cool coding guys!

Here's the root question: "Am I reasonably safe, for a single document, in
thinking of indexing multiple chunks with the same field as being identical,
for all practical purposes, with indexing the field once with all the chunks
concatenated together?".

Details follow...

Background: I'm working with books now, so the 10,000 word limit isn't
acceptable. It is, of course, easy to bump it up via
IndexWriter.setMaxFieldLength, but I'd rather just break up the input stream
since I don't have a realistic expectation of putting an upper limit on the
size of a doc. So I experimented a bit with indexing the same field multiple
times in a document. It seems to me that everything "just works", at least
in my simple test case. This is one heck of a coincidence if not intended
behavior <G>, so all I'm asking is if anybody is surprised by this
(especially Otis, Chris, Erik, etc).

Please ignore the Field.Store, I realize it will probably have to be NO for
this app......

Lucene is fine with indexing the same field multiple times, like this....

    Document doc = new Document();
    doc.add(new Field("tokens", "one two three", Field.Store.YES,
Field.Index.TOKENIZED));

    doc.add(new Field("tokens", "four five six", Field.Store.YES,
Field.Index.TOKENIZED));
    writer.addDocument(doc);

What surprised me a bit is that SpanQueries work just fine this way. If I
create a span query for "two" and "five", this doc is found for some slop
factors and not found for other slop factors, just as though I indexed the
"tokens" field once with "one two three four five six".

Boolean MUST clauses work as well.

So, are there any "gotchas" that spring to mind with the notion of chunking
the input to < 10,000 words and indexing the chunks multiple times in the
same field? Let me be clear I'm just beginning to design this, so all I'm
thinking about here is whether indexing the same field multiple times is
fraught with danger <G>, I have a bunch of details to work out about what
the requirements are and whether I really *want* to do this, but that's my
problem.

Thanks
Erick

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message