lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Indexing the same field multiple times in a doc.
Date Tue, 15 Aug 2006 22:08:43 GMT
I think I just answered my own question. No, there's no difference.
Including running into an OutOfMemory error when adding the doc. It seems
like I can think of this exactly as concatenating all the chunks together
and indexing them all at once, including the limitations.

So even though it all seems to work for small examples, I don't think it
helps.

Never mind......

Erick

On 8/15/06, Erick Erickson <erickerickson@gmail.com> wrote:
>
> That's waaaaay cool. I'm not entirely clear on this point, but my test
> sure worked out well. So I thought I'd see if the behavior I'm seeing is
> expected. If so, way cool coding guys!
>
> Here's the root question: "Am I reasonably safe, for a single document, in
> thinking of indexing multiple chunks with the same field as being identical,
> for all practical purposes, with indexing the field once with all the chunks
> concatenated together?".
>
> Details follow...
>
> Background: I'm working with books now, so the 10,000 word limit isn't
> acceptable. It is, of course, easy to bump it up via
> IndexWriter.setMaxFieldLength, but I'd rather just break up the input
> stream since I don't have a realistic expectation of putting an upper limit
> on the size of a doc. So I experimented a bit with indexing the same field
> multiple times in a document. It seems to me that everything "just works",
> at least in my simple test case. This is one heck of a coincidence if not
> intended behavior <G>, so all I'm asking is if anybody is surprised by this
> (especially Otis, Chris, Erik, etc).
>
> Please ignore the Field.Store, I realize it will probably have to be NO
> for this app......
>
> Lucene is fine with indexing the same field multiple times, like this....
>
>     Document doc = new Document();
>     doc.add(new Field("tokens", "one two three", Field.Store.YES,
> Field.Index.TOKENIZED));
>
>     doc.add(new Field("tokens", "four five six", Field.Store.YES,
> Field.Index.TOKENIZED ));
>     writer.addDocument(doc);
>
> What surprised me a bit is that SpanQueries work just fine this way. If I
> create a span query for "two" and "five", this doc is found for some slop
> factors and not found for other slop factors, just as though I indexed the
> "tokens" field once with "one two three four five six".
>
> Boolean MUST clauses work as well.
>
> So, are there any "gotchas" that spring to mind with the notion of
> chunking the input to < 10,000 words and indexing the chunks multiple times
> in the same field? Let me be clear I'm just beginning to design this, so all
> I'm thinking about here is whether indexing the same field multiple times is
> fraught with danger <G>, I have a bunch of details to work out about what
> the requirements are and whether I really *want* to do this, but that's my
> problem.
>
> Thanks
> Erick
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message