lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Adding IndexOutput.writeByte(byte b, int length)
Date Mon, 22 Jun 2009 15:34:20 GMT
I indexed 200K docs with fields indexed as ANALYZED (which includes norms),
but the fields were sparse. The "holes" I saw ran to thousands of bytes
(sometimes even 80K). Now that I understand this better, I realize that
particular indexing code is incorrect, and I should have disabled norms.
After I did that, performance really improved.

So, since the case for it rested on buggy indexing code, this fix is not
needed. And I guess large "holes" really indicate a bug rather than a common
scenario. So I take this proposal back :).
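
For reference, here is a minimal sketch of what the withdrawn
writeByte(byte b, int length) idea from the quoted message below could look
like. FillingOutput is a hypothetical stand-in, not Lucene's
BufferedIndexOutput, and the in-memory sink and buffer size are assumptions
made only so the example runs on its own. The bulk variant fills the buffer
with Arrays.fill one chunk at a time instead of making one call per byte.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical stand-in for a buffered output; NOT Lucene's BufferedIndexOutput.
class FillingOutput {
    private final byte[] buffer;
    private int upto = 0; // next free position in buffer
    // Assumption: an in-memory sink replaces the real file so the sketch is self-contained.
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    FillingOutput(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        buffer = new byte[bufferSize];
    }

    // Single-byte write, as used in a per-byte loop.
    void writeByte(byte b) {
        if (upto == buffer.length) {
            flushBuffer();
        }
        buffer[upto++] = b;
    }

    // Proposed bulk variant: writes b exactly length times (no-op if length <= 0).
    void writeByte(byte b, int length) {
        while (length > 0) {
            if (upto == buffer.length) {
                flushBuffer();
            }
            int chunk = Math.min(length, buffer.length - upto);
            Arrays.fill(buffer, upto, upto + chunk, b); // fill one chunk in a single call
            upto += chunk;
            length -= chunk;
        }
    }

    private void flushBuffer() {
        sink.write(buffer, 0, upto);
        upto = 0;
    }

    // Flushes pending bytes and returns everything written so far.
    byte[] toByteArray() {
        flushBuffer();
        return sink.toByteArray();
    }
}
```

Calling writeByte(b, 80000) here produces the same bytes as 80000 calls to
writeByte(b); only the number of method calls differs.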

The code I've used is from benchmark, TrecContentSource, which takes all the
<meta> tags from the HTML files and puts them as properties on DocData, and
DocMaker later on adds them to the Document. That's what created the
sparseness. I think I'm going to add two things to benchmark:
1. Add a doc.tokenized.norms property; if it is set to false, fields will be
indexed with Index.ANALYZED_NO_NORMS or Index.NOT_ANALYZED_NO_NORMS.
2. Add to TrecContentSource a keep.properties attribute, which, if set to
false, will set DocData.props to null. I think for TREC, it really doesn't
make sense to index all the <meta> tags.
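
A rough sketch of how these two settings could be interpreted. Everything
here is an assumption for illustration: IndexMode only mirrors the names of
Lucene's Field.Index constants, java.util.Properties stands in for the
benchmark's own configuration class, DocMakerConfig is a hypothetical
helper, and the defaults of "true" are guesses, as is pairing the new
property with a doc.tokenized setting.

```java
import java.util.Properties;

// Stand-in that mirrors the names of Lucene's Field.Index constants.
enum IndexMode { ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED, NOT_ANALYZED_NO_NORMS }

// Hypothetical helper; not part of the benchmark package.
class DocMakerConfig {

    // Picks the index mode from doc.tokenized and the proposed doc.tokenized.norms.
    static IndexMode indexMode(Properties config) {
        boolean tokenized = Boolean.parseBoolean(config.getProperty("doc.tokenized", "true"));
        boolean norms = Boolean.parseBoolean(config.getProperty("doc.tokenized.norms", "true"));
        if (tokenized) {
            return norms ? IndexMode.ANALYZED : IndexMode.ANALYZED_NO_NORMS;
        }
        return norms ? IndexMode.NOT_ANALYZED : IndexMode.NOT_ANALYZED_NO_NORMS;
    }

    // Returns the <meta>-derived properties, or null when keep.properties is false.
    static Properties filterProps(Properties config, Properties docProps) {
        boolean keep = Boolean.parseBoolean(config.getProperty("keep.properties", "true"));
        return keep ? docProps : null;
    }
}
```

With doc.tokenized.norms=false and keep.properties=false, no field would
carry norms and none of the <meta> tags would reach the Document.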

Shai

On Mon, Jun 22, 2009 at 5:10 PM, Michael McCandless <lucene@mikemccandless.com> wrote:

> This code isn't invoked that often, I believe.  It only happens when
> there are "holes" in the norms between docs, i.e., you have a field that
> has norms enabled (at least one Document had this Field w/ norms
> enabled in the past), but then you had a series of Docs that had
> disabled norms for the field, and so you must fill the hole (since
> norms aren't sparse).
>
> So I think in practice it won't help much?  (And, writing long series
> of the same byte is something in general we shouldn't "try" to do ;)
> So I'm not sure I want a public API "inviting" it).
>
> Mike
>
> On Mon, Jun 22, 2009 at 9:04 AM, Shai Erera<serera@gmail.com> wrote:
> > I'm testing the performance of some indexing code and noticed that
> > NormsWriter.flush() calls IndexOutput.writeByte(defaultNorm) in a loop,
> > writing the same norm every time (lines: 139-140, 157-158, 162-163).
> >
> > In the run where I spotted it, this happened a few thousand times (I mean
> > a few thousand writeByte calls).
> >
> > I was thinking that if we had writeByte(byte b, int length) in
> > IndexOutput, we could call it once and handle it efficiently where
> > possible. For back-compat, the default impl would just loop and call
> > writeByte(b), but others, like BufferedIndexOutput, could fill the array
> > with b, length times. We won't use System.arraycopy, which is faster, but
> > we won't call writeByte thousands of times either.
> >
> > What do you think?
> >
> > Shai
> >
>
