lucene-java-user mailing list archives

From Duke DAI <duke.dai....@gmail.com>
Subject Re: Hardcoded checksum mechanism in BlockTreeTermsReader
Date Mon, 26 Dec 2016 02:50:14 GMT
Thanks Mike and Uwe for your detailed explanation.

In my usage there will be 5 major segments, and the tip files total about
20 MB. It seems all data in the tip files will be accessed after the check,
so the check also works as a warmup. Since the performance penalty is
trivial and the benefit is obvious, I've added a checksum to my customized
IndexOutput.
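
A minimal sketch of the idea, using only java.util.zip rather than Lucene's IndexOutput/IndexInput API. The class and method names (ChecksumFooterSketch, writeWithFooter, readAndVerify) are hypothetical, and the footer here is assumed to be just the 8-byte CRC32 value, which is simpler than the footer CodecUtil writes. It shows the checksum being accumulated as a by-product of normal writes and reads, with the comparison done once at the end:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

/** Hypothetical helper for illustration only; not part of Lucene. */
class ChecksumFooterSketch {

    /** Writes the payload, then appends the CRC32 of the payload as an 8-byte footer. */
    static byte[] writeWithFooter(byte[] payload) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            CheckedOutputStream checked = new CheckedOutputStream(sink, new CRC32());
            DataOutputStream out = new DataOutputStream(checked);
            out.write(payload);                           // CRC is updated as bytes pass through
            long crc = checked.getChecksum().getValue();  // capture before the footer is written
            out.writeLong(crc);                           // assumed footer layout: CRC32 as a long
            out.flush();
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the payload normally and verifies the footer once, at the end of the read. */
    static byte[] readAndVerify(byte[] file) {
        int payloadLength = file.length - Long.BYTES;
        if (payloadLength < 0) {
            throw new IllegalStateException("file too short to hold a footer");
        }
        try {
            CheckedInputStream checked =
                new CheckedInputStream(new ByteArrayInputStream(file), new CRC32());
            DataInputStream in = new DataInputStream(checked);
            byte[] payload = new byte[payloadLength];
            in.readFully(payload);                           // ordinary read; CRC accumulates as a side effect
            long actual = checked.getChecksum().getValue();  // capture before reading the footer
            long expected = in.readLong();
            if (actual != expected) {
                throw new IllegalStateException(
                    "checksum mismatch: expected=" + expected + " actual=" + actual);
            }
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = writeWithFooter("terms index bytes".getBytes(StandardCharsets.UTF_8));
        byte[] payload = readAndVerify(file);
        System.out.println("verified " + payload.length + " payload bytes");
    }
}
```

Because the reader touches every payload byte anyway, the verification adds no extra pass over the data, which is the by-product behaviour discussed in this thread.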

Thanks for your help and happy jolly holidays!



Best regards,
Duke
If not now, when? If not me, who?

On Tue, Dec 6, 2016 at 9:39 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> The checksum is also written for a second reason: Java VMs often have
> optimization bugs (you may know the Java 7 GA disaster and Java 7u40 vector
> optimization bugs that Lucene discovered). The checksums will often catch
> those bugs, too.
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Michael McCandless [mailto:lucene@mikemccandless.com]
> > Sent: Tuesday, December 6, 2016 12:30 PM
> > To: Duke DAI <duke.dai.007@gmail.com>
> > Cc: Lucene Users <java-user@lucene.apache.org>
> > Subject: Re: Hardcoded checksum mechanism in BlockTreeTermsReader
> >
> > I see.  Bits can also be flipped by the network as they are travelling
> > to/from the DB.  The end to end checksum Lucene does now would catch
> > that.
> >
> > Anyway, that BlockTree index file that is being entirely checksummed
> > is a very small file.  And, using the first pattern is not easy for it
> > because it needs to seek to the end to load its directory location,
> > and then seek back to that location to read each field's information.
> > Do you see a simple way to change it to the first pattern?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Tue, Dec 6, 2016 at 6:00 AM, Duke DAI <duke.dai.007@gmail.com> wrote:
> > > Thanks for your quick response, Mike.
> > >
> > > The database has its own raw page management on top of the OS page
> > > management, and most likely has its own page-level checksum. That's why
> > > I want to avoid a checksum at the Lucene Directory level.
> > >
> > > Certainly a checksum is good. I like this pattern (rewrite
> > > openChecksumInput according to the real case):
> > > inputStream = directory.openChecksumInput(...);
> > > // at the end, check the checksum as a by-product
> > > CodecUtil.checkFooter(...)
> > >
> > > But I do not like this pattern:
> > > CodecUtil.checksumEntireFile(..), whose purpose is purely to checksum by
> > > reading all the data, not as a by-product.
> > > If the design/API were pluggable with a default, it would be good enough
> > > for various scenarios.
> > >
> > >
> > >
> > >
> > > Best regards,
> > > Duke
> > > If not now, when? If not me, who?
> > >
> > > On Tue, Dec 6, 2016 at 6:36 PM, Michael McCandless
> > > <lucene@mikemccandless.com> wrote:
> > >>
> > >> We have learned over time not to trust the underlying store to
> > >> correctly record the bytes we wrote to it.
> > >>
> > >> This is why checksumming is very strongly built into Lucene at this
> > >> point.  If you disable checksumming, when bits do flip, you get exotic
> > >> exceptions at search time that might look like Lucene bugs and can
> > >> cost a lot of time to explain.
> > >>
> > >> It's not just the BlockTreeTermsReader; many other codec components
> > >> check the checksum with CodecUtil.checkFooter at search time.
> > >>
> > >> Can you explain why it's necessary to remove it for your database
> > >> files based Directory?
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >>
> > >> On Tue, Dec 6, 2016 at 5:25 AM, Duke DAI <duke.dai.007@gmail.com> wrote:
> > >> > Hi all,
> > >> >
> > >> > I'm customizing a Lucene Directory, which extends o.a.l.store.Directory
> > >> > and is based on database files. I do not need a second checksum on
> > >> > IndexInput and IndexOutput.
> > >> >
> > >> > But in the BlockTreeTermsReader constructor, the following code opens a
> > >> > hard-coded BufferedChecksumIndexInput to checksum the raw IndexInput. I
> > >> > have to use CRC32 on IndexOutput to get through it. Is there a more
> > >> > graceful way to do the checksum, such as letting Directory construct a
> > >> > checksum instance instead of the API Directory.openChecksumInput?
> > >> >
> > >> >
> > >> >       String indexName = IndexFileNames.segmentFileName(segment, state.segmentSuffix, TERMS_INDEX_EXTENSION);
> > >> >       indexIn = state.directory.openInput(indexName, state.context);
> > >> >       CodecUtil.checkIndexHeader(indexIn, TERMS_INDEX_CODEC_NAME, version, version, state.segmentInfo.getId(), state.segmentSuffix);
> > >> >       CodecUtil.checksumEntireFile(indexIn);
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > Best regards,
> > >> > Duke
> > >> > If not now, when? If not me, who?
> > >
> > >
> >
>
>
