Message-ID: <88c6a67205061215577c9955ae@mail.gmail.com>
Date: Sun, 12 Jun 2005 17:57:15 -0500
From: Chris Lamprecht <clamprecht@gmail.com>
Reply-To: Chris Lamprecht <clamprecht@gmail.com>
To: java-user@lucene.apache.org, Dave Kor <s0454888@sms.ed.ac.uk>
Subject: Re: Ideas Needed - Finding Duplicate Documents
In-Reply-To: <1118612055.42acaa57af6bd@sms.ed.ac.uk>
References: <42ABFE88.8050206@axtelsoft.com>
	 <1118587070.42ac48be751be@sms.ed.ac.uk>
	 <88c6a67205061211465fc2bd23@mail.gmail.com>
	 <1118612055.42acaa57af6bd@sms.ed.ac.uk>

I'd have to see your indexing code to tell whether there are any
obvious performance gotchas there.  If you can run your indexer under
a profiler (OptimizeIt, JProbe, or just the free one that comes with
java, enabled with -Xprof), it will tell you which methods most of
your CPU time is spent in.  If you're using StandardAnalyzer, that may
be the culprit -- StandardAnalyzer is a fairly advanced grammar-based
parser, but it is pretty slow.  If you don't need its functionality,
try a simpler Analyzer (like WhitespaceAnalyzer or a subclass).

As for changing a document within an index -- there is no "update"
operation for documents; there's just delete and add (and then
optimize).  So you can change an existing index, just not a document
in place.  Delete only marks docs as deleted (so they don't come back
in search results); they aren't physically removed from the index
files until their segments get merged, which optimize forces.

Also, it isn't fatal that your current index doesn't have MD5 info in
it.  It's pretty fast to compute MD5 at search time for each document
returned (much faster than the I/O-bound part -- actually retrieving
the docs from the Lucene index).  So you could try just doing all your
duplicate detection at search time.  If this is too slow, you could
consider caching the computed MD5 for your docs.
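
Here's a rough sketch of what I mean, in plain Java with no Lucene
calls in it.  The class name, the method names, and the idea of keying
the cache on an int doc id are all made up for illustration; you'd
feed it the sentence text of each hit in whatever way your search code
already retrieves it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: filters exact-duplicate
// strings out of a list of search results by comparing MD5 digests.
public class DuplicateFilter {

    /** Returns the MD5 digest of the text as 32 lowercase hex characters. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /** Looks up the digest for a doc id, computing it only on a cache miss. */
    static String cachedMd5(Map<Integer, String> cache, int docId, String text) {
        return cache.computeIfAbsent(docId, id -> md5Hex(text));
    }

    /** Keeps the first occurrence of each distinct text, in result order. */
    static List<String> dropDuplicates(List<String> hitTexts) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String text : hitTexts) {
            // Set.add returns false if this digest was already seen.
            if (seen.add(md5Hex(text))) {
                unique.add(text);
            }
        }
        return unique;
    }
}
```

Since your duplicates are exact copies of the same string, comparing
digests is enough.  If you go the caching route, a HashMap passed to
cachedMd5 would do for a start, as long as the doc ids you key on stay
stable while the cache is alive.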

-chris

On 6/12/05, Dave Kor <s0454888@sms.ed.ac.uk> wrote:
> Thanks for the quick reply, Chris.
>
> Yes, when I say "duplicate" sentences, they are exact copies of the
> same string.
>
> The MD5 hash is a good idea, I wish I had thought of it earlier as it
> would have saved me a lot of trouble. Right now it is not feasible to
> reindex because indexing is a very slow and CPU-intensive task for
> me. I'm adding part-of-speech, chunk, named entity and coreference
> information as I index, which means it takes 4 separate servers and
> 4-5 days of processing to create a new index. And as far as I know,
> you can't change the index once it's created. Am I correct?
>
> Any other ideas that don't require me to re-index the whole thing?
