lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Lu" <chris...@gmail.com>
Subject Re: duplication checking while indexing
Date Mon, 29 Dec 2008 23:40:54 GMT
Otis, thanks for the pointer.
I think the question can be:

  How to access TermEnum or TermInfos during indexing.

If this is possible, things would be easier.

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Mon, Dec 29, 2008 at 10:41 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Chris,
>
> Mark Miller & Co. are working on (Near) Duplicate Detection.  I think the
> work is in Solr's JIRA, but some of it might be applicable to Lucene.
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Chris Lu <chris.lu@gmail.com>
> > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > Sent: Monday, December 29, 2008 4:55:14 AM
> > Subject: duplication checking while indexing
> >
> > I am wondering whether there is an easy way to avoid duplication while
> > indexing, just using the index being created, without creating other data
> > structures.
> > In some cases, the incoming document list can have duplicates. For
> example,
> > when creating spell checking indexes for phrases. Each phrase is one
> > document. So I want to check whether the phrase is already indexed or
> not.
> >
> > To do so, I can either create a hash map for all the indexed phrases. But
> > the hash map would consume a lot of memory.
> > A possible alternative is to search existing index. But remember the
> index
> > is being created, and not all contents are flushed to disk yet.
> >
> > Is it possible to query the not-yet-closed index?
> >
> > --
> > Chris Lu
> > -------------------------
> > Instant Scalable Full-Text Search On Any Database/Application
> > site: http://www.dbsight.net
> > demo: http://search.dbsight.com
> > Lucene Database Search in 3 minutes:
> >
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> > DBSight customer, a shopping comparison site, (anonymous per request) got
> > 2.6 Million Euro funding!
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message