lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "liu Ivan" <ialy2...@gmail.com>
Subject Re: duplication checking while indexing
Date Tue, 30 Dec 2008 03:55:02 GMT
I use JDBM store document's key ID.

2008/12/30 Chris Lu <chris.lu@gmail.com>

> Otis, thanks for the pointer.
> I think the question can be:
>
>  How to access TermEnum or TermInfos during indexing.
>
> If this is possible, things would be easier.
>
> --
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
>
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
>  On Mon, Dec 29, 2008 at 10:41 AM, Otis Gospodnetic <
> otis_gospodnetic@yahoo.com> wrote:
>
> > Chris,
> >
> > Mark Miller & Co. are working on (Near) Duplicate Detection.  I think the
> > work is in Solr's JIRA, but some of it might be applicable to Lucene.
> >
> >  Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Chris Lu <chris.lu@gmail.com>
> > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > > Sent: Monday, December 29, 2008 4:55:14 AM
> > > Subject: duplication checking while indexing
> > >
> > > I am wondering whether there is an easy way to avoid duplication while
> > > indexing, just using the index being created, without creating other
> data
> > > structures.
> > > In some cases, the incoming document list can have duplicates. For
> > example,
> > > when creating spell checking indexes for phrases. Each phrase is one
> > > document. So I want to check whether the phrase is already indexed or
> > not.
> > >
> > > To do so, I can either create a hash map for all the indexed phrases.
> But
> > > the hash map would consume a lot of memory.
> > > A possible alternative is to search existing index. But remember the
> > index
> > > is being created, and not all contents are flushed to disk yet.
> > >
> > > Is it possible to query the not-yet-closed index?
> > >
> > > --
> > > Chris Lu
> > > -------------------------
> > > Instant Scalable Full-Text Search On Any Database/Application
> > > site: http://www.dbsight.net
> > > demo: http://search.dbsight.com
> > > Lucene Database Search in 3 minutes:
> > >
> >
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> > > DBSight customer, a shopping comparison site, (anonymous per request)
> got
> > > 2.6 Million Euro funding!
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>



-- 
刘平华


广州从兴电子开发有限公司

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::



*:ialy2005 at gmail.com

+:广州市广州大道南368号12F   510300

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

网站:http://www.searchfull.net
Blog:http://www.searchfull.net/blog/
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message