lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Lu" <chris...@gmail.com>
Subject Re: duplication checking while indexing
Date Tue, 30 Dec 2008 09:13:16 GMT
JDBM is surely a better way than in memory hash map.
But I feel since all previous documents are already in the index, although
not closed yet, there should be a way to read all previous terms.
It's ok to use additional data structure, like JDBM or hash map, to
duplicate the terms, in order to look it up. But it doesn't feel right.

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Mon, Dec 29, 2008 at 7:55 PM, liu Ivan <ialy2005@gmail.com> wrote:

> I use JDBM store document's key ID.
>
> 2008/12/30 Chris Lu <chris.lu@gmail.com>
>
> > Otis, thanks for the pointer.
> > I think the question can be:
> >
> >  How to access TermEnum or TermInfos during indexing.
> >
> > If this is possible, things would be easier.
> >
> > --
> > Chris Lu
> > -------------------------
> > Instant Scalable Full-Text Search On Any Database/Application
> > site: http://www.dbsight.net
> > demo: http://search.dbsight.com
> > Lucene Database Search in 3 minutes:
> >
> >
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> > DBSight customer, a shopping comparison site, (anonymous per request) got
> > 2.6 Million Euro funding!
> >
> >  On Mon, Dec 29, 2008 at 10:41 AM, Otis Gospodnetic <
> > otis_gospodnetic@yahoo.com> wrote:
> >
> > > Chris,
> > >
> > > Mark Miller & Co. are working on (Near) Duplicate Detection.  I think
> the
> > > work is in Solr's JIRA, but some of it might be applicable to Lucene.
> > >
> > >  Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > >
> > >
> > > ----- Original Message ----
> > > > From: Chris Lu <chris.lu@gmail.com>
> > > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > > > Sent: Monday, December 29, 2008 4:55:14 AM
> > > > Subject: duplication checking while indexing
> > > >
> > > > I am wondering whether there is an easy way to avoid duplication
> while
> > > > indexing, just using the index being created, without creating other
> > data
> > > > structures.
> > > > In some cases, the incoming document list can have duplicates. For
> > > example,
> > > > when creating spell checking indexes for phrases. Each phrase is one
> > > > document. So I want to check whether the phrase is already indexed or
> > > not.
> > > >
> > > > To do so, I can either create a hash map for all the indexed phrases.
> > But
> > > > the hash map would consume a lot of memory.
> > > > A possible alternative is to search existing index. But remember the
> > > index
> > > > is being created, and not all contents are flushed to disk yet.
> > > >
> > > > Is it possible to query the not-yet-closed index?
> > > >
> > > > --
> > > > Chris Lu
> > > > -------------------------
> > > > Instant Scalable Full-Text Search On Any Database/Application
> > > > site: http://www.dbsight.net
> > > > demo: http://search.dbsight.com
> > > > Lucene Database Search in 3 minutes:
> > > >
> > >
> >
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> > > > DBSight customer, a shopping comparison site, (anonymous per request)
> > got
> > > > 2.6 Million Euro funding!
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
>
> --
> 刘平华
>
>
> 广州从兴电子开发有限公司
>
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
>
>
> *:ialy2005 at gmail.com
>
> +:广州市广州大道南368号12F   510300
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>
> 网站:http://www.searchfull.net
> Blog:http://www.searchfull.net/blog/
>
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message