lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 34629] - [PATCH] Document update contrib (Play with term postings or .. to a easy way to update)
Date Thu, 28 Apr 2005 14:04:42 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=34629>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.


------- Additional Comments From nicoo_@hotmail.com  2005-04-28 16:04 -------
(In reply to comment #2)
> Nicolas, thanks for the contribution!  I took a quick look at the ZIP file. 
> Would it be possible for you to describe (here and/or in the Javadocs) how these
> 12+ classes work to provide Document update functionality?


The goal of this contribution is to overwrite only the files that hold the
term posting lists (.tis, .tii, .frq, etc.).
In the Lucene API, the term posting lists are accessible through two methods:
IndexReader.terms() (enumerates all the terms) and IndexReader.termPositions()
(for a specific term, enumerates each entry <doc number, freq, <position>^freq>).
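
To make the shape of a posting list concrete, here is a minimal, self-contained
Java sketch of what these two methods expose: for each term, the documents that
contain it, with a frequency and a list of positions per document. It
deliberately does not use Lucene. ToyPostings and its methods are hypothetical
names made up for this illustration, and the real IndexReader methods return
enumeration objects rather than lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy in-memory posting list: term -> (doc number -> positions). Not Lucene code. */
class ToyPostings {
    private final SortedMap<String, SortedMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Records one occurrence of a term in a document at a given position. */
    void add(String term, int doc, int position) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(doc, d -> new ArrayList<>())
                .add(position);
    }

    /** Enumerates all terms in sorted order. */
    List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    /** For one term, lists "doc:freq:[positions]" entries in doc-number order. */
    List<String> termPositions(String term) {
        List<String> out = new ArrayList<>();
        SortedMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) {
            return out; // unknown term: no entries
        }
        for (Map.Entry<Integer, List<Integer>> e : docs.entrySet()) {
            // freq is simply the number of recorded positions
            out.add(e.getKey() + ":" + e.getValue().size() + ":" + e.getValue());
        }
        return out;
    }
}
```

The frequency is not stored separately in this sketch; it is derived from the
number of positions, which matches the <position>^freq notation above.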

So, if I modify the output of these 2 methods (add new terms, delete relations
between documents and terms, etc.) and write the result back into the Lucene index, I
create a new Lucene term posting list. That's what this contribution does!

To do this, I created an interface called TermProducter containing these 2 methods
(terms() and termPositions()). A class implementing this interface has to
produce these 2 kinds of output (so it produces the posting lists). For example, an
IndexReader could implement this interface, but you can also create your own
term posting list producer, or create a TermProducter that modifies the content
of the original IndexReader output.

Then, with the TermWriter class, which takes a TermProducter and a Lucene
index as input, you can rewrite the Lucene term posting lists with the content of the
TermProducter.
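
The producer/writer split can be sketched roughly as follows. This is only an
illustration under stated assumptions: ToyTermProducter, MapProducter and
ToyTermWriter are hypothetical, simplified stand-ins, the signatures of the
real classes in the ZIP are probably different, and the "index" here is just an
in-memory map instead of the .tis/.tii/.frq files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical, simplified stand-in for the TermProducter idea. */
interface ToyTermProducter {
    /** All terms, in sorted order. */
    SortedSet<String> terms();

    /** For one term: doc number -> positions (frequency is the list size). */
    SortedMap<Integer, List<Integer>> termPositions(String term);
}

/** A producer backed by an in-memory map, playing the part of a wrapped index reader. */
class MapProducter implements ToyTermProducter {
    private final SortedMap<String, SortedMap<Integer, List<Integer>>> data = new TreeMap<>();

    /** Adds positions for a term in a document; returns this for chaining. */
    MapProducter add(String term, int doc, int... positions) {
        List<Integer> list = data.computeIfAbsent(term, t -> new TreeMap<>())
                                 .computeIfAbsent(doc, d -> new ArrayList<>());
        for (int p : positions) {
            list.add(p);
        }
        return this;
    }

    public SortedSet<String> terms() {
        return new TreeSet<>(data.keySet());
    }

    public SortedMap<Integer, List<Integer>> termPositions(String term) {
        return data.getOrDefault(term, new TreeMap<>());
    }
}

/** Plays the TermWriter role: drains any producer into a fresh posting structure. */
class ToyTermWriter {
    static SortedMap<String, SortedMap<Integer, List<Integer>>> write(ToyTermProducter source) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> target = new TreeMap<>();
        for (String term : source.terms()) {
            SortedMap<Integer, List<Integer>> docs = new TreeMap<>();
            for (Map.Entry<Integer, List<Integer>> e : source.termPositions(term).entrySet()) {
                docs.put(e.getKey(), new ArrayList<>(e.getValue())); // defensive copy
            }
            if (!docs.isEmpty()) {
                target.put(term, docs); // skip terms with no documents left
            }
        }
        return target;
    }
}
```

The writer only depends on the interface, so it does not care whether the
producer reads an existing index, builds postings from scratch, or modifies
another producer's output.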

So now the question is: how can I modify the term posting list? What are
my tools?
You have 2 types of tools: TermGenerator and TermTransformer.
 * The TermGenerator interface. It generates a TermProducter instance. Its goal
is to create a new posting list. The interface is simple:
	public TermProducter createTermProducter();

 There are 2 proposed implementations:
   - TermReader. An IndexReader wrapper implementing TermProducter.
   - TermAdder. You can create your own posting list by adding term/document
relations. It's like a virtual index.


* The TermTransformer interface. It modifies the output of a TermProducter. The
interface is:
	public TermProducter transform(TermProducter producter);

There are 2 proposed implementations:
 - TermFilter. Filters out some term/doc relations.
 - TermReplacer. Replaces some term/doc relations with other relations.
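
As an illustration of the transformer idea, here is a small sketch of a
replacer that renames one term to another, but only in selected documents. It
is an assumption-laden toy: ToyTransformer and ToyTermReplacer are hypothetical
names, the postings are a plain in-memory map (term -> doc number ->
positions), and the real TermReplacer may behave differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical transformer shape: takes postings, returns new postings. */
interface ToyTransformer {
    SortedMap<String, SortedMap<Integer, List<Integer>>> transform(
            SortedMap<String, SortedMap<Integer, List<Integer>>> in);
}

/** Renames one term to another, but only in the selected documents. */
class ToyTermReplacer implements ToyTransformer {
    private final String from;
    private final String to;
    private final Set<Integer> selectedDocs;

    ToyTermReplacer(String from, String to, Set<Integer> selectedDocs) {
        this.from = from;
        this.to = to;
        this.selectedDocs = selectedDocs;
    }

    public SortedMap<String, SortedMap<Integer, List<Integer>>> transform(
            SortedMap<String, SortedMap<Integer, List<Integer>>> in) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> out = new TreeMap<>();
        for (Map.Entry<String, SortedMap<Integer, List<Integer>>> t : in.entrySet()) {
            for (Map.Entry<Integer, List<Integer>> d : t.getValue().entrySet()) {
                boolean replace = t.getKey().equals(from) && selectedDocs.contains(d.getKey());
                String term = replace ? to : t.getKey();
                List<Integer> positions = out.computeIfAbsent(term, k -> new TreeMap<>())
                                             .computeIfAbsent(d.getKey(), k -> new ArrayList<>());
                positions.addAll(d.getValue());
                // Keep positions ordered when the new term already occurs in this document.
                Collections.sort(positions);
            }
        }
        return out; // the input map is left untouched
    }
}
```

Because the transformer returns a new structure instead of editing its input,
several transformers can be chained without affecting the original postings.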


* There is also a special TermProducter implementation called TermMerger. It
merges several TermProducters (useful):
	void add(TermProducter producter)
	terms()
	termPositions()



Now we can play by combining these and create a kind of pipeline. For example, an
update process:

(1) TermReader----> (2) TermFilter ----> (4) TermMerger (-----> (5) TermWriter )
                                                |
                         (3) TermAdder --->-----+
			      
1 - we read the Lucene posting list
2 - we delete some terms
3 - we add new terms
4 - we merge the 2 TermProducters to create the final TermProducter
5 - we write the TermProducter information into the Lucene index.
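
The five steps above can be sketched end to end with plain Java collections.
This is a simplified model, not the contribution's code: each posting is
flattened into one Occurrence (term, doc, position), the Occurrence and
UpdatePipelineSketch names are hypothetical, and step (5) is left out because
writing real index files needs Lucene itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** One occurrence of a term in a document; a flat stand-in for a posting entry. */
final class Occurrence {
    final String term;
    final int doc;
    final int position;

    Occurrence(String term, int doc, int position) {
        this.term = term;
        this.doc = doc;
        this.position = position;
    }

    @Override
    public String toString() {
        return term + "@" + doc + ":" + position;
    }
}

class UpdatePipelineSketch {
    /** (2) Filter step: drop a term from the selected documents only. */
    static List<Occurrence> filter(List<Occurrence> in, String term, Set<Integer> docs) {
        return in.stream()
                 .filter(o -> !(o.term.equals(term) && docs.contains(o.doc)))
                 .collect(Collectors.toList());
    }

    /** (4) Merge step: combine two outputs in (term, doc, position) order. */
    static List<Occurrence> merge(List<Occurrence> a, List<Occurrence> b) {
        return Stream.concat(a.stream(), b.stream())
                     .sorted(Comparator.comparing((Occurrence o) -> o.term)
                                       .thenComparingInt(o -> o.doc)
                                       .thenComparingInt(o -> o.position))
                     .collect(Collectors.toList());
    }

    /** Runs steps (1) to (4); the result is what step (5) would write out. */
    static List<Occurrence> run() {
        // (1) What a reader over the existing index would produce.
        List<Occurrence> existing = Arrays.asList(
                new Occurrence("draft", 0, 3),
                new Occurrence("draft", 1, 0),
                new Occurrence("report", 0, 1));
        // (2) Delete "draft" from document 0 only.
        List<Occurrence> filtered = filter(existing, "draft", Collections.singleton(0));
        // (3) New term/document relations to add.
        List<Occurrence> added = Arrays.asList(new Occurrence("final", 0, 3));
        // (4) Merge both outputs into the final one.
        return merge(filtered, added);
    }
}
```

In this sketch, document 0 ends up with "final" in place of "draft", while
document 1 keeps its "draft" entry, which is the per-document update the
pipeline is meant to allow.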


This design allows flexibility: if I just want to replace terms, I can use
this simpler, optimized process:
(1) TermReader----> (2) TermReplacer (---->TermWriter )

So you can create your own pipeline of term transformations.


--- A COMPLETE EXAMPLE ---
Use case: I have to delete a term in several documents.

1 - I have to know all the Lucene document numbers.

The main class is IndexUpdater. It contains a TermWriter and allows you to select
the desired docs.
So I must create an instance.

     IndexUpdater updater = new IndexUpdater(indexReader);   // indexReader: an open IndexReader

After this, I can execute a Lucene query to select all the desired documents:

     DocumentSelection docsel = updater.selectDoc(query);   // query: a Lucene Query

OK, now I have a DocumentSelection instance, which lets a
TermGenerator/TermTransformer know which documents are selected, and therefore
where to delete the terms.
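
One simple way to picture such a selection is one bit per document number. The
sketch below is only a guess at the idea, not the actual DocumentSelection
class: ToyDocumentSelection is a hypothetical name, and a string predicate
stands in for a real Lucene Query.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for DocumentSelection: one bit per document number. */
class ToyDocumentSelection {
    private final BitSet selected = new BitSet();

    /** Selects every document whose text matches the predicate (a stand-in for a query). */
    static ToyDocumentSelection select(List<String> docs, Predicate<String> query) {
        ToyDocumentSelection sel = new ToyDocumentSelection();
        for (int doc = 0; doc < docs.size(); doc++) {
            if (query.test(docs.get(doc))) {
                sel.selected.set(doc); // the list index plays the role of the doc number
            }
        }
        return sel;
    }

    /** True if the given document number is part of the selection. */
    boolean isSelected(int doc) {
        return selected.get(doc);
    }

    /** Number of selected documents. */
    int size() {
        return selected.cardinality();
    }
}
```

A filter or replacer then only needs isSelected(doc) to decide whether a given
term/doc relation should be touched.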


2 - Delete their relations with the desired terms.
So now I create a TermFilter and delete the term in the selected documents.

TermFilter filter = new TermFilter();
filter.deleteTerm(new Term("field", "deletedvalue"), docsel);


3 - Now I create a pipeline like this: TermReader----> TermFilter (
---->TermWriter )

There is a method in IndexUpdater that creates a TermReader for the Lucene index.

TermReader reader = updater.getTermReader();

TermProducter finalProducter = filter.transform(reader.createTermProducter());

updater.setTermProducter(finalProducter);

4 - I close the updater, which writes the new posting lists to the index.
updater.close();




OK, is it clear?


PS: 1 - sorry for my English; 2 - I know this contribution is not perfect (class
names, design, implementation) and can certainly be improved, but it's a first
step toward an easy way to update the posting lists, something Lucene lacks.


-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

