DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=34629>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
------- Additional Comments From nicoo_@hotmail.com 2005-04-28 16:04 -------
(In reply to comment #2)
> Nicolas, thanks for the contribution! I took a quick look at the ZIP file.
> Would it be possible for you to describe (here and/or in the Javadocs) how these
> 12+ classes work to provide Document update functionality?
The goal of this contribution is to overwrite only the files that contain the
term posting lists (.tis, .tii, .frq, etc.).
In the Lucene API, the term posting lists are accessible through the
IndexReader.terms() method (enumerates all the terms) and the
IndexReader.termPositions() method (for a specific term, enumerates each tuple
<doc number, freq, <position>^freq>).
So, if I modify the output of these two methods (add new terms, delete relations
between documents and terms, etc.) and write the result back into the Lucene
index, I create a new Lucene term posting list. That's what this contribution
does!
To do this, I created an interface called TermProducter containing these two
methods (terms() and termPositions()). A class implementing this interface has
to produce these two kinds of output, so it produces the posting lists. For
example, an IndexReader could implement this interface, but you can also create
your own term posting list producer, or create a TermProducter that modifies the
output of the original IndexReader.
Then the TermWriter class, which takes a TermProducter and a Lucene index as
input, rewrites the term posting lists of the index with the content of the
TermProducter.
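
To make the producer/writer idea concrete, here is a minimal in-memory sketch.
It does not use the classes from the ZIP file or any Lucene class:
SimpleTermProducer, InMemoryProducer and write() are hypothetical stand-ins, a
term is simplified to a "field:text" string, and a posting list is modelled as
a sorted map from document number to positions. It only illustrates that a
writer needs nothing more than the two enumeration methods.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class PostingListSketch {

    /** Hypothetical, simplified stand-in for TermProducter. */
    interface SimpleTermProducer {
        SortedSet<String> terms();                                    // all terms, sorted
        SortedMap<Integer, List<Integer>> termPositions(String term); // doc -> positions
    }

    /** Producer backed by an in-memory map; a real one could wrap an IndexReader. */
    static final class InMemoryProducer implements SimpleTermProducer {
        private final SortedMap<String, SortedMap<Integer, List<Integer>>> data = new TreeMap<>();

        void add(String term, int doc, Integer... positions) {
            data.computeIfAbsent(term, k -> new TreeMap<>()).put(doc, Arrays.asList(positions));
        }

        @Override
        public SortedSet<String> terms() {
            return new TreeSet<>(data.keySet());
        }

        @Override
        public SortedMap<Integer, List<Integer>> termPositions(String term) {
            return data.getOrDefault(term, new TreeMap<>());
        }
    }

    /** Stand-in for TermWriter: rebuilds the posting lists from the producer's output. */
    static SortedMap<String, SortedMap<Integer, List<Integer>>> write(SimpleTermProducer producer) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> index = new TreeMap<>();
        for (String term : producer.terms()) {
            index.put(term, new TreeMap<>(producer.termPositions(term)));
        }
        return index;
    }

    public static void main(String[] args) {
        InMemoryProducer producer = new InMemoryProducer();
        producer.add("body:lucene", 0, 3, 17); // term occurs in doc 0 at positions 3 and 17
        producer.add("body:index", 1, 5);
        System.out.println(write(producer));
    }
}
```

The frequency of a term in a document is simply the size of its position list
in this model, which matches the <doc number, freq, <position>^freq> tuple
described above.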
So now the questions are: how can I modify the term posting lists, and what are
my tools?
There are two types of tools: TermGenerator and TermTransformer.
* The TermGenerator interface. It generates a TermProducter instance. Its goal
is to create a new posting list. The interface is simple:
    public TermProducter createTermProducter();
There are two proposed implementations:
- TermReader. An IndexReader wrapper implementing TermProducter.
- TermAdder. You can create your own posting list by adding term/document
relations. It's like a virtual index.
* The TermTransformer interface. It modifies the output of a TermProducter. The
interface is:
    public TermProducter transform(TermProducter producter);
There are two proposed implementations:
- TermFilter. Filters out some term/document relations.
- TermReplacer. Replaces some term/document relations with other relations.
* There is also a special TermProducter implementation called TermMerger. It
merges several TermProducters (useful). Its methods are:
    void add(TermProducter producter)
    terms()
    termPositions()
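
The transformer idea can be sketched as a decorator: it takes a producer and
returns a new producer whose output is computed from the original one. The
code below is only an illustration under simplifying assumptions. Producer,
Transformer and dropping() are hypothetical names, not the contributed
classes, and the whole posting list is exposed as one map (term -> doc ->
positions) instead of two enumerations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.BiPredicate;

public class TermTransformerSketch {

    /** Hypothetical simplified producer: term -> (doc number -> positions). */
    interface Producer {
        SortedMap<String, SortedMap<Integer, List<Integer>>> postings();
    }

    /** Mirrors the TermTransformer idea: take a producer, return a modified one. */
    interface Transformer {
        Producer transform(Producer source);
    }

    /** In the spirit of TermFilter: drops every term/doc relation matching the predicate. */
    static Transformer dropping(BiPredicate<String, Integer> shouldDrop) {
        return source -> () -> {
            SortedMap<String, SortedMap<Integer, List<Integer>>> out = new TreeMap<>();
            source.postings().forEach((term, docs) -> {
                SortedMap<Integer, List<Integer>> kept = new TreeMap<>(docs);
                kept.keySet().removeIf(doc -> shouldDrop.test(term, doc));
                if (!kept.isEmpty()) {
                    out.put(term, kept); // a term left without documents disappears
                }
            });
            return out;
        };
    }

    /** Small sample posting list used by main() below. */
    static Producer sample() {
        SortedMap<String, SortedMap<Integer, List<Integer>>> data = new TreeMap<>();
        data.computeIfAbsent("tag:draft", k -> new TreeMap<>()).put(0, Arrays.asList(0));
        data.computeIfAbsent("tag:draft", k -> new TreeMap<>()).put(1, Arrays.asList(2));
        data.computeIfAbsent("tag:final", k -> new TreeMap<>()).put(1, Arrays.asList(4));
        return () -> data;
    }

    public static void main(String[] args) {
        // Remove the relation between "tag:draft" and document 0 only.
        Transformer filter = dropping((term, doc) -> term.equals("tag:draft") && doc == 0);
        System.out.println(filter.transform(sample()).postings());
    }
}
```

Because the transformed producer recomputes its output on every call, several
transformers can be chained without touching the source producer.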
Now we can play by combining these tools to create a kind of pipeline. For
example, an update process:

(1) TermReader ----> (2) TermFilter ----> (4) TermMerger ( ----> (5) TermWriter )
                                              |
(3) TermAdder  ---->--------------------------+

1 - We read the Lucene posting list.
2 - We delete some terms.
3 - We add new terms.
4 - We merge the two TermProducters to create the final TermProducter.
5 - We write the TermProducter information into the Lucene index.
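
The five steps above can be sketched on a simplified in-memory model. This is
not the contributed code: Postings, withoutTerm() and merge() are hypothetical
stand-ins for TermReader/TermFilter, TermAdder and TermMerger, and the sketch
assumes that when both inputs contain the same term/document pair, the second
input wins.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class UpdatePipelineSketch {

    /** Simplified posting lists: term -> (doc number -> positions). */
    static final class Postings extends TreeMap<String, TreeMap<Integer, List<Integer>>> {
        void add(String term, int doc, Integer... positions) {
            computeIfAbsent(term, k -> new TreeMap<>()).put(doc, Arrays.asList(positions));
        }
    }

    /** Steps 1-2: read the existing postings and drop one term. */
    static Postings withoutTerm(Postings source, String term) {
        Postings out = new Postings();
        source.forEach((t, docs) -> {
            if (!t.equals(term)) {
                out.put(t, new TreeMap<>(docs));
            }
        });
        return out;
    }

    /** Step 4: merge two posting lists; on a conflict the second input wins (assumption). */
    static Postings merge(Postings first, Postings second) {
        Postings out = new Postings();
        for (Postings p : Arrays.asList(first, second)) {
            p.forEach((t, docs) -> out.computeIfAbsent(t, k -> new TreeMap<>()).putAll(docs));
        }
        return out;
    }

    public static void main(String[] args) {
        Postings index = new Postings();          // what step 1 would read
        index.add("title:old", 0, 0);
        index.add("body:lucene", 0, 4);

        Postings added = new Postings();          // step 3: the new terms
        added.add("title:new", 0, 0);

        // Steps 2 and 4; step 5 would write `result` back to the index.
        Postings result = merge(withoutTerm(index, "title:old"), added);
        System.out.println(result);
    }
}
```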
This design is flexible: if I just want to replace terms, I can use this
simpler, optimized process:

(1) TermReader ----> (2) TermReplacer ( ----> TermWriter )

So you can create your own pipeline of term transformations.
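
As an illustration of the replace-only pipeline, here is a small sketch of
what a term replacement does to the posting lists. The replaceTerm() helper
and the map-based model are assumptions made for this example and not the
TermReplacer API. If the target term already exists, the document lists are
combined, and for a document present under both terms the positions of the
term processed last win.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class TermReplaceSketch {

    /** Moves every document of term `from` under term `to`; other terms are copied as-is. */
    static TreeMap<String, TreeMap<Integer, List<Integer>>> replaceTerm(
            TreeMap<String, TreeMap<Integer, List<Integer>>> postings, String from, String to) {
        TreeMap<String, TreeMap<Integer, List<Integer>>> out = new TreeMap<>();
        postings.forEach((term, docs) -> {
            String target = term.equals(from) ? to : term;
            out.computeIfAbsent(target, k -> new TreeMap<>()).putAll(docs);
        });
        return out;
    }

    /** Sample postings: a misspelled author term in doc 0, the correct one in doc 1. */
    static TreeMap<String, TreeMap<Integer, List<Integer>>> sample() {
        TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();
        postings.computeIfAbsent("author:jon", k -> new TreeMap<>()).put(0, Arrays.asList(0));
        postings.computeIfAbsent("author:john", k -> new TreeMap<>()).put(1, Arrays.asList(0));
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(replaceTerm(sample(), "author:jon", "author:john"));
    }
}
```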
--- A COMPLETE EXAMPLE ---
Use case: I have to delete a term in several documents.
1 - I have to know the Lucene document numbers of these documents.
The main class is IndexUpdater. It contains a TermWriter and lets me select
the desired documents.
So I must create an instance (reader is an IndexReader):
    IndexUpdater updater = new IndexUpdater(reader);
After this, I can execute a Lucene query to select all the desired documents:
    DocumentSelection docsel = updater.selectDoc(query);
OK, now I have a DocumentSelection instance, which lets a
TermGenerator/TermTransformer know whether a document is selected, and so in
which documents to delete the term.
2 - I delete their relations with the desired term.
So now I create a TermFilter and delete the term in the selected documents.
    TermFilter filter = new TermFilter();
    filter.deleteTerm(new Term("field", "deletedvalue"), docsel);
3 - Now I create a pipeline like this:
TermReader ----> TermFilter ( ----> TermWriter )
IndexUpdater has a method that creates a TermReader for the Lucene index.
    TermReader reader = updater.getTermReader();
    TermProducter finalProducter = filter.transform(reader.createTermProducter());
    updater.setTermProducter(finalProducter);
4 - I close the updater, which writes the new posting lists into the index.
    updater.close();
OK, is it clear?
PS: 1 - Sorry for my English. 2 - I know this contribution is not perfect
(class names, design, implementation) and can certainly be improved, but it's a
first step towards easy updates of the posting lists, something Lucene lacks.