lucene-dev mailing list archives

From "Uwe Schindler" <>
Subject RE: who clears attributes?
Date Tue, 11 Aug 2009 22:14:44 GMT
Hi DM,


It is not public at the moment and is still in development. I can publish the
XML tokenizer when it is finished.


In general it shows one possible use case for custom attributes. Maybe we
will get something like this in the future: just tag all tokens with the field
name (using a FieldNameAttribute), and the Document/Indexer can automatically
create the fields?
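To make that idea concrete, here is a plain-Java sketch (not Lucene's actual API; the Token class, the FieldNameAttribute-style field, and the route method are all hypothetical names for illustration) of how an indexer could route tokens to fields automatically when each token carries a field name:

```java
// Hypothetical sketch: each token carries a field name alongside its term
// (standing in for a FieldNameAttribute), so the indexer can route it.
import java.util.ArrayList;
import java.util.List;

public class FieldNameAttributeSketch {
    static final class Token {
        final String term;
        final String fieldName; // the hypothetical FieldNameAttribute value
        Token(String term, String fieldName) {
            this.term = term;
            this.fieldName = fieldName;
        }
    }

    // The "indexer" side: collect all terms tagged with the given field name.
    static List<String> route(List<Token> tokens, String field) {
        List<String> terms = new ArrayList<>();
        for (Token t : tokens) {
            if (t.fieldName.equals(field)) {
                terms.add(t.term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Token> stream = new ArrayList<>();
        stream.add(new Token("Schindler", "lastName"));
        stream.add(new Token("Uwe", "firstName"));
        System.out.println(route(stream, "lastName")); // [Schindler]
    }
}
```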


Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen


From: DM Smith [] 
Sent: Tuesday, August 11, 2009 5:54 PM
Subject: Re: who clears attributes?



Is this example available?  I think that an example like this would help the
user community see the current value in the change. At least, I'd love to
see the code for it.

-- DM

On 08/10/2009 06:49 PM, Uwe Schindler wrote: 

> UIMA....

The new API looks like UIMA: you have streams whose tokens carry various
attributes that can be exchanged between TokenStreams/TokenFilters, just like
the current FlagsAttribute or TypeAttribute, which can easily be misused for
such things.


About a real use case for the new API:


I talked some time ago with Grant in the podcast about NumericRange and the
Publishing Network for Geoscientific Data, called PANGAEA. At the end of the
talk (available on the Lucid Imagination website), there were some
explanations of how we index our XML documents so that one can ask for the
contents of a specific XML element name (the element name is the field name)
or an XPath-like path as the field name. E.g. if you have an XML document like
this: (please note: this is just a very simple XML schema we use for indexing
our documents). When we index this document type into Lucene, we create a new
field for each element name, e.g. "lastName", "firstName" and so on. One could
easily search for any document where, anywhere (not only in the citation), a
specific "lastName" appears. We also create fields for more general element
names, so you could also look inside the field name "citation" to search
anywhere in the citation. You could also combine them, to only find documents
where the "lastName" of an "author" is "Xyz", by using the field name
"author:lastName". In the past (before the new API), I wrote this analyzer in
a very complicated way: I created StringBuffers for each element name, to
which I appended the text, and then analyzed it again for each field name.
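The field-naming scheme described above can be sketched in a few lines of plain Java (the element names "citation", "author", "lastName" and the ':' separator are taken from the examples in this mail; this is not the real PANGAEA indexing code): given the path of element names from the root down to the text, every path suffix becomes a field name.

```java
// Sketch: derive field names from the XML element path, so text under
// citation/author/lastName is indexed under "lastName", "author:lastName",
// and "citation:author:lastName".
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathFieldNames {
    // For a path like [citation, author, lastName], emit every suffix
    // joined by ':', from most specific leaf name to the full path.
    static List<String> fieldNames(List<String> path) {
        List<String> names = new ArrayList<>();
        for (int start = path.size() - 1; start >= 0; start--) {
            names.add(String.join(":", path.subList(start, path.size())));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames(Arrays.asList("citation", "author", "lastName")));
        // [lastName, author:lastName, citation:author:lastName]
    }
}
```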


Now I pass the XML document to my special XMLTokenStream, which uses STAX/DOM
to retrieve the element names and contents. Each element creates a new
TermAttribute (with the whole contents as one term) and a custom Attribute
holding a reference to the current element name and all previous higher-level
element names (the Attribute contains a Stack of element names). This special
Attribute is then in the Tokenizer chain and is only updated by the root
XMLTokenStream. The next filter in the chain is a WhitespaceFilter (that
splits up the tokens at whitespace), and so on, to further tokenize the
element contents. The special element-name stack attribute is untouched, but
always contains the current element name for later filtering. The last step is
using the new TeeSinkTokenFilter to index the stream into different fields.
The TeeSinkTokenFilter gets Sinks for each field name/element name hierarchy
(which are recorded before); each Sink filters the tokens, using the special
element stack attribute, to find the tokens its field is interested in. By
that I can simply analyze the whole XML document one time and distribute the
contents to various field names using the additional attribute.
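The sink-filtering step of that chain can be sketched in plain Java (the Token class, the matching rule, and the method names here are illustrative stand-ins, not Lucene's TeeSinkTokenFilter API): each token carries the element-name stack set by the root tokenizer, and each sink keeps only the tokens whose stack contains the element path its field is interested in.

```java
// Sketch of the sink idea: route tokens to fields by matching the
// element-name stack attribute against each sink's element path.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SinkRouting {
    static final class Token {
        final String term;
        final List<String> elementStack; // e.g. [citation, author, lastName]
        Token(String term, List<String> elementStack) {
            this.term = term;
            this.elementStack = elementStack;
        }
    }

    // True if the sink's path occurs as a contiguous run in the token's
    // element stack (so a "citation" sink sees everything inside <citation>).
    static boolean matches(Token t, List<String> path) {
        for (int i = 0; i + path.size() <= t.elementStack.size(); i++) {
            if (t.elementStack.subList(i, i + path.size()).equals(path)) {
                return true;
            }
        }
        return false;
    }

    // One "sink" per field: collect the terms whose stack matches its path.
    static List<String> sink(List<Token> stream, List<String> path) {
        List<String> out = new ArrayList<>();
        for (Token t : stream) {
            if (matches(t, path)) out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> stream = Arrays.asList(
            new Token("Schindler", Arrays.asList("citation", "author", "lastName")),
            new Token("Evidence", Arrays.asList("citation", "title")));
        System.out.println(sink(stream, Arrays.asList("author", "lastName"))); // [Schindler]
        System.out.println(sink(stream, Arrays.asList("citation"))); // [Schindler, Evidence]
    }
}
```

The whole document is tokenized once; only the cheap stack comparison runs per sink.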


Here is an example (using the above schema) that shows all documents with a
title of "Evidence from Fram Strait" in the publication to which the dataset
is attached as a supplement:
Strait%22 (which hits only the above example). The query parser is
customized (not the Lucene one).


The final code of this TokenStream is a little bit more complicated than
described here, but it gives a possible usage of the new API: annotate
tokens with field identifiers to, e.g., automatically put the title of a
document into a title field, the authors into another one, and so on.


I hope somebody understood what we are doing here :-)


Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen


From: Shai Erera [] 
Sent: Monday, August 10, 2009 11:13 PM
Subject: Re: who clears attributes?


It sounds like the 'old' API should stay a bit longer than 3.0. We'd like to
give more people a chance to experiment w/ the new API before we claim it is
the new Analysis API in Lucene. And that means that more users will have to
live w/ the "bit of slowness" more than what is believed in this thread.

I personally worry a lot about needing to throw away the current API. I'll
have a lot of code to port over and I haven't read anything so far that
convinces me the new API is better. I don't have any problems w/ the current
API today. I feel I have all the flexibility I need w/ indexing fields. I
use payloads, Field.Index constants, write Analyzers, TokenStreams ...
actually I have 0 complaints.

Maybe we should follow what I seem to read from Earwin and Grant - come up
w/ real use cases, try to implement them w/ the current API, then if it's
impossible, discuss how we can make the current API more adaptive. If at the
end of this we'll get back to the new API, then we'll at least feel better
about it, and more convinced it is the way to go.

Heck ... maybe we'll be convinced to base the Lucene analysis on UIMA? :)


On Mon, Aug 10, 2009 at 11:54 PM, Uwe Schindler <> wrote:

> >> I have serious doubts about releasing this new API until these
> >> performance issues are resolved and better proven out from a
> >> usability
> >> standpoint.
> >
> > I think LUCENE-1796 has fixed the performance problems, which was
> > caused by
> > a missing reflection-cache needed for bw compatibility. I hope to
> > commit
> > soon!
> >
> > 2.9 may be a little bit slower when you mix old and new API and do
> > not reuse
> > Tokenizers (but Robert is already adding reusableTokenStream to all
> > contrib
> > analyzers). When the backwards layer is removed completely or
> > setOnlyUseNewAPI is enabled, there is no speed impact at all.
> >
> The Analysis features of Lucene are the single most common place where
> people enhance Lucene.  Very few add queries, or muck with field
> caches, but they do write their own Analyzers and TokenStreams,
> etc.    Within that, mixing old and new is likely the most common case
> for everyone who has made their own customizations, so a "little bit
> slower" is something I'd rather not live with just for the sake of
> some supposed goodness in a year or two.

But because of this flexibility, we added the backwards layer. The old style
with setUseNewAPI was not flexible at all, and nobody would have moved his
Tokenizers to the new API without that flexibility (maybe he uses external
analyzer packages that are not yet updated).

By "a little bit" I mean that the cost of wrapping between the old and new
API is really minimal: it is just an if statement and a method call, hopefully
optimized away by the JVM. In my tests the standard deviation between
different test runs was much higher than the difference between mixing the
old/new API (on Win32), so it is not really certain that the cost comes from
the delegation.
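The shape of that wrapping cost can be sketched in plain Java (simplified stand-in interfaces and names, not Lucene's real backwards-compatibility code): per token, the layer adds one flag check and one delegating call, which a JIT can typically inline.

```java
// Sketch: a new-API stream that delegates to an old-style stream when a
// flag is set -- the "if statement and a method call" per token.
public class DelegationSketch {
    interface OldStream { String next(); }            // old-style API (simplified)
    interface NewStream { boolean incrementToken(); } // new-style API (simplified)

    static final class Wrapper implements NewStream {
        private final OldStream delegate;
        private final boolean useOldApi;
        String currentTerm;

        Wrapper(OldStream delegate, boolean useOldApi) {
            this.delegate = delegate;
            this.useOldApi = useOldApi;
        }

        @Override
        public boolean incrementToken() {
            if (useOldApi) {                   // the "if statement"
                currentTerm = delegate.next(); // the "method call"
                return currentTerm != null;
            }
            return false; // a real stream would produce tokens natively here
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", null};
        int[] pos = {0};
        Wrapper w = new Wrapper(() -> tokens[pos[0]++], true);
        while (w.incrementToken()) {
            System.out.println(w.currentTerm); // prints "a" then "b"
        }
    }
}
```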

The only case that is really slower (with a now minimized cost of creation in
TokenStream.<init>) is if you do not reuse TokenStreams: two LinkedHashMaps
have to be created and set up. But this is not caused by the backwards layer.




