lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: who clears attributes?
Date Tue, 11 Aug 2009 15:53:33 GMT
Uwe,

Is this example available? I think an example like this would help 
the user community see the value in the change. At least, I'd love 
to see the code for it.

-- DM

On 08/10/2009 06:49 PM, Uwe Schindler wrote:
>
> > UIMA....
>
> The new API looks like UIMA: you have streams that are annotated with 
> various attributes that can be exchanged between 
> TokenStreams/TokenFilters, just like the current FlagsAttribute or 
> TypeAttribute, which can easily be misused for such things.
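>
> As a minimal sketch of that attribute sharing (against the 2.9-era API, 
> where addAttribute still returns an untyped Attribute and needs a cast; 
> the filter itself is made up, only to show the mechanism):
>
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>
> // Hypothetical filter: both references point at the single per-stream
> // attribute instances that every other TokenStream/TokenFilter in the
> // same chain sees as well.
> public final class MarkLongTermsFilter extends TokenFilter {
>   private final TermAttribute termAtt;
>   private final FlagsAttribute flagsAtt;
>
>   public MarkLongTermsFilter(TokenStream input) {
>     super(input);
>     termAtt = (TermAttribute) addAttribute(TermAttribute.class);
>     flagsAtt = (FlagsAttribute) addAttribute(FlagsAttribute.class);
>   }
>
>   public boolean incrementToken() throws IOException {
>     if (!input.incrementToken()) return false;
>     // (Mis)use FlagsAttribute as a side channel, as described above.
>     if (termAtt.termLength() > 10) flagsAtt.setFlags(1);
>     return true;
>   }
> }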
>
> About a real use case for the new API:
>
> I talked some time ago with Grant in the podcast about NumericRange 
> and the Publishing Network for Geoscientific Data called PANGAEA. At 
> the end of the talk (available on the Lucid Imagination website), 
> there were some explanations of how we index our XML documents so that 
> one can ask for the contents of a specific XML element name (the 
> element name is the field name) or an XPath-like path as the field 
> name. E.g., you might have an XML document like this: 
> http://www.pangaea.de/PHP/getxml.php/51675 (please note: this is just 
> a very simple XML schema we use for indexing our documents). When we 
> index this document type into Lucene, we create a new field for each 
> element name, e.g. "lastName", "firstName", and so on. One can easily 
> search for any document where a specific "lastName" appears anywhere 
> (not only in the citation). We also create fields for the more general 
> element names, so you can also look inside the field name "citation" 
> to search anywhere in the citation. You can also combine them, to find 
> only documents where the "lastName" of an "author" is "Xyz", by using 
> the field name "author:lastName". In the past (before the new API), 
> this analyzer was very complicated: I created a StringBuffer for each 
> element name, appended the text to it, and then analyzed it again for 
> each field name.
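>
> Roughly, the old version looked like this (a simplified sketch from 
> memory, names hypothetical; the point is that the same text gets 
> analyzed once per matching field name):
>
> import java.util.HashMap;
> import java.util.Iterator;
> import java.util.List;
> import java.util.Map;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
>
> public class OldXmlIndexerSketch {
>   // One buffer per field name ("lastName", "author:lastName",
>   // "citation", ...), filled while walking the DOM.
>   private final Map buffers = new HashMap();
>
>   // Called for every text node, with all field names derived from the
>   // current element path.
>   void collect(String text, List pathFieldNames) {
>     for (Iterator it = pathFieldNames.iterator(); it.hasNext();) {
>       String fieldName = (String) it.next();
>       StringBuffer buf = (StringBuffer) buffers.get(fieldName);
>       if (buf == null) buffers.put(fieldName, buf = new StringBuffer());
>       buf.append(text).append(' ');
>     }
>   }
>
>   Document toDocument() {
>     Document doc = new Document();
>     for (Iterator it = buffers.entrySet().iterator(); it.hasNext();) {
>       Map.Entry e = (Map.Entry) it.next();
>       // Each field is analyzed separately at indexing time, so the
>       // same text runs through the analyzer once per hierarchy level.
>       doc.add(new Field((String) e.getKey(), e.getValue().toString(),
>                         Field.Store.NO, Field.Index.ANALYZED));
>     }
>     return doc;
>   }
> }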
>
> Now I pass the XML document to my special XMLTokenStream, which uses 
> STAX/DOM to retrieve the element names and contents. Each element 
> creates a new TermAttribute (with the whole contents as one term) and 
> a custom Attribute holding a reference to the current element name 
> and all enclosing higher-level element names (the Attribute contains 
> a Stack of element names). This special Attribute travels through the 
> Tokenizer chain and is only updated by the root XMLTokenStream. The 
> next filter in the chain is a WhitespaceFilter (which splits up the 
> tokens at whitespace), and so on, to further tokenize the element 
> contents. The special element-name stack attribute is untouched, but 
> always contains the current element name for later filtering. The 
> last step is using the new TeeSinkTokenFilter to index the stream 
> into different fields. The TeeSinkTokenFilter gets a Sink for each 
> field name/element name hierarchy (which are recorded beforehand); 
> each Sink filters the tokens using the special element stack 
> attribute, matching the tokens the field is interested in. That way 
> I can analyze the whole XML document just once and distribute the 
> contents to the various field names using the additional attribute.
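>
> Stripped down to the essentials, the custom attribute looks roughly 
> like this (a sketch, not our production code; the Impl-suffix naming 
> is what lets addAttribute find the implementation, and 
> clear/copyTo/equals/hashCode are required by AttributeImpl; raw types 
> because 2.9 still targets Java 1.4):
>
> import java.util.Stack;
> import org.apache.lucene.util.Attribute;
> import org.apache.lucene.util.AttributeImpl;
>
> // Shared attribute holding the stack of open element names, e.g.
> // ["dataSet", "citation", "author", "lastName"].
> public interface ElementPathAttribute extends Attribute {
>   Stack getPath();
>   boolean matches(String fieldName);
> }
>
> public class ElementPathAttributeImpl extends AttributeImpl
>     implements ElementPathAttribute {
>   private final Stack path = new Stack();
>
>   public Stack getPath() { return path; }
>
>   // Hypothetical matching rule: a colon-separated field name like
>   // "author:lastName" must be a suffix of the current element stack.
>   public boolean matches(String fieldName) {
>     String[] parts = fieldName.split(":");
>     if (parts.length > path.size()) return false;
>     for (int i = 0; i < parts.length; i++) {
>       if (!parts[parts.length - 1 - i].equals(path.get(path.size() - 1 - i)))
>         return false;
>     }
>     return true;
>   }
>
>   // Deliberately a no-op: only the root XMLTokenStream pushes and pops
>   // element names; downstream filters must not wipe the stack.
>   public void clear() {}
>
>   public void copyTo(AttributeImpl target) {
>     Stack other = ((ElementPathAttribute) target).getPath();
>     other.clear();
>     other.addAll(path);
>   }
>
>   public boolean equals(Object o) {
>     return o instanceof ElementPathAttributeImpl
>         && path.equals(((ElementPathAttributeImpl) o).path);
>   }
>
>   public int hashCode() { return path.hashCode(); }
> }
>
> The tee/sink wiring then looks roughly like this (xmlChain stands for 
> the XMLTokenStream/WhitespaceFilter chain described above):
>
> // (imports: org.apache.lucene.analysis.TeeSinkTokenFilter,
> //  org.apache.lucene.analysis.TokenStream,
> //  org.apache.lucene.util.AttributeSource)
> TeeSinkTokenFilter tee = new TeeSinkTokenFilter(xmlChain);
> TokenStream lastNameSink = tee.newSinkTokenStream(
>     new TeeSinkTokenFilter.SinkFilter() {
>       public boolean accept(AttributeSource source) {
>         ElementPathAttribute path = (ElementPathAttribute)
>             source.getAttribute(ElementPathAttribute.class);
>         return path.matches("author:lastName");
>       }
>     });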
>
> Here is an example (using the above schema) that shows all documents 
> with a title of "Evidence from Fram Strait" in the publication that 
> the dataset is attached to as a supplement: 
> http://www.pangaea.de/search?q=supplementTo%3Atitle%3A%22Evidence+from+Fram+Strait%22 
> (which hits only the above example). The query parser is customized 
> (not the Lucene one).
>
> The final code of this TokenStream is a little bit more complicated 
> than described here, but it shows a possible usage of the new API: 
> annotate tokens with field identifiers to, e.g., automatically put 
> the title of a document into a title field, the authors into another 
> one, and so on.
>
> I hope somebody understood what we are doing here :-)
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> ------------------------------------------------------------------------
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Monday, August 10, 2009 11:13 PM
> *To:* java-dev@lucene.apache.org
> *Subject:* Re: who clears attributes?
>
> It sounds like the 'old' API should stay a bit longer than 3.0. We'd 
> like to give more people a chance to experiment w/ the new API before 
> we claim it is the new Analysis API in Lucene. And that means that 
> more users will have to live w/ the "bit of slowness" for longer than 
> this thread assumes.
>
> I personally worry a lot about needing to throw away the current API. 
> I'll have a lot of code to port over, and I haven't read anything so 
> far that convinces me the new API is better. I don't have any problems 
> w/ the current API today. I feel I have all the flexibility I need w/ 
> indexing fields. I use payloads, Field.Index constants, write 
> Analyzers, TokenStreams ... actually I have 0 complaints.
>
> Maybe we should follow what I seem to read from Earwin and Grant - 
> come up w/ real use cases, try to implement them w/ the current API, 
> and then, if it's impossible, discuss how we can make the current API 
> more adaptive. If at the end of this we come back to the new API, 
> we'll at least feel better about it, and be more convinced it is the 
> way to go.
>
> Heck ... maybe we'll be convinced to base the Lucene analysis on UIMA? :)
>
> Shai
>
> On Mon, Aug 10, 2009 at 11:54 PM, Uwe Schindler <uwe@thetaphi.de 
> <mailto:uwe@thetaphi.de>> wrote:
>
> > >> I have serious doubts about releasing this new API until these
> > >> performance issues are resolved and better proven out from a
> > >> usability
> > >> standpoint.
> > >
> > > I think LUCENE-1796 has fixed the performance problems, which were
> > > caused by a missing reflection-cache needed for bw compatibility.
> > > I hope to commit soon!
> > >
> > > 2.9 may be a little bit slower when you mix the old and new API and
> > > do not reuse Tokenizers (but Robert is already adding
> > > reusableTokenStream to all contrib analyzers). When the backwards
> > > layer is removed completely, or setOnlyUseNewAPI is enabled, there
> > > is no speed impact at all.
> > >
> >
> >
> > The Analysis features of Lucene are the single most common place where
> > people enhance Lucene.  Very few add queries, or muck with field
> > caches, but they do write their own Analyzers and TokenStreams,
> > etc. Within that, mixing old and new is likely the most common case
> > for everyone who has made their own customizations, so a "little bit
> > slower" is something I'd rather not live with just for the sake of
> > some supposed goodness in a year or two.
>
> But because of this flexibility, we added the backwards layer. The old 
> style with setUseNewAPI was not flexible at all, and nobody would move 
> their Tokenizers to the new API without that flexibility (they may use 
> external analyzer packages that are not yet updated).
>
> With "a little bit" I mean the cost of wrapping the old and new API is
> really minimal, it is just an if statement and a method call, hopefully
> optimized away by the JVM. In my tests the standard deviation between
> different test runs was much higher than the difference between mixing
> old/new API (on Win32), so it is not really sure, that the cost comes from
> the delegation.
>
> The only case that is really slower is the (now minimized) cost of 
> creation in TokenStream.<init> when you do not reuse TokenStreams: two 
> LinkedHashMaps have to be created and set up. But this is not caused 
> by the backwards layer.
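>
> For reference, the reuse pattern looks roughly like this (the usual 
> 2.9 reusableTokenStream idiom, sketched here with a trivial chain):
>
> import java.io.IOException;
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.LowerCaseFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.WhitespaceTokenizer;
>
> public class MyAnalyzer extends Analyzer {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     // Non-reusing path: a fresh chain (and its attribute maps) per call.
>     return new LowerCaseFilter(new WhitespaceTokenizer(reader));
>   }
>
>   private static final class SavedStreams {
>     Tokenizer source;
>     TokenStream result;
>   }
>
>   public TokenStream reusableTokenStream(String fieldName, Reader reader)
>       throws IOException {
>     SavedStreams streams = (SavedStreams) getPreviousTokenStream();
>     if (streams == null) {
>       // First use on this thread: build the chain once and remember it.
>       streams = new SavedStreams();
>       streams.source = new WhitespaceTokenizer(reader);
>       streams.result = new LowerCaseFilter(streams.source);
>       setPreviousTokenStream(streams);
>     } else {
>       // Later uses: just point the existing chain at the new Reader, so
>       // no new TokenStream (and no new LinkedHashMaps) is created.
>       streams.source.reset(reader);
>     }
>     return streams.result;
>   }
> }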
>
> Uwe
>

