Subject: RE: Can you create a Field that is a copy of another Field?
Date: Mon, 30 Jun 2008 09:49:53 -0400

Matthew,

It has to do with the fact that we're trying to represent these Property
entities hierarchically.  We are displaying them in a tree structure,
similar to the way Windows Explorer displays the directories and files
in your file system.  E.g. all the states would be at the root level.
If you expanded a particular state you would see all the cities in that
state, etc.

If the user does a search we want to filter or "reduce" the tree.  E.g.
imagine you search on the term 'Smith'.  Well, since it's a safe bet
that there's somebody with the last name of Smith in all fifty states,
all fifty states would show up at the root level.
On the other hand, suppose there's one guy in the whole country with
the last name of 'Fleebleflabble' and he lives in Michigan.  If I search
on that term I would expect only one state, namely Michigan, to show up
at the root level.  Each level in the hierarchy is filtered by the
specified search terms in this way.

Searches are not limited to people's names though.  We want to reduce
the tree by matches on ANY field in the Properties from 'State' to
'Name'.  So for example, a search on 'Smith' would also return matches
for everybody who lives in a city named 'Smith City' or on a street
named 'Smith Avenue', etc.

This doesn't make a lot of sense for people and addresses, I admit.  I
just used that as an easy-to-follow example.  But it does make sense for
the data we're storing.  And BTW, maybe you can see a few holes in this
approach.  There's a bit more to it than I've described above.  We have
had to get a little creative with other documents and fields in order
for it to work correctly.  I'd be happy to elaborate if anybody is
interested.  There may be better ways to do it.  Like I said, I'm fairly
new to Lucene.  Was just trying to keep it simple.
--
Bill

-----Original Message-----
From: Matthew Hall [mailto:mhall@informatics.jax.org]
Sent: Monday, June 30, 2008 8:26 AM
To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

Sorry, didn't get this until this morning.

Yes, both fields should be indexed and searchable, though the data_type
one should likely be untokenized.

Data should be indexed and tokenized with whatever appropriate Analyzer
works for your data.

As for what you're indexing, may I ask why you are doing it like that?

I would have thought indexing each property separately (a separate doc)
would have been sufficient for your needs, but if you can explain a bit
more about your situation perhaps I can be more helpful on this matter?
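
For readers following the thread, the db_key / data / data_type layout
discussed here can be pictured with a toy in-memory model.  This is only
a sketch in plain Java, not Lucene code: the class PropertyIndexSketch,
its Doc type and the search method are made up for illustration, and
matching is exact and case-sensitive, whereas a real Analyzer would
normally normalize tokens.  The two documents and their db_key values
are the ones quoted further down in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy in-memory stand-in for the index; none of this is Lucene API. */
class PropertyIndexSketch {

    /** One "document" per Property: db_key, data tokens, data_type. */
    static final class Doc {
        final String dbKey;      // key back into the relational table
        final Set<String> data;  // tokenized, searchable terms
        final String dataType;   // single untokenized term, e.g. "State"

        Doc(String dbKey, String dataType, String... dataTokens) {
            this.dbKey = dbKey;
            this.dataType = dataType;
            this.data = new HashSet<>(Arrays.asList(dataTokens));
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    void add(Doc doc) {
        docs.add(doc);
    }

    /**
     * Mimics "+data:term" when dataType is null, and
     * "+data:term +data_type:type" otherwise.
     * Returns the db_keys of the matching documents in insertion order.
     */
    List<String> search(String term, String dataType) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean termMatches = d.data.contains(term);
            boolean typeMatches = dataType == null || d.dataType.equals(dataType);
            if (termMatches && typeMatches) {
                hits.add(d.dbKey);
            }
        }
        return hits;
    }

    /** Builds the two example documents quoted in this thread. */
    static PropertyIndexSketch example() {
        PropertyIndexSketch index = new PropertyIndexSketch();
        String[] tokens = {"California", "Sacremento", "94203",
                           "South", "Main", "1234", "Joe", "Smith"};
        index.add(new Doc("1395", "State", tokens));
        index.add(new Doc("2405", "City", tokens));
        return index;
    }

    public static void main(String[] args) {
        PropertyIndexSketch index = example();
        System.out.println(index.search("South", null));   // prints [1395, 2405]
        System.out.println(index.search("South", "City")); // prints [2405]
    }
}
```

The point of the layout is that the property name becomes a value in one
fixed field rather than a field name of its own, so restricting a search
to 'City' Properties is just one extra required clause.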

Matt

Bill.Chesky@sungard.com wrote:
> Hmmm, I think maybe I am missing something.  In your design is the
> 'data' field indexed, i.e. searchable?  Or is it an unindexed, stored
> field?
>
> I was thinking that both 'data' and 'data_type' were indexed and
> searchable.
>
> Maybe the confusion stems from the fact that for the Document
> corresponding to "State=California", we're not just indexing on the
> token 'California'.  We're indexing on all the tokens from all the
> Properties in the set of Properties corresponding to a person's
> address.  In my original example this would be: California, Sacremento,
> 94203, South, Main, 1234, Joe and Smith.
>
> For the 'data_type' field I was thinking you were saying we'd index on
> a single token, namely 'State' (or whatever the left-hand side is).
>
> Does that make sense?
> --
> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
> www.sungard.com/energy
>
>
> -----Original Message-----
> From: Matthew Hall [mailto:mhall@informatics.jax.org]
> Sent: Friday, June 27, 2008 3:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: Can you create a Field that is a copy of another Field?
>
> Yup, you're pretty much there.
>
> The only part I'm a bit confused about is what you've said in your data
> field there,
>
> I'm thinking you mean that for the data_type: "State", you would have
> the data entry of "California", right?
>
> If so, then yup, you are spot on ^^
>
> We use this technique all the time on our side, and it's helped
> considerably.  We then use the db_key to reference into a display time
> cache that holds all of the display information for the underlying
> object that we would ever want to present to the user.
> This allows our
> search time index to be very concise, and as a result nearly every
> search we hit it with is subsecond, which is a nice place to be ^^
>
> Matt
>
> Bill.Chesky@sungard.com wrote:
>
>> Matthew,
>>
>> Thanks for the reply.  This looks very interesting.  If I'm
>> understanding correctly your db_key, data and data_type are Fields
>> within the Document, correct?  So is this how you envision it?
>>
>> Document: State=California
>>     Field: 'db_key'='1395' (primary key into relational table, correct?)
>>     Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
>>     Field: 'data_type' indexed by 'State'
>>
>> Document: City=Sacremento
>>     Field: 'db_key'='2405'
>>     Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
>>     Field: 'data_type' indexed by 'City'
>>
>> Then my query for all Properties would be:
>>
>>     +data:South
>>
>> My query for only 'City' Properties would be:
>>
>>     +data:South +data_type:City
>>
>> Is that right?
>>
>> I think that would work.  Very nice.  Thank you very much!!!!
>> --
>> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
>> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
>> www.sungard.com/energy
>>
>>
>> -----Original Message-----
>> From: Matthew Hall [mailto:mhall@informatics.jax.org]
>> Sent: Friday, June 27, 2008 11:49 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Can you create a Field that is a copy of another Field?
>>
>> I'm not sure if this is helpful, but I do something VERY similar to
>> this in my project.
>>
>> So, for the example you are citing I would design my index as follows:
>>
>> db_key, data, data_type
>>
>> Where the data_type is some sort of value representing the thing that's
>> on the left hand side of your property relationship there.
>>
>> So, then in order to satisfy your search, the queries become quite
>> simple:
>>
>> The search for everything simply searches against the data field in
>> this index, whereas the search for a specific data_type + searchterm
>> becomes a simple boolean query that has a MUST clause for the data_type
>> value.
>>
>> As an even BETTER bonus, this will then mean that all of your
>> searchable values will now have relevance to each other at scoring
>> time, which is quite useful in the long run.
>>
>> Hope this helps you out,
>>
>> Matt
>>
>> Bill.Chesky@sungard.com wrote:
>>
>>> Grant,
>>>
>>> Thanks for the reply.  What we're trying to do is kind of esoteric
>>> and hard to explain without going into a lot of gory details so I was
>>> trying to keep it simple.  But I'll try to summarize.
>>>
>>> We're trying to index entities in a relational database.  One of the
>>> entities we're trying to index is something called a Property.  Think
>>> of a Property kind of like the java.util.Properties class, i.e. a
>>> name/value pair.  So some examples of Properties might be:
>>>
>>> State=California
>>> City=Sacremento
>>> ZipCode=94203
>>> StreetName=South Main
>>> StreetNumber=1234
>>> Name=Joe Smith
>>>
>>> Etc., etc.
>>>
>>> (Note: this isn't the type of data we're storing... just trying to
>>> keep it simple.)
>>>
>>> Imagine that the above list represents the set of Properties that
>>> specify the address for a single person, Joe Smith.  Each Property in
>>> the set will be indexed by the values on the right-hand side of all
>>> the other name/value pairs in the set, i.e.: California, Sacremento,
>>> 94203, South, Main, 1234, Joe and Smith.
>>>
>>> There are two types of queries that we want to do.
>>> 1) retrieve every Property matching the specified search terms,
>>> regardless of its left-hand side.  For this we want to create a field
>>> in EVERY Document called "keywords" and index it by the right-hand
>>> side values as described above.
>>> 2) retrieve every Property with a given left-hand side that matches
>>> the specified search terms.  For example, find all the 'City'
>>> Properties that match the term 'South'.  For this we want to create a
>>> field with the name of the left-hand side (e.g. State, City, ZipCode,
>>> etc.) but only in those Documents that correspond to a Property with
>>> that left-hand side.  Again this field will be indexed by the
>>> right-hand side values as described above.
>>>
>>> So a couple of examples from the above list might look something
>>> like:
>>>
>>> Document: State=California
>>>     Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>>>     Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.
>>>
>>> Document: City=Sacremento
>>>     Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>>>     Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.
>>>
>>> Now if I'm interested in all the Properties that match the word
>>> "South", I search the index on the "keywords" field for the term
>>> "South".  This will return both documents above.
>>>
>>> But if I'm only interested in any 'City' Properties that match the
>>> term 'South' I search the index on the "City" field for the term
>>> "South".  This will only return the 'City=Sacremento' document above
>>> because it's the only Document of the two that even has a 'City' field
>>> in it.
>>>
>>> But in any case, the 'State' field and the 'City' field are indexed
>>> exactly the same way as the 'keywords' field.  Which is why I was
>>> wondering if there was a way to just create these fields as copies of
>>> the 'keywords' field.
>>>
>>> Here is a code sample where I'm creating the index.  We're using
>>> Hibernate Search to search the indexes, thus the "id" and
>>> "_hibernate_class" fields.
>>>
>>> Query q = em.createQuery("select p from Property p");
>>>
>>> List<Property> properties = q.getResultList();
>>>
>>> for (Property p : properties)
>>> {
>>>     // Indexing property.
>>>     Document doc = new Document();
>>>     doc.add(new Field("id",
>>>                       Integer.toString(p.getId()),
>>>                       Field.Store.YES,
>>>                       Field.Index.UN_TOKENIZED));
>>>     doc.add(new Field("_hibernate_class",
>>>                       Property.class.getCanonicalName(),
>>>                       Field.Store.YES,
>>>                       Field.Index.UN_TOKENIZED));
>>>     TokenStream tokenStream = new PropertyTokenStream(p);
>>>     doc.add(new Field("keywords", tokenStream));
>>>     // Here is where I would like to add the second field that is a copy
>>>     // of the "keywords" field just created above.  Note: the call
>>>     // p.getCharacteristic().getName() is getting the name of the
>>>     // left-hand side of the Property as described above.
>>>     TokenStream copyStream = new PropertyTokenStream(p);
>>>     doc.add(new Field(p.getCharacteristic().getName(), copyStream));
>>>     // Add the document once, after both fields have been attached.
>>>     propertyIndexWriter.addDocument(doc);
>>>     tokenStream.close();
>>>     copyStream.close();
>>> }
>>>
>>> Hope that clears it up.
>>>
>>> BTW, in case this seems like a strange way to index things, I will
>>> also add that we are doing it this way in order to impose a
>>> hierarchical structure on Properties.  So my example above should
>>> really look like this:
>>>
>>> State=California
>>>   City=Sacremento
>>>     ZipCode=94203
>>>       StreetName=South Main
>>>         StreetNumber=1234
>>>           Name=Joe Smith
>>>
>>> Use your imagination to visualize what the tree might look like with
>>> millions of people's addresses.  Now imagine trying to tokenize the
>>> Document corresponding to "State=California".  Each path thru the tree
>>> from root (State) to leaf (Name) represents a set of Properties that
>>> is used to index the "keywords" field in the "State=California"
>>> document.  In other words it takes a long time to index.  This is why
>>> I'm looking for a way to just copy one field to another.
>>>
>>> There is a lot more to our design to facilitate this hierarchical
>>> structure but this is probably more than you wanted to know.
>>> :)
>>>
>>> thanks in advance,
>>> --
>>> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
>>> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
>>> www.sungard.com/energy
>>>
>>>
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>>> Sent: Friday, June 27, 2008 7:26 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Can you create a Field that is a copy of another Field?
>>>
>>>
>>> On Jun 27, 2008, at 12:01 AM, wrote:
>>>
>>>> Hello Lucene Gurus,
>>>>
>>>> I'm new to Lucene so sorry if this question is basic or naïve.
>>>>
>>>> I have a Document to which I want to add a Field named, say, "foo"
>>>> that is tokenized, indexed and unstored.  I am using the
>>>> "Field(String name, TokenStream tokenStream)" constructor to create
>>>> it.  The TokenStream may take a fairly long time to return all its
>>>> tokens.
>>>>
>>> Can you share some code here?  What's the reasoning behind using it
>>> (not saying it's wrong, just wondering what led you to it)?  Are you
>>> just loading it up from a file, string or something or do you have
>>> another reason?
>>>
>>>> Now for querying reasons I want to add another Field named, say,
>>>> "bar", that is tokenized and indexed in exactly the same way as
>>>> "foo".  I could just pass it the same TokenStream that I used to
>>>> create "foo" but since it takes so long to return all its tokens, I
>>>> was wondering if there is a way to, say, create "bar" as a copy of
>>>> "foo".  I looked thru the javadoc but didn't see anything.
>>>>
>>> By exactly the same, do you really mean exactly the same?  What's the
>>> point of that?  What are the "querying reasons"?
>>>
>>> You may want to look at the TeeTokenFilter and the SinkTokenizer, but
>>> I guess I'd like to know more about what's going on before fully
>>> recommending anything.
>>>
>>>> Is this possible in Lucene or do I just have to bite the bullet and
>>>> build the new Field using the same TokenStream again?
>>>>
>>>> --
>>>> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
>>>> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
>>>> www.sungard.com/energy
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012
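
The question that started this thread, producing the tokens once and
feeding them to a second field, can be illustrated without any Lucene
classes.  The sketch below is plain Java and makes several assumptions:
TokenReplaySketch, expensiveTokens and buffer are hypothetical names,
the token list is the address example from the thread, and the whole
token sequence is assumed to fit in memory.  It is not the TeeTokenFilter
or SinkTokenizer that Grant mentions, only the general idea of draining
a slow source once and replaying the buffered result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java illustration of "tokenize once, replay for a second field". */
class TokenReplaySketch {

    /** Counts how often the (hypothetically expensive) source is walked. */
    static int expensivePasses = 0;

    /** Stand-in for a slow token producer such as PropertyTokenStream. */
    static Iterator<String> expensiveTokens() {
        expensivePasses++;
        return Arrays.asList("California", "Sacremento", "94203",
                             "South", "Main", "1234", "Joe", "Smith").iterator();
    }

    /** Drains the given source exactly once into a reusable buffer. */
    static List<String> buffer(Iterator<String> source) {
        List<String> tokens = new ArrayList<>();
        while (source.hasNext()) {
            tokens.add(source.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The slow source is consumed a single time.
        List<String> cached = buffer(expensiveTokens());

        // Each field gets its own cheap iterator over the same buffer.
        Iterator<String> keywordsTokens = cached.iterator();
        Iterator<String> cityTokens = cached.iterator();

        System.out.println(buffer(keywordsTokens).equals(buffer(cityTokens))); // prints true
        System.out.println(expensivePasses);                                   // prints 1
    }
}
```

The trade-off is memory: the buffer holds every token of the document at
once, which may matter for a document as large as the
"State=California" case described above.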