Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
Content-Class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Subject: RE: Can you create a Field that is a copy of another Field?
Date: Mon, 30 Jun 2008 09:49:53 -0400
Message-ID: 
 <87D5FFD601E4BC488D51E45D414706C7031C716F@VOO-EXCHANGE01.internal.sungard.corp>
In-Reply-To: <4868D0C2.9050500@informatics.jax.org>
Thread-Topic: Can you create a Field that is a copy of another Field?
Thread-Index: AcjarPzOYWitg66wRyaqzlch16S4iwABd3bg
From: <Bill.Chesky@sungard.com>
To: <java-user@lucene.apache.org>

Matthew,

It has to do with the fact that we're trying to represent these Property
entities hierarchically.  We are displaying them in a tree structure,
similar to the way Windows Explorer displays the directories and files in
your file system.  E.g. all the states would be at the root level.  If you
expanded a particular state you would see all the cities in that state,
etc.

If the user does a search we want to filter or "reduce" the tree.  E.g.
imagine you search on the term 'Smith'.  Since it's a safe bet that
there's somebody with the last name of Smith in all fifty states, all
fifty states would show up at the root level.  On the other hand, suppose
there's one guy in the whole country with the last name of
'Fleebleflabble' and he lives in Michigan.  If I search on that term I
would expect only one state, namely Michigan, to show up at the root
level.  Each level in the hierarchy is filtered by the specified search
terms in this way.

Searches are not limited to people's names though.  We want to reduce the
tree by matches on ANY field in the Properties from 'State' to 'Name'.  So
for example, a search on 'Smith' would return matches for everybody that
lived in a city named 'Smith City' or on a street named 'Smith Avenue',
etc.
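
To make the reduction concrete, here is a minimal sketch in plain Java.  It
is not our real code and does not use Lucene at all; the class name
TreeReduction, the method names and the sample paths are made up for
illustration.  It assumes each root-to-leaf path is a list of values and
that a node stays visible if any token anywhere on a path through it
equals the search term.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: an in-memory model of the tree reduction.
public class TreeReduction {

    // True if any whitespace-separated token in any value on the path
    // equals the search term (case-insensitive).
    public static boolean pathMatches(List<String> path, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String value : path) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.equals(needle)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Values that remain visible at the given tree level (0 = root) after
    // filtering by the term.  A node survives if at least one path through
    // it matches.
    public static Set<String> visibleAtLevel(List<List<String>> paths,
                                             int level, String term) {
        Set<String> visible = new TreeSet<>();
        for (List<String> path : paths) {
            if (level < path.size() && pathMatches(path, term)) {
                visible.add(path.get(level));
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Invented sample data: State, City, StreetName, Name.
        List<List<String>> paths = Arrays.asList(
            Arrays.asList("California", "Sacramento", "South Main", "Joe Smith"),
            Arrays.asList("Michigan", "Ann Arbor", "Oak Valley", "Al Fleebleflabble"),
            Arrays.asList("Ohio", "Smith City", "Elm", "Ann Jones"));
        System.out.println(visibleAtLevel(paths, 0, "Smith"));
        System.out.println(visibleAtLevel(paths, 0, "Fleebleflabble"));
    }
}
```

With this sample data, a search on 'Smith' keeps California (because of
'Joe Smith') and Ohio (because of 'Smith City') at the root level, while a
search on 'Fleebleflabble' keeps only Michigan.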

This doesn't make a lot of sense for people and addresses, I admit.  I
just used that as an easy-to-follow example.  But it does make sense for
the data we're storing.  And BTW, maybe you can see a few holes in this
approach.  There's a bit more to it than I've described above.  We have
had to get a little creative with other documents and fields in order for
it to work correctly.  I'd be happy to elaborate if anybody is interested.
There may be better ways to do it.  Like I said, I'm fairly new to Lucene.
I was just trying to keep it simple.

--
Bill

-----Original Message-----
From: Matthew Hall [mailto:mhall@informatics.jax.org]
Sent: Monday, June 30, 2008 8:26 AM
To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

Sorry, didn't get this until this morning.

Yes, both fields should be indexed and searchable, though the data_type
one should likely be untokenized.

Data should be indexed and tokenized with whatever appropriate Analyzer
works for your data.
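
As a rough illustration of how the two fields behave at query time, here
is a small plain-Java model.  It is only a sketch, not Lucene code: the
class DataTypeIndexSketch, its Doc type and the whitespace/lowercase
"analyzer" are stand-ins assumed for the example.  The db_key values 1395
and 2405 are taken from earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the db_key / data / data_type layout.
public class DataTypeIndexSketch {

    public static final class Doc {
        final String dbKey;     // key back into the relational table
        final String dataType;  // one untokenized term, e.g. "City"
        final Set<String> data; // tokenized terms

        public Doc(String dbKey, String dataType, String text) {
            this.dbKey = dbKey;
            this.dataType = dataType;
            // Crude stand-in for an Analyzer: lowercase, split on whitespace.
            this.data = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        }
    }

    // dataType == null models the query "+data:term";
    // otherwise it models "+data:term +data_type:dataType".
    public static List<String> search(List<Doc> index, String term,
                                      String dataType) {
        List<String> keys = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Doc d : index) {
            boolean termHit = d.data.contains(needle);
            // data_type is matched exactly because it is untokenized.
            boolean typeHit = dataType == null || d.dataType.equals(dataType);
            if (termHit && typeHit) {
                keys.add(d.dbKey);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        String text = "California Sacramento 94203 South Main 1234 Joe Smith";
        List<Doc> index = Arrays.asList(
            new Doc("1395", "State", text),
            new Doc("2405", "City", text));
        System.out.println(search(index, "South", null));
        System.out.println(search(index, "South", "City"));
    }
}
```

The unrestricted search returns both keys, and the one with the data_type
clause returns only the City document's key.  Keeping data_type as a
single exact term is what makes the second clause a cheap filter.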

As for what you're indexing, may I ask why you are doing it like that?

I would have thought indexing each property separately (a separate doc)
would have been sufficient for your needs, but if you can explain a bit
more about your situation perhaps I can be more helpful on this matter?

Matt

Bill.Chesky@sungard.com wrote:
> Hmmm, I think maybe I am missing something.  In your design is the
> 'data' field indexed, i.e. searchable?  Or is it an unindexed, stored
> field?
>
> I was thinking that both 'data' and 'data_type' were indexed and
> searchable.
>
> Maybe the confusion stems from the fact that for the Document
> corresponding to "State=California", we're not just indexing on the
> token 'California'.  We're indexing on all the tokens from all the
> Properties in the set of Properties corresponding to a person's address.
> In my original example this would be: California, Sacramento, 94203,
> South, Main, 1234, Joe and Smith.
>
> For the 'data_type' field I was thinking you were saying we'd index on
> a single token, namely 'State' (or whatever the left-hand side is).
>
> Does that make sense?
> --
> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194
> Oak Valley Drive * Ann Arbor, MI 48103
> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
> www.sungard.com/energy
>
>
> -----Original Message-----
> From: Matthew Hall [mailto:mhall@informatics.jax.org]
> Sent: Friday, June 27, 2008 3:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: Can you create a Field that is a copy of another Field?
>
> Yup, you're pretty much there.
>
> The only part I'm a bit confused about is what you've said in your data
> field there.
>
> I'm thinking you mean that for the data_type: "State", you would have
> the data entry of "California", right?
>
> If so, then yup, you are spot on ^^
>
> We use this technique all the time on our side, and it's helped
> considerably.  We then use the db_key to reference into a display-time
> cache that holds all of the display information for the underlying
> object that we would ever want to present to the user.  This allows our
> search-time index to be very concise, and as a result nearly every
> search we hit it with is subsecond, which is a nice place to be ^^
>
> Matt
>
> Bill.Chesky@sungard.com wrote:
>> Matthew,
>>
>> Thanks for the reply.  This looks very interesting.  If I'm
>> understanding correctly, your db_key, data and data_type are Fields
>> within the Document, correct?  So is this how you envision it?
>>
>> Document: State=California
>>    Field: 'db_key'='1395' (primary key into relational table, correct?)
>>    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
>>    Field: 'data_type' indexed by 'State'
>>
>> Document: City=Sacramento
>>    Field: 'db_key'='2405'
>>    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
>>    Field: 'data_type' indexed by 'City'
>>
>> Then my query for all Properties would be:
>>
>> 	+data:South
>>
>> My query for only 'City' Properties would be:
>>
>> 	+data:South +data_type:City
>>
>> Is that right?
>>
>> I think that would work.  Very nice.  Thank you very much!!!!
>> --
>> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194
>> Oak Valley Drive * Ann Arbor, MI 48103
>> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
>> www.sungard.com/energy
>>
>>
>> -----Original Message-----
>> From: Matthew Hall [mailto:mhall@informatics.jax.org]
>> Sent: Friday, June 27, 2008 11:49 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Can you create a Field that is a copy of another Field?
>>
>> I'm not sure if this is helpful, but I do something VERY similar to this
>> in my project.
>>
>> So, for the example you are citing I would design my index as follows:
>>
>> db_key, data, data_type
>>
>> Where the data_type is some sort of value representing the thing that's
>> on the left-hand side of your property relationship there.
>>
>> So, then in order to satisfy your search, the queries become quite simple:
>>
>> The search for everything simply searches against the data field in this
>> index, whereas the search for a specific data_type + search term becomes a
>> simple boolean query that has a MUST clause for the data_type value.
>>
>> As an even BETTER bonus, this will then mean that all of your searchable
>> values will now have relevance to each other at scoring time, which is
>> quite useful in the long run.
>>
>> Hope this helps you out,
>>
>> Matt
>>
>> Bill.Chesky@sungard.com wrote:
>>> Grant,
>>>
>>> Thanks for the reply.  What we're trying to do is kind of esoteric and
>>> hard to explain without going into a lot of gory details, so I was
>>> trying to keep it simple.  But I'll try to summarize.
>>>
>>> We're trying to index entities in a relational database.  One of the
>>> entities we're trying to index is something called a Property.  Think
>>> of a Property kind of like the java.util.Properties class, i.e. a
>>> name/value pair.  So some examples of Properties might be:
>>>
>>> State=California
>>> City=Sacramento
>>> ZipCode=94203
>>> StreetName=South Main
>>> StreetNumber=1234
>>> Name=Joe Smith
>>>
>>> Etc., etc.
>>>
>>> (Note: this isn't the type of data we're storing... just trying to
>>> keep it simple.)
>>>
>>> Imagine that the above list represents the set of Properties that
>>> specify the address for a single person, Joe Smith.  Each Property in
>>> the set will be indexed by the values on the right-hand side of all
>>> the name/value pairs in the set, i.e.: California, Sacramento, 94203,
>>> South, Main, 1234, Joe and Smith.
>>>
>>> There are two types of queries that we want to do.
>>> 1) Retrieve every Property matching the specified search terms,
>>>    regardless of its left-hand side.  For this we want to create a
>>>    field in EVERY Document called "keywords" and index it by the
>>>    right-hand side values as described above.
>>> 2) Retrieve every Property with a given left-hand side that matches
>>>    the specified search terms.  For example, find all the 'City'
>>>    Properties that match the term 'South'.  For this we want to create
>>>    a field with the name of the left-hand side (e.g. State, City,
>>>    ZipCode, etc.) but only in those Documents that correspond to a
>>>    Property with that left-hand side.  Again this field will be indexed
>>>    by the right-hand side values as described above.
>>>
>>> So a couple of examples from the above list might look something like:
>>>
>>> Document: State=California
>>>   Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
>>>   Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.
>>>
>>> Document: City=Sacramento
>>>   Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
>>>   Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.
>>>
>>> Now if I'm interested in all the Properties that match the word
>>> "South", I search the index on the "keywords" field for the term
>>> "South".  This will return both documents above.
>>>
>>> But if I'm only interested in any 'City' Properties that match the
>>> term 'South', I search the index on the "City" field for the term
>>> "South".  This will only return the 'City=Sacramento' document above
>>> because it's the only Document of the two that even has a 'City' field
>>> in it.
>>>
>>> But in any case, the 'State' field and the 'City' field are indexed
>>> exactly the same way as the 'keywords' field.  Which is why I was
>>> wondering if there was a way to just create these fields as copies of
>>> the 'keywords' field.
>>>
>>> Here is a code sample where I'm creating the index.  We're using
>>> Hibernate Search to search the indexes, thus the "id" and
>>> "_hibernate_class" fields.
>>>
>>> Query q = em.createQuery("select p from Property p");
>>>
>>> List<Property> properties = q.getResultList();
>>>
>>> for (Property p : properties)
>>> {
>>>     // Indexing property.
>>>     Document doc = new Document();
>>>     doc.add(new Field("id",
>>>                       Integer.toString(p.getId()),
>>>                       Field.Store.YES,
>>>                       Field.Index.UN_TOKENIZED));
>>>     doc.add(new Field("_hibernate_class",
>>>                       Property.class.getCanonicalName(),
>>>                       Field.Store.YES,
>>>                       Field.Index.UN_TOKENIZED));
>>>     TokenStream tokenStream = new PropertyTokenStream(p);
>>>     doc.add(new Field("keywords", tokenStream));
>>>     // Here is where I would like to add the second field that is a copy
>>>     // of the "keywords" field just created above.  Note: the call
>>>     // p.getCharacteristic().getName() is getting the name of the
>>>     // left-hand side of the Property as described above.
>>>     TokenStream copyStream = new PropertyTokenStream(p);
>>>     doc.add(new Field(p.getCharacteristic().getName(), copyStream));
>>>     // Add the document once, after both fields are attached.  The
>>>     // streams are consumed during addDocument(), so close them after.
>>>     propertyIndexWriter.addDocument(doc);
>>>     tokenStream.close();
>>>     copyStream.close();
>>> }
>>>
>>> Hope that clears it up.
>>>
>>> BTW, in case this seems like a strange way to index things, I will
>>> also add that we are doing it this way in order to impose a
>>> hierarchical structure on Properties.  So my example above should
>>> really look like this:
>>>
>>> State=California
>>>     City=Sacramento
>>>         ZipCode=94203
>>>             StreetName=South Main
>>>                 StreetNumber=1234
>>>                     Name=Joe Smith
>>>
>>> Use your imagination to visualize what the tree might look like with
>>> millions of people's addresses.  Now imagine trying to tokenize the
>>> Document corresponding to "State=California".  Each path thru the tree
>>> from root (State) to leaf (Name) represents a set of Properties that is
>>> used to index the "keywords" field in the "State=California" document.
>>> In other words it takes a long time to index.  This is why I'm looking
>>> for a way to just copy one field to another.
>>>
>>> There is a lot more to our design to facilitate this hierarchical
>>> structure but this is probably more than you wanted to know. :)
>>>
>>> thanks in advance,
>>> --
>>> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194
>>> Oak Valley Drive * Ann Arbor, MI 48103
>>> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
>>> www.sungard.com/energy
>>>
>>>
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>>> Sent: Friday, June 27, 2008 7:26 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Can you create a Field that is a copy of another Field?
>>>
>>>
>>> On Jun 27, 2008, at 12:01 AM, <Bill.Chesky@sungard.com> wrote:
>>>
>>>> Hello Lucene Gurus,
>>>>
>>>>
>>>>
>>>> I'm new to Lucene so sorry if this question is basic or naïve.
>>>>
>>>>
>>>>
>>>> I have a Document to which I want to add a Field named, say, "foo"
>>>> that is tokenized, indexed and unstored.  I am using the
>>>> "Field(String name, TokenStream tokenStream)" constructor to create
>>>> it.  The TokenStream may take a fairly long time to return all its
>>>> tokens.
>>>>
>>> Can you share some code here?  What's the reasoning behind using it
>>> (not saying it's wrong, just wondering what led you to it)?  Are you
>>> just loading it up from a file, string or something or do you have
>>> another reason?
>>>
>>>> Now for querying reasons I want to add another Field named, say,
>>>> "bar", that is tokenized and indexed in exactly the same way as
>>>> "foo".  I could just pass it the same TokenStream that I used to
>>>> create "foo" but since it takes so long to return all its tokens, I
>>>> was wondering if there is a way to say, create "bar" as a copy of
>>>> "foo".  I looked thru the javadoc but didn't see anything.
>>>>
>>> By exactly the same, do you really mean exactly the same?  What's the
>>> point of that?  What are the "querying reasons"?
>>>
>>> You may want to look at the TeeTokenFilter and the SinkTokenizer, but
>>> I guess I'd like to know more about what's going on before fully
>>> recommending anything.
>>>
>>>> Is this possible in Lucene or do I just have to bite the bullet and
>>>> build the new Field using the same TokenStream again?
>>>>
>>>> --
>>>> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194
>>>> Oak Valley Drive * Ann Arbor, MI 48103
>>>> Tel 734-332-4405 * Fax 734-332-4440 * bill.chesky@sungard.com
>>>> www.sungard.com/energy
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

