lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Pine <general_n...@yahoo.com>
Subject Re: faceting and categorizing on color?
Date Tue, 20 Jun 2006 17:32:45 GMT
> First off, let me clear up somethign regarding your
> index field structure,
> you mentioned that you currently have documents that
> look like this...
> 
> : IMAGE	1
> : COLORS	F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF
> : E00000 EE0000 EEE000 EEEE00
> 
> If you are indexing it as Field.Keyword and you can
> query for
> COLORS:FFFFFF and get results, then either you are
> only only getting
> documents that are 100% white, or when you indexed
> the Documents you added
> eeach collor as a seperate field instance 

I thought that having: F00000 FF0000 FFFFFF FFFF00...
in one field and then searching for FFFFFF in it would
match all documents that contain that "word" so I
would be finding documents that could be < 100% white.
In either case, I don't actually have a document
structure yet ;o) It seems though that you are
recommending this document structure:

Field Value
IMAGE 1
COLOR FF0000
COLOR 00FF00 
COLOR 0000FF
COLOR 808080


And then if this image were 50% Red, 25% Green, 12.5%
Blue and 12.5% Grey the document would look like this:

Field Value
IMAGE 1
COLOR FF0000
COLOR FF0000
COLOR FF0000
COLOR FF0000
COLOR 00FF00 
COLOR 00FF00 
COLOR 0000FF
COLOR 808080

Is that correct? I wrote a unit test to compare term
document frequencies while iterating through the
TermEnum object returned by IndexReader.terms() and
the counts were equal. I guess I am still not clear on
what the differences/advantages/disadvantages are
between the above an document and one that like this:

Field Value
IMAGE 1
COLORS FF0000 FF0000 FF0000 FF0000 00FF00 00FF00
0000FF 808080

In fact with all the color values stored in a single
field, if I did some sort of pixel region
mapping/resolution reduction then wouldn't it lend
itself to Span/Fuzzy queries so I could not only
search for images which have both red and green, but
rank them based on how close the colors are to each
other? Would I lose that capability if I stored each
color in its own field? So perhaps a more general
question is when is it better to collapse a bunch of
words into a single field vs. spread them out over
many fields, all of which have the same field name?

> If you index both the full color codes as well as
> just the most
> significant hex characters from the RGB code in
> seperate fields, you can
> do "coarse" faceting on one field, and "refined"
> faceting on the other...
> 
> IMAGE:  1
> COLOR: FFFFFE FFA3B4 2287C3 773442 666BED
> COARSE_COLOR: FFF FAB 28C 734 66E

That's a great idea and seems way more straightforward
than the RGB/HSV etc. distance calculation algorithms
I've been reading about :o) I'll have to run some
tests to see how accurate that reduction appears to
people.

> Usually, you need to do one pass over your values
> and one pass over
> your documents -- with a HitCollector you do a pass
> over your values first
> Alternately, you can do one pass over
> your documents building
> a BitSet of the docs that match, and *then* do a
> pass over the values

Huh, perhaps I don't understand the HitCollector
fully. Are you saying that if I have an index with 100
documents, each of which have a color field (let's say
25 of the documents have FFFFFF in the color field)
and I do a search for FFFFFF...using a HitCollector
I'm iterating over all 100 documents while extracting
values whereas with the regular Hits based search I
would only be iterating over 25?

Thanx again.

JAMES

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message