lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Pine <general_n...@yahoo.com>
Subject faceting and categorizing on color?
Date Fri, 16 Jun 2006 22:29:07 GMT
Hey Everyone,

I have been reading several threads about facet counts
and category counts and was wondering if/how they
might apply to searching for colors. Let's say that
there is a Lucene index where each document
corresponds to an image. In addition, each document
contains the top 10 most frequently occurring colors 
(hex values) in its associated image along with some
other axillary data. So a few example documents might
look like this:

Field	Value
IMAGE	1
COLORS	F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF
E00000 EE0000 EEE000 EEEE00
PIXELS	10000
FORMAT	JPG

Field	Value
IMAGE	2
COLORS	F00000 F0F000 F0FF00 F0FFF0 F0FFFF FFFFFF
E00000 E0E000 E0EE00 E0EEE0
PIXELS	50000
FORMAT	JPG

Field	Value
IMAGE	3
COLORS	F00000 F00F00 F00FF0 F00FFF FFFFFF E00000
E0E000 E0EE00 E0EEE0 E0EEEE
PIXELS	1000
FORMAT	PNG

Field	Value
IMAGE	4
COLORS	F00000 F00F00 F00FF0 F00FFF CCCCCC E00000
E0E000 E0EE00 E0EEE0 E0EEEE
PIXELS	20000
FORMAT	GIF

If I setup the COLORS field such that it is of type
Field.Keyword, then I can construct a query
COLORS:FFFFFF with a StandardAnalyzer that returns the
first 3 documents. The additional bit of information
I'd like to calculate/return is the list of colors the
images share, in order of most frequent to least
frequent i.e.

F00000, 3 occurrences
E00000, 3 occurrences
E0E000, 2 occurrences
E0EE00, 2 occurrences
FF0000, 1 occurrence
...

I could use a custom HitCollector to collect and order
this information, but it would cause a sizable
performance hit.This thread from last year:
http://www.nabble.com/forum/ViewTopic.jtp?topic=266441&tview=threaded

Seems to suggest the creation of a String[] of colors
and then iterate over it using a QueryFilter, AND'ing
the result sets bits with the filtered bits, sorting
based on cardinality. That works in my tests with a
limited number of colors, but my String[] of colors
could contain 16.7M values, and most certainly will
contain at least several million.

I read up on BitSets and Solr's use of them as well in
this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/35427?do=post_view_threaded

But I think that the solutions suggested assume
several hundred facets i.e. price, manufacturer, size,
etc. each one having a few hundred possible values.
Those would make sense so that I could provide
filtering based on FORMAT or number of PIXELS, but
again, don't seem to scale in the colors example.

Am I mistaken in thinking that my problem is similar
to faceting/categorization/rollup, just on a different
scale? I suppose I could also structure my document
such that it looked like this:

Field	Value
IMAGE	1
COLOR	F00000
COLOR	FF0000
COLOR	FFF000
COLOR	FFFF00
COLOR	FFFFF0
COLOR	FFFFFF
COLOR	E00000
COLOR	EE0000
COLOR	EEE000
COLOR	EEEE00
PIXELS	10000
FORMAT	JPG

That would allow me to do something with TermEnums,
but I don't see how it would be any better than using
a HitCollector. Does the above document structure make
things any better/worse/the same compared to the
initial document structure? If the answer to my
question is trivial would anything change if I wanted
to weight the colors. For example, if the initial
document looked like:

Field	Value
IMAGE	1
COLORS	10-F00000 6-FF0000 3-FFF000 9-FFFF00 7-FFFFF0
1-FFFFFF 2-E00000 4-EE0000 8-EEE000 5-EEEE00
PIXELS	10000
FORMAT	JPG

That would be interpreted as FFFFFF is the most
dominant of the top 10 most dominant colors and F00000
is the least. Would that be a terrible document
structure?

Thanx in advance, especially for wading through this
entire post :o)

JAMES

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message