lucene-general mailing list archives

From Ted Dunning <>
Subject Re: Subqueries / multivalued fields / huge documents
Date Fri, 05 Feb 2010 16:12:14 GMT
Lucene can index large documents, but I think that your basic approach is
going to have problems with the fact that user agent strings are much too
unique to be useful.  Most of your user agent strings will match only one or
a very few users.  This means that until a user has actually watched a
video, you will have no data for their user agent string.  Exact matching is
going to be a disaster.

On the other hand, approximate matching is also problematic, because you
will have cases where, say, Firefox is good, but not on the Mac with some
other strange software reflected in the user agent.

It seems like what you need is four pieces:

a) a set of feature extraction functions that find interesting combinations
of features in a user agent string.  This can be a combination of
hand-written rules and more general tokenization.  You are unlikely to find
something in the wild that satisfies your needs here.
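To make (a) concrete, here is a minimal sketch in Java. The specific rules, feature names, and tokenization scheme are illustrative assumptions, not an existing library; the point is just that hand-written rules and generic tokens can land in one feature bag:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UaFeatures {
    // Hand-written rules plus generic tokenization: both kinds of feature
    // go into one bag so a downstream learner can weight them separately.
    public static List<String> extract(String ua) {
        List<String> feats = new ArrayList<>();
        String lower = ua.toLowerCase(Locale.ROOT);
        // Hand-written rules for a few families (illustrative, not exhaustive).
        if (lower.contains("firefox")) feats.add("family:firefox");
        if (lower.contains("applewebkit")) feats.add("engine:webkit");
        if (lower.contains("android")) feats.add("os:android");
        if (lower.contains("mobile")) feats.add("form:mobile");
        // Generic tokenization: split on non-alphanumeric runs (keep dots so
        // version numbers like 5.0 survive as single tokens).
        for (String tok : lower.split("[^a-z0-9.]+")) {
            if (!tok.isEmpty()) feats.add("tok:" + tok);
        }
        return feats;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "Mozilla/5.0 (Linux; U; Android 1.5) AppleWebKit/528.5+ Mobile Safari"));
    }
}
```

In practice you would keep growing the rule section as you discover user agent quirks, while the token features catch whatever the rules miss.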

b) a set of feature extraction functions for videos that take what is known
about a video and extract important features.  These features should
probably represent the length, encoder and settings for the video.
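For (b), the same idea applied to video metadata might look like this. The field names and bucket boundaries are made-up placeholders; use whatever is actually known about your videos:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class VideoFeatures {
    // Hypothetical metadata fields; bucket continuous values so a linear
    // learner can use them as categorical features.
    public static List<String> extract(String codec, int durationSeconds, int bitrateKbps) {
        List<String> feats = new ArrayList<>();
        feats.add("codec:" + codec.toLowerCase(Locale.ROOT));
        feats.add("length:" + (durationSeconds < 60 ? "short"
                : durationSeconds < 600 ? "medium" : "long"));
        feats.add("bitrate:" + (bitrateKbps < 500 ? "low"
                : bitrateKbps < 2000 ? "mid" : "high"));
        return feats;
    }
}
```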

c) a set of learned functions that predict compatibility for some number of
videos based on features from (a).  These functions would be learned by
running video feature/user agent features/compatibility triplets for the
videos that are thought to be compatible with this function into a learning
algorithm such as Vowpal Wabbit.  You might get away with one function here
or you might need several.  You may also need to combine features into
composite features here.
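One way to feed such triplets to Vowpal Wabbit is its plain text input format: a label, then one namespace per feature group. A rough sketch of formatting one example (the `ua` and `video` namespace names are arbitrary; colons are rewritten because VW reserves `:` for numeric feature values):

```java
import java.util.List;

public class VwExample {
    // Emit one VW text-format training line for a
    // (user agent features, video features, compatibility) triplet.
    public static String line(boolean compatible, List<String> uaFeats, List<String> videoFeats) {
        StringBuilder sb = new StringBuilder(compatible ? "1" : "-1");
        sb.append(" |ua");
        for (String f : uaFeats) sb.append(' ').append(f.replace(' ', '_').replace(':', '='));
        sb.append(" |video");
        for (String f : videoFeats) sb.append(' ').append(f.replace(' ', '_').replace(':', '='));
        return sb.toString();
    }
}
```

Keeping user agent and video features in separate namespaces also lets VW build the composite (crossed) features mentioned above with its quadratic-interaction options, rather than you crossing them by hand.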

d) something that classifies videos according to which compatibility
predictor from (c) should be used.  If you only have one compatibility
function, then this is trivial.  You might try hand splitting your videos
into relatively fine groups and then merging groups where the functions from
one group predict compatibility for the other group pretty well.
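The merging test in (d) could be as simple as this sketch, assuming you have already measured each group's predictor against each other group's held-out examples:

```java
public class GroupMerge {
    // accuracy[i][j] = accuracy of group i's predictor on group j's held-out
    // examples (assumed precomputed elsewhere). Merge only when the pair is
    // symmetrically good: each predictor works well on the other's data.
    public static boolean shouldMerge(double[][] accuracy, int i, int j, double threshold) {
        return accuracy[i][j] >= threshold && accuracy[j][i] >= threshold;
    }
}
```

Applying this greedily over all pairs until nothing merges gives a coarse, stable set of groups, one predictor each.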

It is quite possible that you can use Lucene for step (c) because tf-idf
retrieval of examples can be an effective learning algorithm, but I really
expect that you will need to build (a), (b) and (d) for use during the
indexing process.  For text, the equivalent-ish functions are the analysers
and tokenizers.  Just as some text needs stemming or synonym expansion, you
need some analogous steps to expose the most informative aspects of your
data to whatever learning system you use.
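To illustrate why retrieval can act as a learner: index the labeled examples, retrieve the ones most similar to a new user agent, and vote their labels. This toy uses raw feature overlap in place of Lucene's tf-idf scoring, but the nearest-neighbor shape of the idea is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KnnByOverlap {
    // Toy stand-in for tf-idf retrieval: rank stored labeled examples by
    // feature overlap with the query, then majority-vote the top k labels.
    public static boolean predict(List<Map.Entry<Set<String>, Boolean>> examples,
                                  Set<String> query, int k) {
        List<Map.Entry<Set<String>, Boolean>> ranked = new ArrayList<>(examples);
        ranked.sort((a, b) -> Integer.compare(overlap(b.getKey(), query),
                                              overlap(a.getKey(), query)));
        int votes = 0;
        int n = Math.min(k, ranked.size());
        for (int i = 0; i < n; i++) votes += ranked.get(i).getValue() ? 1 : -1;
        return votes > 0;
    }

    private static int overlap(Set<String> a, Set<String> b) {
        int c = 0;
        for (String s : a) if (b.contains(s)) c++;
        return c;
    }
}
```

With a real index, "examples" would be documents whose fields hold the extracted features and whose stored label is the known compatibility outcome.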

On Fri, Feb 5, 2010 at 2:40 AM, Niclas Rothman <> wrote:

> Hi there, I'm facing a problem that I'm having difficulties solving, and
> I'm wondering if any of you could point me in the right direction.
> I have an index storing information about media like videos.
> Every video object has information about which browsers it is compatible
> with, e.g. video 1 is compatible with Firefox, Internet Explorer and so
> forth.
> An example of such document can be visualized like:
> <doc>
>                <media>
>                                <id>12345</id>
>                                <title>A title</title>
>                                <description>My description</description>
>                                <useragents>
>                                                <!-- The useragents element
> can contain up to 15000 entries!!!! -->
>                                                Mozilla/5.0 (Linux; U;
> Android 1.5; en-gb; HTC Magic Build/CRB43) AppleWebKit/528.5+ (KHTML, like
> Gecko) Version/3.1.2 Mobile Safari/525.20.1
>                                                Mozilla/4.0
> SonyEricssonW910iv/R1CA Browser/NetFront/3.4 Profile/MIDP-2.1
> Configuration/CLDC-1.1 UP.Link/
>                                                NokiaN81
>                                                .
>                                                .
>                                                .n
>                                </useragents>
>                </media>
> </doc>
> Apart from finding media objects that are relevant to a user's search,
> e.g. where title equals "A title", I may only serve / display items that
> are compatible with the user's browser's useragent, e.g. find all media
> objects that are compatible with Firefox.
> Problems:
> 1.       One media object can contain up to 15000 useragent entries; can I
> index this with decent performance?
> 2.       The useragent values can be "partial" or wildcarded, e.g.
> "*NokiaN81*", meaning that the media object is compatible with all browsers
> whose useragent contains "NokiaN81".
> 3.       Can I get decent performance out of this with FilterQueries or
> RangeQueries, and how?
> Any help very much appreciated!!!!
> Regards
> Magnus

Ted Dunning, CTO
