lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Creating facets based on the content field
Date Mon, 23 Mar 2015 23:50:02 GMT
I wasn't talking about using NLP at query time. I was trying to convey
that perhaps NLP processing on documents at _index_ time could reduce
the number of distinct tokens you then facet over at query time.

The basic caution still stands: faceting on high-cardinality fields
is expensive. Try your queries on a corpus that's representative of
your final corpus in terms of size before deciding whether it's fast
enough and will work on your hardware.

Best,
Erick
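
As a rough illustration of the index-time idea above, the Python sketch
below compares how many distinct values you get from the raw tokens of
a text with how many you get from a small set of extracted entities.
The KNOWN_ENTITIES list, the function names and the sample documents
are hypothetical stand-ins for a real NER step (such as the Stanford
NLP approach mentioned further down the thread); none of this is Solr
code.

```python
import re

# Hypothetical stand-in for the output of an NER tool; a real pipeline
# would extract these names from the text rather than use a fixed list.
KNOWN_ENTITIES = ["Churchill", "Gandhi", "Einstein"]


def raw_tokens(text):
    """Distinct lowercase word tokens: what faceting on the full text sees."""
    return set(re.findall(r"\w+", text.lower()))


def extract_entities(text, entities=KNOWN_ENTITIES):
    """Distinct entity names found in the text, for a separate facet field."""
    return {
        name
        for name in entities
        if re.search(rf"\b{re.escape(name)}\b", text)
    }


# Made-up sample documents modelled on the examples in this thread.
docs = [
    "blablabla Churchill blablabla and many other words",
    "blablabla Gandhi some filler text Churchill blablabla",
]

all_tokens = set().union(*(raw_tokens(d) for d in docs))
all_entities = set().union(*(extract_entities(d) for d in docs))

print(len(all_tokens), "distinct raw tokens")
print(len(all_entities), "distinct entities:", sorted(all_entities))
```

On a real corpus the gap between the two counts would be far larger,
which is the point of moving the work to index time.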

On Mon, Mar 23, 2015 at 2:20 PM, Philippe de Rochambeau <phiroc@free.fr> wrote:
> Hi Erick,
> can you use NLP for query-time faceting? How?
> Moreover, can you use it to find keyword patterns?
> Cheers,
> Philippe
>
>
>> On 23 March 2015 at 18:44, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>> Be a little careful here about memory. Faceting on high-cardinality
>> fields is a very good way to encounter OOM and/or performance
>> problems.
>>
>> But you're right: in Solr, faceting is a query-time construct and needs
>> nothing at index time. The NLP stuff can help narrow down the number
>> of unique values in the field you're faceting on.
>>
>> Best,
>> Erick
>>
>>> On Mon, Mar 23, 2015 at 9:41 AM,  <phiroc@free.fr> wrote:
>>> I just want a list of recurring words (for now).
>>>
>>> I removed the manually-created facets from solrconfig.xml and Solr
>>> "automagically" created a facet list for me.
>>>
>>> But thanks for your suggestions.
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Charlie Hull" <charlie@flax.co.uk>
>>> To: solr-user@lucene.apache.org
>>> Sent: Monday, 23 March 2015 17:26:18
>>> Subject: Re: Creating facets based on the content field
>>>
>>>> On 23/03/2015 16:08, phiroc@free.fr wrote:
>>>> Let's say that one pdf has the following contents:
>>>
>>> Aren't you thinking of Named Entity Recognition? We've used Stanford NLP
>>> for this in the past and it's quite good at People, Places and
>>> Organisations out of the box (needs tuning for other classes of
>>> entities). You can then add these entities as metadata to your document
>>> objects and index them so you can facet on them appropriately.
>>>
>>> Cheers
>>>
>>> Charlie
>>>>
>>>> "[thousands of characters] blablabla Churchill blablabla [thousands of text characters]"
>>>>
>>>> ... and another PDF contains:
>>>>
>>>> "[thousands of characters] blablabla Gandhi [thousands of characters] Churchill blablabla [thousands of text characters]"
>>>>
>>>> As you can see, these two PDFs contain keywords that are potential
>>>> candidates for facets (e.g. Churchill, Gandhi, ...), but I have no
>>>> way of knowing that when adding facets to the solrconfig.xml file,
>>>> unless I read all the PDFs (which will take me years) and compile a
>>>> list of often-occurring words and names.
>>>>
>>>> The fallback solution is therefore to guess the keywords that are
>>>> likely to appear in the PDFs; e.g.:
>>>>
>>>>                                 <str name="facet.query">Aircraft</str>
>>>>                                 <str name="facet.query">Armistice</str>
>>>>                                 <str name="facet.query">Austria</str>
>>>>                                 <str name="facet.query">Bolshevik</str>
>>>>                                 <str name="facet.query">Britain</str>
>>>>                                 <str name="facet.query">British</str>
>>>>                                 <str name="facet.query">Charlie Chaplin</str>
>>>>                                 <str name="facet.query">Clemenceau</str>
>>>>                                 <str name="facet.query">Einstein</str>
>>>> ...
>>>>
>>>>
>>>> However, how can I be sure that these facets will be useful to the
>>>> other 'core' users? For instance, let's say that one user is more
>>>> interested in Gandhi than Einstein: the "Einstein" facet is
>>>> therefore useless to him and a "Gandhi" facet is missing from
>>>> solrconfig.xml.
>>>>
>>>> Is there a way to dynamically generate a list of facets based on
>>>> words contained in the content field?
>>>>
>>>> Cheers,
>>>>
>>>> Philippe
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Erik Hatcher" <erik.hatcher@gmail.com>
>>>> To: solr-user@lucene.apache.org
>>>> Sent: Monday, 23 March 2015 16:30:49
>>>> Subject: Re: Creating facets based on the content field
>>>>
>>>> Philippe - can you provide a concrete example of what you mean by
>>>> creating facets on a field’s content? Or maybe rather, what’s
>>>> missing from doing &facet.field=content currently?
>>>>
>>>>     Erik
>>>>
>>>>
>>>>
>>>>
>>>>> On Mar 23, 2015, at 10:48 AM, phiroc@free.fr wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> let's say that you have indexed hundreds of PDFs using the following
>>>>> curl command:
>>>>>
>>>>> curl -Ss -X POST 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>>>>>
>>>>> The PDF's contents are now stored in core0's "content" field.
>>>>>
>>>>> I wonder how you create facets based on the field's contents, if you
>>>>> don't know in advance what it contains (unless you have compiled a
>>>>> list of frequently-occurring words in the PDFs, after reading them).
>>>>>
>>>>> Many thanks.
>>>>>
>>>>> Philippe
>>>
>>>
>>> --
>>> Charlie Hull
>>> Flax - Open Source Enterprise Search
>>>
>>> tel/fax: +44 (0)8700 118334
>>> mobile:  +44 (0)7767 825828
>>> web: www.flax.co.uk
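
For reference, the query-time request that Erik Hatcher alludes to
(&facet.field=content) might be assembled roughly as in the Python
sketch below. facet, facet.field, facet.limit and facet.mincount are
standard Solr faceting parameters. The host, port and core name are
taken from the curl command in the original question; the /select
handler path and the limit and mincount values are assumptions. The
sketch only builds the URL and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Host, port and core come from the curl command quoted in this thread;
# the /select handler path is an assumption about the core's setup.
SOLR_SELECT = "http://mysolr:8990/solr/core0/select"


def facet_url(field, limit=20, mincount=2):
    """Build a select URL that asks only for facet counts on one field."""
    params = {
        "q": "*:*",                  # match every document
        "rows": 0,                   # no documents, just facet counts
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,        # arbitrary cap on returned terms
        "facet.mincount": mincount,  # arbitrary floor: skip rare terms
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = facet_url("content")
print(url)

# Decode the query string again to confirm what would be sent.
sent = parse_qs(urlsplit(url).query)
print(sent["facet.field"], sent["facet.limit"])
```

Capping the result with facet.limit and facet.mincount keeps the
response small, but it does not remove the memory cost of faceting on
a high-cardinality field that Erick warns about.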
