lucenenet-user mailing list archives

From Erik Hatcher
Subject Re: index publication articles
Date Mon, 13 May 2019 18:16:41 GMT
I’m late as well. My suggestion is to use Solr and the Solr Tagger. Index the terms into
a collection. Send docs to the tagger endpoint and it’ll tag ‘em, giving you the locations
and terms.
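
A minimal sketch of what a request to the tagger endpoint could look like. It assumes a Solr instance at localhost:8983, a collection named cancer_terms that has the /tag request handler configured, and stored fields id and name on the term documents; all of these names are placeholders, not taken from this thread. The code only builds the request, since sending it needs a running Solr.

```python
import urllib.parse
import urllib.request


def build_tag_request(text, base_url="http://localhost:8983/solr",
                      collection="cancer_terms"):
    """Build (but do not send) a POST to the Solr Tagger handler.

    base_url and collection are hypothetical; the /tag handler has to be
    configured for the collection before this would work.
    """
    params = urllib.parse.urlencode({
        "overlaps": "NO_SUB",  # drop tags fully contained in a longer tag
        "matchText": "true",   # ask for the matched text in each tag
        "fl": "id,name",       # assumed stored fields of the term documents
        "wt": "json",
    })
    url = f"{base_url}/{collection}/tag?{params}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


req = build_tag_request("Patients with melanoma were enrolled.")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send the article text as the body; the tagger response lists each tag with its start and end offsets and the ids of the matching term documents.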


> On May 13, 2019, at 14:08, Andy Pook wrote:
> A little late to this party...
> Another approach is to add a custom tokenizer. This will add an extra token
> (with a special word, like "ccc") at the same position when it hits one of
> your key words or phrases. As a result you can just search for "ccc" and this
> will return all docs that contain any of your words. You also have an
> index where you can do general searches, perhaps in combination with the
> special token (such as "ccc AND ufo" to find out why UFOs cause cancer :)
> At a previous gig we had a whole taxonomy of words and phrases that were
> tagged this way. Then searches could be made on concepts and abstractions
> rather than complex combinations of brackets, ANDs and ORs.
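
The marker-token idea above can be sketched in a few lines of plain Python. This is an illustration of the technique only, not Lucene.Net code: the function names are made up, the keyword set is a stand-in for the real term list, and only single-word keywords are handled (phrases would need a lookahead). In Lucene itself the equivalent is usually a token filter that emits the extra token with a position increment of 0.

```python
import re
from collections import defaultdict

MARKER = "ccc"  # the special word used in the message above


def tokenize_with_marker(text, keywords):
    """Yield (position, token); a keyword hit also yields MARKER at the same position."""
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        token = match.group()
        yield position, token
        if token in keywords:
            yield position, MARKER


def build_index(docs, keywords):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, token in tokenize_with_marker(text, keywords):
            index[token].add(doc_id)
    return index


keywords = {"melanoma", "carcinoma"}  # stand-in for the real term list
docs = {
    1: "Melanoma incidence and ufo sightings",
    2: "A study of coffee makers",
    3: "Carcinoma staging guidelines",
}
index = build_index(docs, keywords)
print(sorted(index[MARKER]))                 # docs with any keyword: [1, 3]
print(sorted(index[MARKER] & index["ufo"]))  # "ccc AND ufo": [1]
```

The point of the design is that the cost of matching the word list is paid once at index time, so the query side stays a single cheap term lookup.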
>> On Fri, 29 Mar 2019 at 21:39, Morgenweck wrote:
>> Thanks to everyone. Because it is a set number of documents (about 1000)
>> and a set number of words (8000+), and time does not matter initially, I'm
>> going to go with Regex. I found a company that will donate their PDF
>> extraction software and will work with me on developing the Regex. The number
>> of hits for each word will be stored as metadata for each of the
>> articles. Since I have total control over the words and it only needs to be run
>> with each word, I can run this as a job or a nightly process and save
>> the data. Once done, it's not used until a new article appears, and then only for
>> that one.
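
A rough sketch of the regex counting pass described above, assuming the article text has already been extracted from the PDF and the terms are supplied in lowercase. The function name and sample data are hypothetical. One combined pattern is compiled for the whole term list, so each article is scanned once rather than once per word.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term in text."""
    # Longest terms first, so "breast cancer" wins over "cancer" at the same spot.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return Counter(m.group().lower() for m in pattern.finditer(text))


text = "Cancer cells. Breast cancer and lung cancer; cancers vary."
counts = count_terms(text, ["cancer", "breast cancer"])
print(counts["cancer"], counts["breast cancer"])  # 2 1
```

Two caveats in this sketch: matches do not overlap, so the "cancer" inside "breast cancer" is counted only for the longer term, and the plural "cancers" is not counted at all because the match has to be exact.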
>> In regards to the nuclear reactor coffee maker: I loved it, but did you
>> ever have that feeling that you are just missing something? And what I was
>> thinking in the back of my mind is what Lang said: index the 8000
>> words. I do plan on doing this in the next step, where a new researcher will
>> come to the Cancer Center and search for words that they create rather
>> than being limited to the 8000. That way they can find other
>> researchers who work in the same type of area. That process is where
>> will shine.
>> Thank you all again
>>> On Fri, Mar 29, 2019 at 10:48 AM Jörg Lang wrote:
>>> Hi
>>> I wouldn't go with a regex, because it only registers a hit if the match is
>>> 100%.
>>> Using Lucene you can assign a language analyzer when indexing the documents.
>>> When doing searches for your keywords you get hits for plural/singular forms,
>>> and even verb conjugations are considered.
>>> This of course comes at the cost that you might get a few hits that you
>>> personally wouldn't mark as a hit. But this is the general price of a
>>> full text search.
>>> An idea worth exploring:
>>> - Create a document with your list of 8000 terms.
>>> - Have it indexed along with all the other documents.
>>> - Do a "more like this" query, giving your "terms" document as input.
>>> - You get a list of documents that contain words similar to those in the
>>>   source document, with the most relevant documents ranked first.
>>> You can read about "moreLikeThis" here
>>> This might also give you some input.
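
The steps listed above can be sketched roughly as follows. This is a simplified stand-in for the idea behind a "more like this" query, not Lucene's implementation: it picks the highest-weighted terms of the source document with a basic tf-idf formula and ranks the other documents by how many of those terms they contain. All names and the sample data are made up for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"\w+", text.lower())


def more_like_this(source, docs, max_query_terms=25):
    """Rank docs by overlap with the highest tf-idf terms of source."""
    doc_tokens = {doc_id: set(tokens(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for toks in doc_tokens.values() if term in toks)
        return math.log((n_docs + 1) / (df + 1)) + 1.0

    tf = Counter(tokens(source))
    weights = {term: tf[term] * idf(term) for term in tf}
    query = sorted(weights, key=weights.get, reverse=True)[:max_query_terms]

    scores = {
        doc_id: sum(weights[t] for t in query if t in toks)
        for doc_id, toks in doc_tokens.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


terms_doc = "melanoma carcinoma lymphoma tumor"
docs = {
    "a": "Tumor growth in melanoma and carcinoma patients",
    "b": "Brewing coffee with a reactor",
    "c": "Lymphoma case report",
}
ranking = more_like_this(terms_doc, docs)
print([doc_id for doc_id, _ in ranking])  # ['a', 'c']
```

Unlike the regex approach, this yields a relevance ranking rather than per-word counts, so it answers "is this article cancer related?" more directly than "how often does each term appear?".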
>>> Joerg
>>> -----Original Message-----
>>> From: Morgenweck, William
>>> Sent: Friday, 29 March 2019 02:48
>>> To:
>>> Subject: index publication articles
>>> I need to ask this question because I think it might be something
>>> Lucene.Net can do, but I'm not sure. I have a list of 8,000+ words that are
>>> considered Cancer Terms by the NCI.
>>> I have the terms stored locally, but I need to index articles that I have
>>> downloaded and count the number of times each word appears in each article.
>>> The purpose of this is to determine whether the article is cancer related. I
>>> work for an NCI Designated Cancer Center and I need a way to analyze the
>>> publications of our researchers who are members of our Cancer Center. I know
>>> that a slow way to do this is to loop over each and every word and see if
>>> indexof gives a positive result, or I have found a suggestion of creating a
>>> match criteria using Regex with all 8,000 words.
>>> But I feel that if I Index the Cancer Terms using I should be
>>> able to do the same thing but faster????
>>> If I'm totally off the mark just let me know. I've been on the user group
>>> for over 15 years and love the potential.
>>> Thanks,
>>> Bill
>>> -------------------------------------------------------------------------
>>> This message was secured via TLS by MUSC.
