lucenenet-user mailing list archives

From Erik Hatcher <erik.hatc...@gmail.com>
Subject Re: index publication articles
Date Mon, 13 May 2019 18:16:41 GMT
I’m late as well.   My suggestion is to use Solr and the Solr Tagger.   Index the terms into
a collection.  Send docs to the tagger endpoint and it’ll tag ‘em giving you the location
and terms.  
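
To make it concrete, posting a doc to the tagger handler can look something like
this (untested Java sketch; I'm assuming a collection named "cancer_terms" with
the tagger request handler registered at /tag and a stored field named "term" --
adjust to however you set it up):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TagOneDoc {
        public static void main(String[] args) throws Exception {
            // Collection name, handler path and fl fields are assumptions.
            String url = "http://localhost:8983/solr/cancer_terms/tag"
                       + "?overlaps=NO_SUB&tagsLimit=5000&fl=id,term&wt=json";
            String article = Files.readString(Path.of("article.txt"));

            HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "text/plain")
                    .POST(HttpRequest.BodyPublishers.ofString(article))
                    .build();

            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString());

            // The response lists each matched term with its offsets in the text.
            System.out.println(resp.body());
        }
    }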

   Erik

> On May 13, 2019, at 14:08, Andy Pook <andy.pook@gmail.com> wrote:
> 
> A little late to this party...
> 
> Another approach is to add a custom tokenizer. This will add an extra token
> (with a special word, like "ccc") at the same position when it hits one of
> your key words or phrases. As a result you can just search for "ccc", which
> will then return all docs that contain any of your words. You also have an
> index where you can do general searches, perhaps in combination with the
> special token (such as "ccc AND ufo" to find out why UFOs cause cancer :)
>
> At a previous gig we had a whole taxonomy of words and phrases that were
> tagged this way. Then searches could be made on concepts and abstractions
> rather than complex combinations of brackets, ANDs and ORs.
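>
> Something along these lines gets the effect without writing a tokenizer from
> scratch (untested Java sketch using Lucene's SynonymGraphFilter to inject the
> marker token at the same position; class and field names are just examples):
>
>     import java.util.Locale;
>
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.LowerCaseFilter;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.Tokenizer;
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
>     import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
>     import org.apache.lucene.analysis.synonym.SynonymMap;
>     import org.apache.lucene.util.CharsRef;
>
>     // Emits an extra "ccc" token at the same position as any keyword,
>     // so a search for "ccc" matches every doc containing any keyword.
>     public class KeywordMarkerAnalyzer extends Analyzer {
>         private final SynonymMap map;
>
>         public KeywordMarkerAnalyzer(Iterable<String> keywords) throws Exception {
>             SynonymMap.Builder builder = new SynonymMap.Builder(true);
>             for (String kw : keywords) {
>                 // single-word terms; multi-word phrases need SynonymMap.Builder.join(...)
>                 builder.add(new CharsRef(kw.toLowerCase(Locale.ROOT)),
>                             new CharsRef("ccc"), true); // true = keep the original term too
>             }
>             this.map = builder.build();
>         }
>
>         @Override
>         protected TokenStreamComponents createComponents(String fieldName) {
>             Tokenizer source = new StandardTokenizer();
>             TokenStream sink = new LowerCaseFilter(source);
>             sink = new SynonymGraphFilter(sink, map, true);
>             return new TokenStreamComponents(source, sink);
>         }
>     }
>
> (SynonymGraphFilter just saves writing the position-increment handling by
> hand; a hand-rolled TokenFilter works the same way.)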
> 
>> On Fri, 29 Mar 2019 at 21:39, Morgenweck <morgenweck@gmail.com> wrote:
>> 
>> Thanks to everyone -- because it is a set number of documents (about 1,000)
>> and a set number of words (8,000+), and time does not matter, I'm going to
>> go with Regex initially.  I found a company
>> (https://bytescout.com/we-fight-against-cancer) that will donate their PDF
>> extraction software and will work with me on developing the Regex.  The
>> number of hits for each word will be stored as metadata for each of the
>> articles.  Since I have total control over the words and it only needs to
>> be run once per word, I can run this as a job or a nightly process and
>> save the data.  Once done it isn't run again until a new article appears,
>> and then only for that one.
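>>
>> For my own notes, this is roughly the shape I have in mind (untested Java
>> sketch; the real term list will come from the NCI file, and the article text
>> comes out of the PDF extraction):
>>
>>     import java.util.Comparator;
>>     import java.util.HashMap;
>>     import java.util.List;
>>     import java.util.Locale;
>>     import java.util.Map;
>>     import java.util.regex.Matcher;
>>     import java.util.regex.Pattern;
>>     import java.util.stream.Collectors;
>>
>>     public class RegexTermCounter {
>>         // Counts how many times each term occurs in one article's extracted text.
>>         public static Map<String, Integer> count(String article, List<String> terms) {
>>             // One big alternation, longest terms first so that multi-word
>>             // phrases win over their single-word prefixes.
>>             String alternation = terms.stream()
>>                     .sorted(Comparator.comparingInt(String::length).reversed())
>>                     .map(Pattern::quote)
>>                     .collect(Collectors.joining("|"));
>>             Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b",
>>                     Pattern.CASE_INSENSITIVE);
>>
>>             Map<String, Integer> counts = new HashMap<>();
>>             Matcher m = p.matcher(article);
>>             while (m.find()) {
>>                 counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
>>             }
>>             return counts;
>>         }
>>     }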
>> 
>> Regarding the nuclear reactor coffee maker: I loved it, but did you ever
>> have that feeling that you are just missing something?  What I was thinking
>> in the back of my mind is what Jörg Lang said: index the 8,000 words.
>> I do plan on doing this in the next step, where a new researcher will come
>> to the Cancer Center and search for words that they create rather than
>> being limited to the 8,000, and through those topics find other researchers
>> who work in the same type of area.  That process is where Lucene.net will
>> shine.
>> 
>> Thank you all again
>> 
>>> On Fri, Mar 29, 2019 at 10:48 AM Jörg Lang <jlang@evelix.ch> wrote:
>>> 
>>> Hi
>>> 
>>> I wouldn't go with a regex, because it only gives a hit if the match is
>>> 100%.  Using Lucene you can assign a language analyzer when indexing the
>>> documents.  When searching for your keywords you then get hits for
>>> plural/singular forms, and even inflected verb forms are considered.
>>>
>>> This of course comes at the cost that you might get a few hits which you
>>> personally wouldn't count as hits.  But that is the general price of a
>>> full-text search.
>>> 
>>> An idea worth exploring:
>>> - Create a document with your list of 8000 terms.
>>> - Have it indexed, along with all the other documents.
>>> - Do a "more like this" query, giving your "terms" document as input.
>>> - You get a list of documents that contain similar words to the source
>>> document, with the most relevant documents ranked first.
>>> 
>>> You can read about "MoreLikeThis" here:
>>>
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
>>>
>>> https://lucene.apache.org/core/7_3_1/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
>>>
>>> This might also give you some input.
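>>>
>>> A rough sketch of the MoreLikeThis idea in Java (untested; the index path
>>> and the "body"/"title" field names are placeholders for whatever you use):
>>>
>>>     import java.io.StringReader;
>>>     import java.nio.file.Files;
>>>     import java.nio.file.Paths;
>>>
>>>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>     import org.apache.lucene.index.DirectoryReader;
>>>     import org.apache.lucene.index.IndexReader;
>>>     import org.apache.lucene.queries.mlt.MoreLikeThis;
>>>     import org.apache.lucene.search.IndexSearcher;
>>>     import org.apache.lucene.search.Query;
>>>     import org.apache.lucene.search.ScoreDoc;
>>>     import org.apache.lucene.store.FSDirectory;
>>>
>>>     public class MoreLikeTermsList {
>>>         public static void main(String[] args) throws Exception {
>>>             IndexReader reader = DirectoryReader.open(
>>>                     FSDirectory.open(Paths.get("article-index")));
>>>             IndexSearcher searcher = new IndexSearcher(reader);
>>>
>>>             MoreLikeThis mlt = new MoreLikeThis(reader);
>>>             mlt.setAnalyzer(new StandardAnalyzer());
>>>             mlt.setFieldNames(new String[] { "body" }); // field holding the article text
>>>             mlt.setMinTermFreq(1);
>>>             mlt.setMinDocFreq(1);
>>>
>>>             // Use the file with the 8000 terms as the "like this" input.
>>>             String termsText = Files.readString(Paths.get("cancer-terms.txt"));
>>>             Query q = mlt.like("body", new StringReader(termsText));
>>>
>>>             for (ScoreDoc hit : searcher.search(q, 20).scoreDocs) {
>>>                 System.out.println(searcher.doc(hit.doc).get("title")
>>>                         + "  score=" + hit.score);
>>>             }
>>>             reader.close();
>>>         }
>>>     }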
>>> 
>>> Joerg
>>> 
>>> 
>>> -----Original Message-----
>>> From: Morgenweck, William <morgenww@musc.edu>
>>> Sent: Friday, 29 March 2019 02:48
>>> To: user@lucenenet.apache.org
>>> Subject: index publication articles
>>> 
>>> I need to ask this question because I think it might be something
>>> Lucene.Net can do, but I'm not sure.  I have a list of 8,000+ words that
>>> are considered Cancer Terms by the NCI
>>> (https://www.cancer.gov/publications/dictionaries/cancer-terms?expand=A).
>>> I have the terms stored locally, but I need to index articles that I have
>>> downloaded and count the number of times each word appears in each
>>> article.  The purpose of this is to determine whether an article is
>>> Cancer Related.  I work for an NCI Designated Cancer Center and I need a
>>> way to analyze the publications of researchers who are members of our
>>> Cancer Center.  I know that a slow way to do this is to loop over each
>>> and every word and see if IndexOf gives a positive result, and I have
>>> also found a suggestion to create match criteria using Regex with all
>>> 8,000 words.
>>>
>>> But I feel that if I index the Cancer Terms using Lucene.net I should be
>>> able to do the same thing, but faster?
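>>>
>>> To make the question concrete, what I ultimately need per article is just
>>> a map of term -> count.  If the articles themselves were indexed, I gather
>>> it would look something like this (rough Java sketch pieced together from
>>> the docs; untested, and the "body" field name is made up):
>>>
>>>     import java.util.HashMap;
>>>     import java.util.Map;
>>>     import java.util.Set;
>>>
>>>     import org.apache.lucene.index.IndexReader;
>>>     import org.apache.lucene.index.Terms;
>>>     import org.apache.lucene.index.TermsEnum;
>>>     import org.apache.lucene.util.BytesRef;
>>>
>>>     public class TermCounts {
>>>         // Assumes the article's "body" field was indexed with term vectors enabled.
>>>         static Map<String, Long> countCancerTerms(IndexReader reader, int docId,
>>>                                                   Set<String> cancerTerms) throws Exception {
>>>             Map<String, Long> counts = new HashMap<>();
>>>             Terms vector = reader.getTermVector(docId, "body");
>>>             TermsEnum te = vector.iterator();
>>>             for (BytesRef term = te.next(); term != null; term = te.next()) {
>>>                 String word = term.utf8ToString();
>>>                 if (cancerTerms.contains(word)) {
>>>                     counts.put(word, te.totalTermFreq()); // frequency within this one doc
>>>                 }
>>>             }
>>>             return counts;
>>>         }
>>>     }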
>>> 
>>> If I'm totally off the mark just let me know.  I've been on the user
>>> group for over 15 years and love the potential.
>>> 
>>> Thanks,
>>> Bill
>>> 
>> 
