lucenenet-user mailing list archives

From Andy Pook <andy.p...@gmail.com>
Subject Re: index publication articles
Date Mon, 13 May 2019 18:08:52 GMT
A little late to this party...

Another approach is to add a custom tokenizer. When it hits one of your
key words or phrases, it adds an extra token (a special word, like "ccc")
at the same position. As a result, you can just search for "ccc", and this
will return all docs that contain any of your words. You also still have an
index where you can do general searches, perhaps in combination with the
special token (such as "ccc AND ufo" to find out why UFOs cause cancer :)
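
A minimal sketch of that idea, in plain Python rather than Lucene.NET so it
runs on its own. The term list, the toy inverted index, and the helper names
(analyze, build_index, search_all) are all assumptions made for illustration,
and it only handles single words, not phrases. In Lucene itself this kind of
injection is usually done by emitting the extra token with a position
increment of 0, the way synonym filters do.

```python
import re
from collections import defaultdict

# Hypothetical keyword list; in the thread this would be the 8,000+ NCI terms.
KEY_TERMS = {"cancer", "tumor", "carcinoma"}
MARKER = "ccc"  # the special token suggested in the message


def analyze(text):
    """Yield (position, token) pairs, adding MARKER at the position of each key term."""
    for position, match in enumerate(re.finditer(r"[a-z0-9]+", text.lower())):
        token = match.group()
        yield position, token
        if token in KEY_TERMS:
            # Same position as the original token, like a synonym.
            yield position, MARKER


def build_index(docs):
    """Map each token to {doc_id: [positions]} (a toy inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in analyze(text):
            index[token][doc_id].append(position)
    return index


def search_all(index, *tokens):
    """Return the ids of docs containing every token (a plain AND query)."""
    doc_sets = [set(index.get(token, {})) for token in tokens]
    return set.intersection(*doc_sets) if doc_sets else set()


docs = {
    1: "UFO sightings linked to tumor growth",
    2: "Coffee maker reviews",
    3: "New carcinoma treatment approved",
}
index = build_index(docs)
```

Here search_all(index, "ccc") finds every doc that mentions any key term, and
search_all(index, "ccc", "ufo") narrows that to docs that also mention "ufo",
without spelling out the individual terms in the query.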

At a previous gig we had a whole taxonomy of words and phrases that were
tagged this way. Then searches could be made on concepts and abstractions
rather than complex combinations of brackets, ANDs and ORs.

On Fri, 29 Mar 2019 at 21:39, Morgenweck <morgenweck@gmail.com> wrote:

> Thanks to everyone. Because it is a set number of documents (about 1000)
> and a set number of words (8000+), and time does not matter at first, I'm
> going to go with Regex initially. I found a company
> (https://bytescout.com/we-fight-against-cancer) that will donate their PDF
> extraction software and will work with me on developing the Regex. The
> number of hits for each word will be stored as metadata for each of the
> articles. Since I have total control over the words and it needs to be run
> with each word only, I can run this as a job or a nightly process and save
> the data. Once done, it's not used until a new article appears, and then
> only for that one.
>
> In regards to the nuclear reactor coffee maker: I loved it, but did you
> ever have that feeling that you are just missing something? What I was
> thinking in the back of my mind is what Lang said: index the 8000
> words. I do plan on doing this in the next step, where a new researcher
> will come to the Cancer Center and search for words that they create rather
> than being limited to the 8000. For instance, topics where they can find
> other researchers who work in their same type of area. That process is
> where Lucene.net will shine.
>
> Thank you all again
>
> On Fri, Mar 29, 2019 at 10:48 AM Jörg Lang <jlang@evelix.ch> wrote:
>
> > Hi
> >
> > I wouldn't go with a regex, because it only produces a hit if the match
> > is 100%.
> > With Lucene you can assign a language analyzer when indexing the
> > documents. When searching for your keywords you then also get hits for
> > plural/singular forms, and even verb inflections are considered.
> >
> > This of course comes at a cost: you might get a few hits that you
> > personally wouldn't mark as hits. But this is the general price of
> > full-text search.
> >
> > An idea worth exploring:
> > - Create a document with your list of 8000 terms.
> > - Have it indexed along with all the other documents.
> > - Do a "more like this" query, giving your "terms" document as input.
> > - You get a list of documents that contain words similar to those in the
> >   source document, with the most relevant documents ranked first.
> >
> > You can read about "moreLikeThis" here:
> >
> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
> >
> > https://lucene.apache.org/core/7_3_1/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
> >
> > This might also give you some input.
> >
> > Joerg
> >
> >
> > -----Original Message-----
> > From: Morgenweck, William <morgenww@musc.edu>
> > Sent: Friday, 29 March 2019 02:48
> > To: user@lucenenet.apache.org
> > Subject: index publication articles
> >
> > I need to ask this question because I think it might be something
> > Lucene.Net can do, but I'm not sure. I have a list of 8,000+ words that
> > are considered Cancer Terms by the NCI:
> > https://www.cancer.gov/publications/dictionaries/cancer-terms?expand=A
> > I have the terms stored locally, but I need to index articles that I have
> > downloaded and count the number of times each word appears in each
> > article. The purpose of this is to determine whether the article is
> > cancer-related. I work for an NCI-Designated Cancer Center, and I need a
> > way to analyze the publications of our researchers who are members of our
> > Cancer Center. I know that a slow way to do this is to loop over each and
> > every word and see if IndexOf gives a positive result, and I have found a
> > suggestion of creating a match criterion using Regex with all 8,000 words.
> >
> > But I feel that if I index the Cancer Terms using Lucene.net, I should
> > be able to do the same thing, but faster?
> >
> > If I'm totally off the mark, just let me know. I've been on the user
> > group for over 15 years and love the potential.
> >
> > Thanks,
> > Bill
>
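
For reference, the regex counting discussed in the quoted messages could be
sketched roughly like this. It is only an illustration: the three terms, the
sample article text, and the helper names (build_pattern, count_terms) are
made up, and it says nothing about how the PDF text gets extracted.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the 8,000+ NCI terms and one extracted article.
TERMS = ["tumor", "biopsy", "bone marrow"]
ARTICLE = (
    "The tumor was found by biopsy. "
    "Two tumors required a second biopsy of bone marrow."
)


def build_pattern(terms):
    """Compile one alternation that matches any term as a whole word or phrase."""
    # Longest terms first so a phrase wins over a shorter term that prefixes it.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def count_terms(pattern, text):
    """Count hits per term; the result is what would be stored as article metadata."""
    return Counter(match.group().lower() for match in pattern.finditer(text))


counts = count_terms(build_pattern(TERMS), ARTICLE)
```

Note that "tumors" is not counted as a hit for "tumor" here. That is the
exact-match limitation raised in the thread, and it is the gap a language
analyzer with stemming is meant to close.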
