Evergreen DokuWiki

Search Relevancy Ranking

Summary

The base relevancy score is determined by the cover density of the searched terms. After this base score is determined, items may receive score bumps based on word order, matching on the first word, and exact matches depending on the type of search performed.

Gory Details

The core relevance is the average, across the containing metarecord, of the cover density of the stemmed searched terms in every matching record. Here is an example:

Metarecord (MR) 1 contains records (REC) 5, 6 and 7, and MR 2 contains REC 8.

Keywords for:

REC 5: "foo bar baz boo"
REC 6: "fee bar fie fo baz"
REC 7: "bar baz fee fo bar"
REC 8: "harry potter and the cowboy bar rowling j k"

User searches for "bar", which matches all of the above keywords. We score the individual matches:

REC 5: 0.9
REC 6: 0.8 (keywords are longer, so cover density is lower)
REC 7: 1.4 (matched twice, CD is higher)
REC 8: 0.6

Then, we get the average per MR:

MR 1 := AVG( 0.9, 0.8, 1.4 ) ≈ 1.03
MR 2 := AVG( 0.6 ) == 0.6
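
The averaging step above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Evergreen's actual code: the names `record_scores`, `record_to_mr` and `metarecord_relevance` are hypothetical, and the per-record scores are simply the example values listed above rather than computed cover densities.

```python
from collections import defaultdict

# Example cover-density scores per record, copied from the text above.
record_scores = {5: 0.9, 6: 0.8, 7: 1.4, 8: 0.6}
# Hypothetical mapping of record id -> containing metarecord id.
record_to_mr = {5: 1, 6: 1, 7: 1, 8: 2}


def metarecord_relevance(record_scores, record_to_mr):
    """Average the per-record scores within each metarecord."""
    grouped = defaultdict(list)
    for rec, score in record_scores.items():
        grouped[record_to_mr[rec]].append(score)
    return {mr: sum(scores) / len(scores) for mr, scores in grouped.items()}


mr_scores = metarecord_relevance(record_scores, record_to_mr)
for mr, score in sorted(mr_scores.items()):
    print(f"MR {mr}: {score:.2f}")
```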

That provides a base to work with, which we then augment with bonuses in specific cases. Bonuses, or "relevance bumps", are applied as a multiplier to the base CD relevance on a per-record basis. They have only a small effect when just one of many RECs within an MR gains the bonus, but when most RECs gain it they help push the "best" MRs to the top of the list.
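
To see why a bump on a single record moves a metarecord only slightly, here is a rough sketch using the MR 1 scores from the example. It assumes the multiplier is applied to each record before averaging, which is how the paragraph above reads; `metarecord_rank` and the `bumps` mapping are hypothetical names, and 1.2 is used purely as a sample multiplier.

```python
def metarecord_rank(base_scores, bumps):
    """Average per-record scores after applying per-record multipliers.

    base_scores: record id -> base cover-density score
    bumps: record id -> multiplier (records not listed get 1.0)
    """
    bumped = [score * bumps.get(rec, 1.0) for rec, score in base_scores.items()]
    return sum(bumped) / len(bumped)


mr1 = {5: 0.9, 6: 0.8, 7: 1.4}  # base scores for MR 1 from the example

no_bump = metarecord_rank(mr1, {})
one_bump = metarecord_rank(mr1, {5: 1.2})                   # only REC 5 bumped
all_bump = metarecord_rank(mr1, {5: 1.2, 6: 1.2, 7: 1.2})   # every record bumped

print(f"{no_bump:.2f} {one_bump:.2f} {all_bump:.2f}")
```

With one bumped record the average rises from about 1.03 to about 1.09; with all three bumped it rises to about 1.24.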

For subject and author searches, the indexed strings are so short, and word order so nearly irrelevant, that there are no special bonuses; we use the base CD ranking alone for relevance ordering.

For keyword searches, word order is important, because people generally type in phrases as they would speak them and the indexed strings are long enough for order to be relevant. With that in mind, we apply a bonus multiplier of 1.2 (20%) to the rank of each matching record in which all of the words in the user's query (unstemmed, but stripped of diacritics and lowercased) appear in the same order as they do in the searched field.
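
A minimal sketch of the word-order check follows. It makes two assumptions not stated in the text: that "in the same order" allows other words in between (a subsequence match rather than an exact phrase), and that whitespace splitting is enough to find words. The helper names `normalize`, `in_same_order` and `keyword_rank` are hypothetical.

```python
import unicodedata


def normalize(text):
    """Strip diacritics, lowercase, and split on whitespace (no stemming)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().split()


def in_same_order(query, field):
    """True if every query word occurs in the field, in query order.

    Assumption: other words may sit between the matches.
    """
    query_words = normalize(query)
    if not query_words:
        return False
    remaining = iter(normalize(field))
    # `word in remaining` consumes the iterator up to the match,
    # so each later word must appear after the earlier one.
    return all(word in remaining for word in query_words)


def keyword_rank(base_cd, query, field, bonus=1.2):
    """Apply the word-order bump to one record's base score."""
    return base_cd * bonus if in_same_order(query, field) else base_cd


rec8 = "harry potter and the cowboy bar rowling j k"
print(keyword_rank(0.6, "Harry Potter", rec8))
print(keyword_rank(0.6, "potter harry", rec8))
```

The first call earns the bump because "harry" precedes "potter" in the indexed string; the second keeps the base score.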

For series title searching, word order is less important, but the first indexable, non-article word and the possibility of a complete match are important. A first-word match (the first indexable word of the query and of the searched string are the same) earns a bonus of 1.5 (50%). If the entire series title, minus leading articles, matches the entire user query, we apply a very large bonus of 200.
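
The two series bumps might be sketched as below. Several details are assumptions rather than facts from this page: the article list is English-only, leading articles are stripped from the query as well as the title, the value 200 is treated as a multiplier like the other bumps, and the two bumps are multiplied together when both apply. `ARTICLES`, `strip_leading_articles` and `series_multiplier` are hypothetical names.

```python
ARTICLES = {"a", "an", "the"}  # assumption: English articles only


def strip_leading_articles(words):
    """Drop articles from the front of a word list."""
    i = 0
    while i < len(words) and words[i] in ARTICLES:
        i += 1
    return words[i:]


def series_multiplier(query, title, first_word_bonus=1.5, full_match_bonus=200):
    """Return the combined bump for one series title.

    Assumption: bumps stack by multiplication when both apply.
    """
    q = strip_leading_articles(query.lower().split())
    t = strip_leading_articles(title.lower().split())
    if not q or not t:
        return 1.0
    multiplier = 1.0
    if q[0] == t[0]:   # first indexable words agree
        multiplier *= first_word_bonus
    if q == t:         # whole title matches whole query
        multiplier *= full_match_bonus
    return multiplier


print(series_multiplier("example series", "The Example Series"))
print(series_multiplier("example books", "The Example Series"))
print(series_multiplier("series", "The Example Series"))
```

Under these assumptions a complete match also counts as a first-word match, so it receives both bumps.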

Title searches, which search all titles except series titles, combine the relevance bonuses from keyword searches (word order) and series searches (first word and whole string). This gives very good matching, especially on very short titles, which would otherwise have little chance of rising to the top even though they are the most likely target of a short title query.
