Commit 1d6ed82

Improve similarity docs. (#29089)
This adds links to the relevant Lucene javadocs and warnings regarding similarities that might return 0 as a score. Close #29015
1 parent 08c5309 commit 1d6ed82

1 file changed: +42 -11 lines changed

docs/reference/index-modules/similarity.asciidoc

Lines changed: 42 additions & 11 deletions
@@ -97,22 +97,38 @@ similarity has the following option:
 Type name: `classic`

 [float]
-[[drf]]
+[[dfr]]
 ==== DFR similarity

 Similarity that implements the
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
 from randomness] framework. This similarity has the following options:

 [horizontal]
 `basic_model`::
-Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
+Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelBE.html[`be`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelD.html[`d`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`] and
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelP.html[`p`].
+
+`be`, `d` and `p` should be avoided in practice as they might return scores that
+are equal to 0 or infinite with terms that do not meet the expected random
+distribution.

 `after_effect`::
-Possible values: `no`, `b` and `l`.
+Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffect.NoAfterEffect.html[`no`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].

 `normalization`::
-Possible values: `no`, `h1`, `h2`, `h3` and `z`.
+Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].

 All options but the first option need a normalization value.


@@ -127,23 +143,34 @@ model.
 This similarity has the following options:

 [horizontal]
-`independence_measure`:: Possible values `standardized`, `saturated`, `chisquared`.
+`independence_measure`:: Possible values
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`],
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].
+
+When using this similarity, it is highly recommended to remove stop words to get
+good relevance. Also beware that terms whose frequency is less than the expected
+frequency will get a score equal to 0.

 Type name: `DFI`

 [float]
 [[ib]]
 ==== IB similarity.

-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
 based model] . The algorithm is based on the concept that the information content in any symbolic 'distribution'
 sequence is primarily determined by the repetitive usage of its basic elements.
 For written texts this challenge would correspond to comparing the writing styles of different authors.
 This similarity has the following options:

 [horizontal]
-`distribution`:: Possible values: `ll` and `spl`.
-`lambda`:: Possible values: `df` and `ttf`.
+`distribution`:: Possible values:
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
+`lambda`:: Possible values:
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
 `normalization`:: Same as in `DFR` similarity.

 Type name: `IB`
@@ -152,19 +179,23 @@ Type name: `IB`
 [[lm_dirichlet]]
 ==== LM Dirichlet similarity.

-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
 Dirichlet similarity] . This similarity has the following options:

 [horizontal]
 `mu`:: Default to `2000`.

+The scoring formula in the paper assigns negative scores to terms that have
+fewer occurrences than predicted by the language model, which is illegal to
+Lucene, so such terms get a score of 0.
+
 Type name: `LMDirichlet`

 [float]
 [[lm_jelinek_mercer]]
 ==== LM Jelinek Mercer similarity.

-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
 Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

 [horizontal]
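
Note on the LM Dirichlet warning added in this commit: the clamping to 0 can be illustrated with a small sketch. This is not Lucene's code; it assumes the usual Dirichlet-smoothed form `log(1 + freq / (mu * p)) + log(mu / (doc_len + mu))`, where `p` is the term's collection probability, and the function and parameter names are made up for illustration.

```python
import math


def lm_dirichlet_score(freq, doc_len, collection_prob, mu=2000.0):
    """Illustrative Dirichlet-smoothed term score, clamped at 0.

    freq            -- occurrences of the term in the document
    doc_len         -- number of tokens in the document
    collection_prob -- assumed probability of the term in the whole collection
    mu              -- smoothing parameter (the docs give 2000 as the default)
    """
    raw = math.log(1.0 + freq / (mu * collection_prob)) + math.log(mu / (doc_len + mu))
    # The raw value is negative when freq / doc_len < collection_prob, i.e. the
    # term occurs less often than the language model predicts; clamp it to 0.
    return max(raw, 0.0)


# A term seen once in 1000 tokens, but expected 1% of the time, is clamped to 0.
rare_in_doc = lm_dirichlet_score(freq=1, doc_len=1000, collection_prob=0.01)
# The same term seen 50 times in 1000 tokens scores above 0.
frequent_in_doc = lm_dirichlet_score(freq=50, doc_len=1000, collection_prob=0.01)
print(rare_in_doc, frequent_in_doc)
```

Under these assumptions the raw score changes sign exactly where the in-document frequency equals the collection probability, which matches the "fewer occurrences than predicted" wording in the added paragraph.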

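For readers applying the `basic_model` warning from this change, here is a hedged sketch of a helper that builds an index-settings body for a custom DFR similarity and rejects the models the docs say to avoid. The helper name, the similarity name `my_dfr`, and the decision to raise on `be`, `d` and `p` are assumptions for illustration; only the option names and values come from the documentation in the diff.

```python
import json

# Option values listed in the DFR section of the docs.
BASIC_MODELS = {"be", "d", "g", "if", "in", "ine", "p"}
AFTER_EFFECTS = {"no", "b", "l"}
NORMALIZATIONS = {"no", "h1", "h2", "h3", "z"}
# Models the docs warn may return scores of 0 or infinity.
RISKY_BASIC_MODELS = {"be", "d", "p"}


def dfr_similarity_settings(name, basic_model, after_effect, normalization):
    """Hypothetical helper: build a settings body for a custom DFR similarity."""
    if basic_model not in BASIC_MODELS:
        raise ValueError(f"unknown basic_model: {basic_model!r}")
    if after_effect not in AFTER_EFFECTS:
        raise ValueError(f"unknown after_effect: {after_effect!r}")
    if normalization not in NORMALIZATIONS:
        raise ValueError(f"unknown normalization: {normalization!r}")
    if basic_model in RISKY_BASIC_MODELS:
        # Policy choice of this sketch, not of Elasticsearch itself.
        raise ValueError(f"basic_model {basic_model!r} may score 0 or infinity")
    return {
        "settings": {
            "index": {
                "similarity": {
                    name: {
                        "type": "DFR",
                        "basic_model": basic_model,
                        "after_effect": after_effect,
                        "normalization": normalization,
                    }
                }
            }
        }
    }


body = dfr_similarity_settings("my_dfr", "g", "l", "no")
print(json.dumps(body, indent=2))
```

The example uses `no` normalization because the docs state that every other normalization needs an additional value, which this sketch does not model.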