Target audience: Advanced
Estimated reading time: 10'
This post illustrates the Normalized Discounted Cumulative Gain (NDCG) and its implementation in Scala.
Numerous real-life applications of machine learning require the prediction the most relevant ranking of items to optimize an outcome. For instance
- Evaluate and prioritize counter-measures to cyber-attach
- Ranks symptoms in a clinical trial
- Extract documents relevant to a given topic from a corpus
This post uses Scala 2.11.8
Discounted Cumulative Gain
Let's dive into the mathematical formalism for the Discounted Cumulative Gain.For a indexed target values tj as illustrated in the diagram above, the discounted cumulative gain is computed as
\[DCG=\sum_{j=1}^{n}\frac{2^{t_{j}}-1}{log_{2}(j+1)}\] The objective is to compare any given list of ranked/sorted item with a benchmark which represent the optimum ranking (ranking with the highest DCG value).
\[p(ranking|IDCG)=\frac{log(2)}{IDCG}\sum_{j=0}^{n}\frac{2^{t_{j}}-1}{log(j+1)}\]
Scala implementation
The implementation of the computation of NDCG in Scala is quite simple, indeed. Given a ranked list of items. The three steps are
- Compute the IDCG (or normalization factor)from the list
- Compute the DCG for the list
- Compute the NDCG = DCG/IDCF
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | class NDCG[T]( firstSample: Seq[T]) (implicit ranking: T => Int) { import NDCG._ val iDCG: Double = normalize def score: Double = score(initialSample) private def dcg(samples: Seq[T]): Double = samples.zipWithIndex.aggregate(0.0)( (s, samplej) => s + compute(samplej._2 + 1, ranking(samplej._1)) , _ + _) private def normalize: Double = { val sorted = initialSample.zipWithIndex.sortBy{ case (sample, n) => -ranking(sample) }.map( _._1) dcg(sorted) } } |
The Ideal Discounted Cumulative Gain, iDCG is computed through the normalize method (line 6). iDCG (normalization factor) is computed by first sorting the items of type T by their value in decreasing order (line 16), then scoring this re-sorted list using the dcg method (line 17).
The computation of the Discounted Cumulative Gain by the method dcg (line 10) is a direct application of the formula described in the previous chapter.
Note: The logarithm function uses a base 2. It is computed as natural log(x)/natural log (2)
Let's now consider a list of items of type Item defined as follows:
case class Item(id: String, x: Double, rank: Int)
The list of items, itemsList is implicitly ranked through the last attribute, rank.
val itemsList = Seq[Item]( Item("1", 3.4, 4), Item("2", 1.4, 3), Item("3", -0.7, 5), Item("4", 5.2, 2), Item("5", 1.4, 1)) implicit val ranking = (item: Item) => item.rank
It is time to compute the NDCG coefficient for the list of items, by invoking the score method.
val nDCG = new NDCG[Item](itemsList) println(s"IDCG = ${nDCG.iDCG}") //45.64 println(s"Score = ${nDCG.score}") // 0.801
The ideal discounted cumulative gain, iDCG is 45.6: It is the optimum ranking for this list of time. The first sample score a probability of 0.8
Note The DCG of subsequent samples can be computed using the same iDCG value from the same instance of NDCG.
def score(samples: Seq[T]): Double = if( samples.size != initialSample.size) 0.0 else dcg(samples)/iDCG