**multivariate normal sampling**using ND4j.

The multi-variate normal sampling function is used in various machine learning techniques such as Markov chains Monte Carlo (MCMC) or contextual multi-armed bandit.

The implementation of the multivariate normal relies on data vectorization, technique well-known to Python developers and data scientists alike.

__Note__: The post requires some knowledge of data vectorization (numpy, datavec, ND4j..) as well as Scala programming language.

Vectorization

**Python Numpy**is a well-known and reliable vectorized linear algebra library which is a foundation of scientific (

*SciPy*) and machine learning (

*Sciktlearn*) libraries. No serious data scientific projects can reasonably succeeds with the power and efficiency of

**numpy.**

The

**vectorization**of datasets is the main reason behind the performance of machine learning models (training and prediction) build in Python.

Is there a similar linear algebra library, supporting vectorization, available to Scala and Spark developers? Yes,

**ND4j**

ND4j, BLAS and LAPACK

**ND4j**library replicates the functionality of numpy for Java developers. ND4j can be downloaded as an independent library or as component of the deep learning library,

**deeplearning4j.**It leverages the BLAS and LAPACK libraries.

From a Java developer perspective, the data represented by an NDArray is stored outside of the Java Virtual Machine. This design has the following benefits:

- Avoid taxing the Garbage Collector
- Interoperability with high-performance BLAS libraries such as OpenBlas
- Allow number of array elements exceeds Int.MaxValue

The

**BLAS**(Basic Linear Algebra Subprograms) are functions performing basic vector and matrix operations. The library is divided in 3 levels:- Level 1 BLAS perform scalar, vector and vector-vector operations,
- Level 2 BLAS perform matrix-vector operations
- Level 3 BLAS perform matrix-matrix operations.

**OpenBLAS**is an optimized Basic Linear Algebra Subprograms (BLAS) library based on GotoBLAS2, a C-library of linear algebra supporting a large variety of micro-processor. Its usage is governed by the BSD license.

**LAPACK**are Fortran routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems and matrix factorizations.

Implicit ND4j array conversion

The first step is to create a implicit conversation between ND4j and Scala data types.

The following code convert an array of double into a

The first step is to create a implicit conversation between ND4j and Scala data types.

The following code convert an array of double into a

**INDArray**using*org.nd4j.linalg.factory***.Nd4j****java class and its constructor***create(double[] values, int[] shape)**In case of a vector, the shape is defined in scala as Array[Int](size_vector)**In case of a matrix, the shape is defined as Array[Int](numRows, numCols).*

1 2 3 4 5 6 | @throws(clazz = classOf[IllegalArgumentException]) implicit def double2NDArray(values: Array[Double]): INDArray = { require(values.nonEmpty, "ERROR: ND4, conversion ...") Nd4j.create(values, Array[Int](1, values.length)) } |

Multivariate Normal distribution implementation

The sampling of a multivariate normal distribution is defined by the formula

\[\mathbb{N}(\mu, \Sigma )=\frac{1}{\sqrt{(2\pi)^n \left | \Sigma \right |}} e^{\frac{1}{2}(x-\mu)^T {\Sigma '}^{-1}(x-\mu)}\] A generalized definition adds a very small random perturbation factor r |r| <= 1 on the variance value (diagonal of the covariance matrix) \[\Sigma ' = \Sigma + r.I\] Sigma is the covariance matrix and the mu is the mean value.

The computation of the multivariate normal sampling can be approximated by the

**Cholesky decomposition.**In a nutshell, the Cholesky algorithm decompose a positive-definite matrix into a product of two matrix

- lower triangle matrix
- transposition of its conjugate

The following implementation relies on the direct invocation of LAPACK library function

*potrf*. The LAPACK functionality are accessed through the BLAS wrapper

*getBlasWrapper*.

1 2 3 4 5 | final def choleskyDecomposition(matrix: INDArray): INDArray = { val matrixDup = matrix.dup Nd4j.getBlasWrapper.lapack.potrf(matrixDup, false) matrixDup } |

Note that the matrix is duplicated prior to the

*LAPACK*function call as we do not want to alter the input matrix.

Let's implement the multivariate Normal sampler with

*perturbation*using the Cholesky decomposition.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | @throws(clazz = classOf[IllegalArgumentException]) @throws(clazz = classOf[IllegalStateException]) final def multiVariateNormalSampling( mean: INDArray, cov: INDArray, perturbation: Double = 0.0): INDArray = { import scala.util.Random require(cov.size(0) == mean.length, s"Sigma shape ${cov.size(0)} should be mean size ${mean.length}") require(cov.size(1) == mean.length, s"Sigma shape ${cov.size(1)} should be ${mean.length}") try { // Add a perturbation epsilon value to the covariance matrix // if it is defined val perturbMatrix = if(perturbation > 0.0) cov.add( squareIdentityMatrix(cov.size(0), perturbation*(Random.nextDouble-0.5))) else cov // Apply the Cholesky decomposition val perturbed: INDArray = choleskyDecomposition(perturbMatrix) // Instantiate a normal distribution val normalPdf = new NormalDistribution( new DefaultRandom, 0.0, 1.0) val normalDensity = Array.fill(mean.size(0))(normalPdf.sample) val normalNDArray: INDArray = normalDensity // This is an implementation of the Dot product val normalCov = perturbed.mmul(normalNDArray.transpose) // Add the normalized perturbed covariance to the mean value mean.add(normalCov) } catch { case e: org.nd4j.linalg.api.blas.BlasException => throw new IllegalStateException(s"multiVariateNormalSampling: ${e.toString}") case e: org.apache.commons.math3.exception.NotStrictlyPositiveException => throw new IllegalStateException(s"multiVariateNormalSampling: ${e.toString}") case e: Throwable => throw new IllegalStateException(s"multiVariateNormalSampling: ${e.toString}") } } |

Let's look at the full implementation of the multi-variate normal sampling.

The first step validates the shape of the mean and covariance input matrices [line 8, 9]. As mentioned earlier, the generalized normal distribution introduces an optional random perturbation of small magnitude (~1e-4) [line 14-17] that is useful for application that requires some stochastic

The '

*perturbed*' covariance matrix is factorized using the

*Cholesky*decomposition [line 22]. The normal probability density function (mean 0.0 and standard deviation 1.0) is used to generate random values [line 24-30] which is then applied to the covariance matrix [line 33].

The normal randomized variance is added to the vector of mean values [line 35].

For the sake of convenience, the multivariate normal density function uses the

*Apache Math common 3.5 API*[line 24].

Environment

Scala 2.13.2

JDK 1.8

ND4j 1.0.0-beta3

References

Scala 2.13

ND4j

Deep learning 4j

Cholesky decomposition

Approximation multivariate Normal sampling

Apache common math