1. Overview

Density ratio estimation is described as follows: for given two data samples $$x$$ and $$y$$ from unknown distributions $$p_{nu}(x)$$ and $$p_{de}(y)$$ respectively, estimate

$w(x) = \frac{p_{nu}(x)}{p_{de}(x)}$

where $$x$$ and $$y$$ are $$d$$-dimensional real numbers.

The estimated density ratio function $$w(x)$$ can be used in many applications such as the inlier-based outlier detection[1], covariate shift adaptation[2] and etc[3].

The package densratio provides a function densratio() that returns a result has the function to estimate density ratio compute_density_ratio().

For example,

set.seed(3)
x <- rnorm(200, mean = 1, sd = 1/8)
y <- rnorm(200, mean = 1, sd = 1/2)

library(densratio)
result <- densratio(x, y)
result
##
## Call:
## densratio(x = x, y = y, method = "uLSIF")
##
## Kernel Information:
##   Kernel type:  Gaussian RBF
##   Number of kernels:  100
##   Bandwidth(sigma):  0.1
##   Centers:  num [1:100, 1] 1.007 0.752 0.917 0.824 0.7 ...
##
## Kernel Weights(alpha):
##   num [1:100] 0.4044 0.0479 0.1736 0.125 0.0597 ...
##
## Regularization Parameter(lambda):  0.1
##
## The Function to Estimate Density Ratio:
##   compute_density_ratio()

In this case, the true density ratio $$w(x)$$ is known, so we can compare $$w(x)$$ with the estimated density ratio $$\hat{w}(x)$$.

true_density_ratio <- function(x) dnorm(x, 1, 1/8) / dnorm(x, 1, 1/2)

3.2. Methods

densratio() has method parameter that you can pass "uLSIF" or "KLIEP".

• uLSIF (unconstrained Least-Squares Importance Fitting) is the default method. This method estimates density ratio by minimizing the squared loss. You can find more information in Hido(2011)[1].

• KLIEP (Kullback-Leibler Importance Estimation Procedure) is the anothor method. This method estimates density ratio by minimizing Kullback-Leibler divergence. You can find more information in Sugiyama(2007)[2].

The both methods assume that the denity ratio is represented by linear model:

$w(x) = \alpha_1 K(x, c_1) + \alpha_2 K(x, c_2) + ... + \alpha_b K(x, c_b)$

where

$K(x, c) = \exp\left(-\frac{\|x - c\|^2}{2 \sigma ^ 2}\right)$

is the Gaussian RBF.

densratio() performs the two main jobs:

• First, deciding kernel parameter $$\sigma$$ by cross validation,
• Second, optimizing kernel weights $$\alpha$$.

As the result, you can obtain compute_density_ratio().

3.3. Result and Paremeter Settings

densratio() outputs the result like as follows:

##
## Call:
## densratio(x = x, y = y, method = "uLSIF")
##
## Kernel Information:
##   Kernel type:  Gaussian RBF
##   Number of kernels:  100
##   Bandwidth(sigma):  0.1
##   Centers:  num [1:100, 1] 1.007 0.752 0.917 0.824 0.7 ...
##
## Kernel Weights(alpha):
##   num [1:100] 0.4044 0.0479 0.1736 0.125 0.0597 ...
##
## Regularization Parameter(lambda):  0.1
##
## The Function to Estimate Density Ratio:
##   compute_density_ratio()
• Kernel type is fixed by Gaussian RBF.
• Number of kernels is the number of kernels in the linear model. You can change by setting kernel_num parameter. In default, kernel_num = 100.
• Bandwidth(sigma) is the Gaussian kernel bandwidth. In default, sigma = "auto", the algorithms automatically select the optimal value by cross validation. If you set sigma a single number, it will be used. If you set a numeric vector, the algorithms select the optimal value in them by cross validation.
• Centers are centers of Gaussian kernels in the linear model. These are selected at random from the data sample x underlying a numerator distribution p_nu(x). You can find the whole values in result$kernel_info$centers.
• Kernel weights is the alpha parameters in the linear model. It was optimaized by the algorithms. You can find the whole values in result$alpha. • The funtion to estimate density ratio is named compute_density_ratio(). 4. Multi Dimensional Data Samples In the above, the input data samples x and y were one dimensional. densratio() allows to input multidimensional data samples as matrix. For example, library(densratio) library(mvtnorm) set.seed(71) x <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/8, 2)) y <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/2, 2)) result <- densratio(x, y) result ## ## Call: ## densratio(x = x, y = y, method = "uLSIF") ## ## Kernel Information: ## Kernel type: Gaussian RBF ## Number of kernels: 100 ## Bandwidth(sigma): 0.316 ## Centers: num [1:100, 1:2] 1.178 0.863 1.453 0.961 0.831 ... ## ## Kernel Weights(alpha): ## num [1:100] 0.145 0.128 0.138 0.187 0.303 ... ## ## Regularization Parameter(lambda): 0.1 ## ## The Function to Estimate Density Ratio: ## compute_density_ratio() Also in this case, we can compare the true density ratio with the estimated density ratio. true_density_ratio <- function(x) { dmvnorm(x, mean = c(1, 1), sigma = diag(1/8, 2)) / dmvnorm(x, mean = c(1, 1), sigma = diag(1/2, 2)) } estimated_density_ratio <- result$compute_density_ratio

N <- 20
range <- seq(0, 2, length.out = N)
input <- expand.grid(range, range)
z_true <- matrix(true_density_ratio(input), nrow = N)
z_hat <- matrix(estimated_density_ratio(input), nrow = N)

old_par <- par(mfrow = c(1, 2))
contour(range, range, z_true, main = "True Density Ratio")
contour(range, range, z_hat, main = "Estimated Density Ratio")

par(old_par)

The dimensions of x and y must be same.

5. References

[1] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems 2011.

[2] Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P. & Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2007.

[3] Sugiyama, M., Suzuki, T. & Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press 2012.