Mutual Information Between Two Continuous Distributions

PLoS One. 2014; 9(2): e87357.

Mutual Information between Discrete and Continuous Data Sets

Brian C. Ross

Department of Physics, University of Washington, Seattle, Washington, United States of America

Daniele Marinazzo, Editor

Received 2013 Nov 26; Accepted 2013 Dec 19.

Supplementary Materials: Script S1: Slow (vector) MI calculator. Estimates MI between two vector or scalar data sets using the nearest-neighbor method.

(M)

GUID: A023BB14-AEDB-4609-AAB3-85587FD6F32B

Script S2: Fast (scalar) MI calculator. Estimates MI between two scalar data sets using the nearest-neighbor method.

(M)

GUID: 6F9570C9-7E93-4D3C-8A01-57F5FBF555DE

Script S3: Binning MI calculator. Estimates MI between two scalar data sets using the binning method.

(M)

GUID: F40B4602-07AB-4288-97B4-CA8B9FB8C21B

Script S4: Testing script. Compares the methods using sampled data drawn from user-defined distributions. This script was used to generate the plots in this paper.

(M)

GUID: FC80087F-3117-41F3-9B26-7AB3E11460B9

Abstract

Mutual information (MI) is a powerful method for detecting relationships between data sets. There are accurate methods for estimating MI that avoid problems with "binning" when both data sets are discrete or when both data sets are continuous. We present an accurate, non-binning MI estimator for the case of one discrete data set and one continuous data set. This case applies when measuring, for example, the relationship between base sequence and gene expression level, or the effect of a cancer drug on patient survival time. We also show how our method can be adapted to calculate the Jensen–Shannon divergence of two or more data sets.

Introduction

Mutual information (MI) [1] is in several ways a perfect statistic for measuring the degree of relatedness between data sets. First, MI will detect any sort of relationship between data sets whatsoever, whether it involves the mean values or the variances or higher moments. Second, MI has a straightforward interpretation as the amount of shared information between data sets (measured in, for example, bits); other statistics such as rank-ordering are harder to interpret. Since MI is grounded in information theory it has an established base of theoretical tools. Finally, MI is insensitive to the size of the data sets. Whereas a 'p-value' test for strict independence can be pushed arbitrarily low by taking a large data set if the variables are even slightly related, MI will simply converge with tight error bounds to a measure of their relatedness.

The MI between two data sets An external file that holds a picture, illustration, etc. Object name is pone.0087357.e001.jpg and can be estimated from the statistics of the pairs between the two data sets. (Although MI is straightforward to calculate if the underlying probability distribution is known, that is not usually the case: our knowledge of the distribution generally comes from the sampled data itself, so MI must be estimated from the statistics of our data set.) For example, if we were to compare the day of week ( An external file that holds a picture, illustration, etc. Object name is pone.0087357.e004.jpg ) with the time of breakfast () we might find that when is a weekday the corresponding is early in the morning, and when An external file that holds a picture, illustration, etc. Object name is pone.0087357.e008.jpg is Sunday or (especially) Saturday the corresponding is somewhat later. MI quantifies the strength of this effect. Importantly, the procedure for estimating MI depends on whether An external file that holds a picture, illustration, etc. Object name is pone.0087357.e010.jpg and take discrete values (e.g. a day of week, a nucleobase, a phenotypic category, etc.), or are real-valued continuous variables (a time of day, a gene expression level, a patient's survival time, etc.). If An external file that holds a picture, illustration, etc. Object name is pone.0087357.e012.jpg and are both discrete, then we can estimate the true frequencies of all combinations of pairs by counting the number of times each pair occurs in the data, and straightforwardly use these frequencies to estimate MI. Real-valued data sets are more difficult to deal with, since they are by definition sparsely sampled: most real numbers will not be found in a data set of any size. The common workaround is to lump the continuous variables into discrete 'bins' and then apply a discrete MI estimator, but good sampling requires large bins which destroys resolution. An improved continuous-continuous MI estimator described in Ref. [2] circumvents this tradeoff by using statistics of the spacings between data points and their nearest neighbors. Crucially, their method only works when both variables are real-valued, as the nearest neighbor of a discrete variable is not well-defined.

This paper describes a method for estimating the MI between a discrete data set and a continuous (scalar or vector) data set, using a similar approach to that of Ref. [2]. This is an important statistic simply because so many scientific activities involve a search for significant relationships between discrete and continuous variables. For example, one might use MI to quantify the extent to which nationality (a discrete variable) determines income (continuous); to identify DNA bases (ACGT, discrete) that affect a given gene's expression level (continuous); or to find drugs (given or not: a discrete parameter) that alter cell division rates (continuous data). In the University of Washington Nanopore Physics lab we use this estimator to determine where a given DNA base must sit within the sequencing pore in order to affect the current passing through it, and to quantify the relative influence of different base positions on the current. As we will demonstrate, our nearest-neighbors method estimates MI much more reliably than does the present alternative method of 'binning' the data.

MI between a discrete and a continuous variable is equivalent to a weighted form of the Jensen-Shannon (JS) divergence [3] which is used as a measure of the dissimilarity between two or more continuous probability distributions. We can therefore apply our method to estimate the weighted JS divergence, by storing samples from each distribution to be compared in the continuous data set An external file that holds a picture, illustration, etc. Object name is pone.0087357.e015.jpg , and using the discrete data set to identify which distribution each sample was drawn from. To use our method to estimate the unweighted JS divergence, we would either draw equal numbers of samples from each distribution, or else modify our method somewhat as explained in the Analysis section.

Results

To test our method, we chose two simple distributions An external file that holds a picture, illustration, etc. Object name is pone.0087357.e089.jpg : a square wave distribution in for each value in , and a Gaussian distribution in for each (Figure 2A). Because we knew the exact form of the distributions, we were able to calculate MI exactly using its mathematical definition:

An external file that holds a picture, illustration, etc. Object name is pone.0087357.g002.jpg

equation image

(4)

Next, from each distribution, we constructed test data sets by randomly sampling a certain number An external file that holds a picture, illustration, etc. Object name is pone.0087357.e103.jpg of data pairs. We then independently estimated MI from those data sets using our nearest-neighbor estimator and also using our binning estimator, and compared those estimates to each other and to the exact result. We also compared the MI estimate between our vector and scalar implementations of the nearest-neighbor method. Their results in all cases are in exact agreement with each other. This is a strong check that the scripts were written correctly, since the two estimators were coded quite differently.

Both the nearest-neighbor method and the binning method involve a somewhat arbitrary parameter that must be set by the user. The nearest neighbor method requires that the user specify An external file that holds a picture, illustration, etc. Object name is pone.0087357.e105.jpg (the th neighbor). should be some low integer, much less than the number of data points , so Figure 2B plots MI estimated by nearest neighbors over the range An external file that holds a picture, illustration, etc. Object name is pone.0087357.e109.jpg . Likewise, the binning method requires that the user specify the number of data points per bin. It is less obvious what the best value of should be; Figure 2C plots MI estimated by binning over all possible values An external file that holds a picture, illustration, etc. Object name is pone.0087357.e112.jpg .

Our first conclusion is that there is a much simpler prescription for setting the An external file that holds a picture, illustration, etc. Object name is pone.0087357.e113.jpg parameter of the nearest-neighbor estimator than the parameter of the binning method. The nearest-neighbor estimator consistently gives good results when An external file that holds a picture, illustration, etc. Object name is pone.0087357.e115.jpg is set to a low integer. Reference [2] suggests using , and that choice works well with our estimator too. By contrast, the binning estimator overestimates MI when An external file that holds a picture, illustration, etc. Object name is pone.0087357.e117.jpg is low and underestimates MI when is high, and although there is guaranteed to be a crossing point where the method is accurate it is hard to guess where that point might be. (In the limit An external file that holds a picture, illustration, etc. Object name is pone.0087357.e119.jpg the binning method estimates MI to be the entropy of the discrete variable. The actual MI only attains this maximum limit if the sub-distributions are all completely separated in An external file that holds a picture, illustration, etc. Object name is pone.0087357.e121.jpg . In the limit the binning method estimates MI to be zero.).

Our second conclusion is that there is no simple way to calculate the optimal binning parameter An external file that holds a picture, illustration, etc. Object name is pone.0087357.e123.jpg based on simple statistics of the data, such as the total number of data points or frequencies with which different discrete symbols occur. For example, the large Gaussian data sets and the large square-wave data sets each have 10000 data points per set, with twice as many red points as blue points on average, and five times more reds than greens. But the best value of An external file that holds a picture, illustration, etc. Object name is pone.0087357.e125.jpg is ∼100 for the square-wave data set and ∼600 for the Gaussian data sets. This is easiest to see in Figure 3A, which plots the ratio of the median binning error using given An external file that holds a picture, illustration, etc. Object name is pone.0087357.e126.jpg to the median nearest-neighbors error using . We find that there is no choice of for which binning is better than nearest-neighbors for both the square wave and Gaussian data sets. Figure 3B shows roughly the same result for the 400-point data sets, which again are statistically similar except in the shape of their distributions in An external file that holds a picture, illustration, etc. Object name is pone.0087357.e129.jpg .

An external file that holds a picture, illustration, etc. Object name is pone.0087357.g003.jpg

We conclude that MI estimation by the nearest neighbor method is far more accurate than binning-based MI estimates, barring a lucky guess of the unknowable best value of An external file that holds a picture, illustration, etc. Object name is pone.0087357.e136.jpg . Furthermore, our nearest-neighbor method is computationally cheap: both computation time and memory usage are proportional to for the scalar estimator. Therefore nearest neighbors should be the method of choice for estimating MI in the discrete-continuous case.

Analysis

Here we derive the formula for our nearest-neighbor MI estimator.

Consider a discrete variable An external file that holds a picture, illustration, etc. Object name is pone.0087357.e138.jpg and the continuous variable , drawn from probability density . Both and may be either univariate (composed of scalars) or multivariate (vectors). We will write discrete probability functions as An external file that holds a picture, illustration, etc. Object name is pone.0087357.e143.jpg and continuous densities using the symbol : therefore and . The mutual information is:

equation image

(5)

Here An external file that holds a picture, illustration, etc. Object name is pone.0087357.e148.jpg denotes an entropy, is the probability density for sampling irrespective of the value of , and An external file that holds a picture, illustration, etc. Object name is pone.0087357.e152.jpg is the probability density for sampling given a particular value of . The averages are taken over the full distribution and weighted by An external file that holds a picture, illustration, etc. Object name is pone.0087357.e155.jpg , and they would be straightforward to calculate if we knew the underlying density functions. Alternatively, each average can be taken over a representative set from pairs sampled from the distribution; using this latter interpretation we estimate the MI from the mean of An external file that holds a picture, illustration, etc. Object name is pone.0087357.e157.jpg and at each of our sampled data points. The more points we have, the greater the accuracy.

The remaining task is to estimate the logarithm of two continuous distributions evaluated at given data points. For this we use a nearest-neighbor entropy estimator originally developed by Kozachenko and Leonenko [5] whose proof we will briefly outline. Given a point An external file that holds a picture, illustration, etc. Object name is pone.0087357.e159.jpg , we define as the volume of points centered about that are closer to point than its An external file that holds a picture, illustration, etc. Object name is pone.0087357.e163.jpg th neighbor. The estimator uses Bayesian arguments to identify with ( denotes a probability density that is not to be confused with An external file that holds a picture, illustration, etc. Object name is pone.0087357.e167.jpg ). Approximating the density function as being constant throughout the neighborhood of point , we find:

equation image

(6)

where An external file that holds a picture, illustration, etc. Object name is pone.0087357.e171.jpg is the beta function [4] and is the digamma function. We can now estimate the entropy using the full data set:

equation image

(7)

where the average is taken over all sampled data points.

For each sampled data point An external file that holds a picture, illustration, etc. Object name is pone.0087357.e174.jpg we employ the Kozachenko-Leonenko (KL) entropy estimator twice: once to estimate by finding a neighbor from the full set of data points, and once to estimate An external file that holds a picture, illustration, etc. Object name is pone.0087357.e176.jpg by finding a neighbor in the subset of data points for which . Notice that we can independently choose the neighbors of the two points: we will pick the An external file that holds a picture, illustration, etc. Object name is pone.0087357.e179.jpg th neighbor in the reduced distribution and the th neighbor from the full distribution. The result is

equation image

(8)

There is a systematic averaging error that comes from the fact that the An external file that holds a picture, illustration, etc. Object name is pone.0087357.e182.jpg th-neighbor KL entropy estimator applied to point necessarily computes the average of over the volume , rather than evaluated exactly at point An external file that holds a picture, illustration, etc. Object name is pone.0087357.e186.jpg . Following Ref. [2], we attempt to minimize this error by choosing and so that both uses of the KL entropy estimator use the same neighbor An external file that holds a picture, illustration, etc. Object name is pone.0087357.e189.jpg . Therefore for each data point, and we obtain Eq. 2. The cancellation is only partial; but because the averaging error scales with the number of data pairs as An external file that holds a picture, illustration, etc. Object name is pone.0087357.e191.jpg whereas the counting error scales as , averaging error is generally insignificant except for very small data sets (as we have verified in our tests).

As mentioned before, the mutual information between discrete and continuous data is equivalent to a weighted Jensen-Shannon (JS) divergence between the conditional distributions An external file that holds a picture, illustration, etc. Object name is pone.0087357.e193.jpg , where the frequencies of the discrete symbols are the weighting factors. To compute an unweighted JS divergence we need to place all the conditional distributions on equal footing irrespective of their frequencies in the data, by weighting each term in the averages in Eq. 5 by the factor An external file that holds a picture, illustration, etc. Object name is pone.0087357.e196.jpg where is the number of distinct values that can take. The result is

equation image

(9)

Supporting Information

Script S1

Slow (vector) MI calculator. Estimates MI between two vector or scalar data sets using the nearest-neighbor method.

(M)

Script S2

Fast (scalar) MI calculator. Estimates MI between two scalar data sets using the nearest-neighbor method.

(M)

Script S3

Binning MI calculator. Estimates MI between two scalar data sets using the binning method.

(M)

Script S4

Testing script. Compares the methods using sampled data drawn from user-defined distributions. This script was used to generate the plots in this paper.

(M)

Acknowledgments

The author wishes to acknowledge Vikram Agarwal and Walter Ruzzo for helpful discussions, and Andrew Laszlo, Henry Brinkerhoff, Jens Gunlach and Jenny Mae Samson for valuable comments on this paper.

Funding Statement

Funding for this work was provided by National Institutes of Health/National Human Genome Research Institute grant R01HG005115. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Cover T, Thomas J (1991) Elements of information theory. New York: John Wiley & Sons. [Google Scholar]

2. Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Physical Review E 69: 066138. [PubMed] [Google Scholar]

3. Grosse I, Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver J, et al. (2002) Analysis of symbolic sequences using the jensen-shannon divergence. Physical Review E 65: 041905. [PubMed] [Google Scholar]

4. Abramowitz M, Stegun I (1970) Handbook of mathematical functions. New York: Dover Publishing Inc. [Google Scholar]

5. Kozachenko L, Leonenko NN (1987) Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii 23: 9–16. [Google Scholar]

Articles from PLoS ONE are provided here courtesy of Public Library of Science

archeroppervis1983.blogspot.com

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3929353/

Mutual Information Between Two Continuous Distributions

Mutual Information between Discrete and Continuous Data Sets

Brian C. Ross

Abstract

Introduction

Results

Analysis

Supporting Information

Script S1

Script S2

Script S3

Script S4

Acknowledgments

Funding Statement

References

0 Response to "Mutual Information Between Two Continuous Distributions"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel