{"title": "Second Order Approximations for Probability Models", "book": "Advances in Neural Information Processing Systems", "page_first": 238, "page_last": 244, "abstract": null, "full_text": "Second order approximations for probability \n\nmodels \n\nHilbert J. Kappen \n\nDepartment of Biophysics \n\nNijmegen University \n\nNijmegen, the Netherlands \nbert@mbfys.kun.nl \n\nWim Wiegerinck \n\nDepartment of Biophysics \n\nNijmegen University \n\nNijmegen, the Netherlands \nwimw@mbfys.kun.nl \n\nAbstract \n\nIn this paper, we derive a second order mean field theory for directed \ngraphical probability models. By using an information theoretic argu(cid:173)\nment it is shown how this can be done in the absense of a partition \nfunction. This method is a direct generalisation of the well-known TAP \napproximation for Boltzmann Machines. In a numerical example, it is \nshown that the method greatly improves the first order mean field ap(cid:173)\nproximation. For a restricted class of graphical models, so-called single \noverlap graphs, the second order method has comparable complexity to \nthe first order method. For sigmoid belief networks, the method is shown \nto be particularly fast and effective. \n\n1 Introduction \n\nRecently, a number of authors have proposed deterministic methods for approximate infer(cid:173)\nence in large graphical models. The simplest approach gives a lower bound on the prob(cid:173)\nability of a subset of variables using Jenssen's inequality (Saul et aI., 1996). The method \ninvolves the minimization of the KL divergence between the target probability distribution \np and some 'simple' variational distribution q. The method can be applied to a large class \nof probability models, such as sigmoid belief networks, DAGs and Boltzmann Machines \n(BM). \n\nFor Boltzmann-Gibbs distributions, it is possible to derive the lower bound as the first \nterm in a Taylor series expansion of the free energy around a factorized model. 
The free energy is given by -log Z, where Z is the normalization constant of the Boltzmann-Gibbs distribution: p(x) = exp(-E(x))/Z. This Taylor series can be continued, and the second order term is known as the TAP correction (Plefka, 1982; Kappen and Rodriguez, 1998). The second order term significantly improves the quality of the approximation, but is no longer a bound. \n\nFor probability distributions that are not Boltzmann-Gibbs distributions, it is not obvious how to obtain the second order approximation. However, there is an alternative way to compute the higher order corrections, based on an information theoretic argument. Recently, this argument was applied to stochastic neural networks with asymmetric connectivity (Kappen and Spanjers, 1999). Here, we apply this idea to directed graphical models. \n\n2 The method \n\nLet x = (x_1, ..., x_n) be an n-dimensional vector, with x_i taking on discrete values. Let p(x) be a directed graphical model on x. We will assume that p(x) can be written as a product of potentials in the following way: \n\np(x) = prod_{k=1}^n p_k(x_k|pi_k) = exp( sum_{k=1}^n phi_k(x^k) )   (1) \n\nHere, p_k(x_k|pi_k) denotes the conditional probability table of variable x_k given the values of its parents pi_k, x^k = (x_k, pi_k) denotes the subset of variables that appear in potential k, and phi_k(x^k) = log p_k(x_k|pi_k). Potentials can be overlapping, x^k ∩ x^l ≠ ∅, and x = ∪_k x^k. We wish to compute the marginal probability that x_i has some specific value s_i in the presence of some evidence. We therefore denote x = (e, s), where e denotes the subset of variables that constitute the evidence, and s denotes the remainder of the variables. The marginal is given as \n\np(s_i|e) = p(s_i, e) / p(e)   (2) \n\nBoth numerator and denominator contain sums over hidden states. These sums scale exponentially with the size of the problem, and therefore the computation of marginals is intractable. 
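The exponential cost of these sums is easy to see in a toy example. The following sketch (our own illustration; the three-node chain, its probability tables, and all names are hypothetical, not from the paper) computes the marginal of Eq. 2 by brute-force enumeration of the joint p(x) = prod_k p_k(x_k|pi_k):

```python
import itertools

# Toy DAG x1 -> x2 -> x3 with binary variables; conditional probability
# tables p1(x1), p2(x2|x1), p3(x3|x2) chosen arbitrarily for illustration.
p1 = {0: 0.6, 1: 0.4}
p2 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # p2[(x2, x1)]
p3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}   # p3[(x3, x2)]

def joint(x1, x2, x3):
    # p(x) as a product of potentials, Eq. 1
    return p1[x1] * p2[(x2, x1)] * p3[(x3, x2)]

def marginal_x1_given_e(x3_evidence):
    # Eq. 2: p(x1|e) = p(x1, e) / p(e); the sum over hidden states below
    # scales as exp(#hidden variables), which is the source of intractability.
    num = {0: 0.0, 1: 0.0}
    for x1, x2 in itertools.product([0, 1], repeat=2):
        num[x1] += joint(x1, x2, x3_evidence)
    z = num[0] + num[1]                      # this is p(e)
    return {x1: num[x1] / z for x1 in (0, 1)}

post = marginal_x1_given_e(x3_evidence=1)
assert abs(post[0] + post[1] - 1.0) < 1e-12
```

For n hidden binary variables this loop has 2^n terms, which is exactly the exponential scaling noted above.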
\n\nWe propose to approximate this problem by using a mean field approach. Consider a factorized distribution on the hidden variables s: \n\nq(s) = prod_i q_i(s_i)   (3) \n\nWe wish to find the factorized distribution q that best approximates p(s|e). Consider as a distance measure \n\nKL = sum_s p(s|e) log( p(s|e) / q(s) )   (4) \n\nIt is easy to see that the q that minimizes KL satisfies: \n\nq_i(s_i) = p(s_i|e)   (5) \n\nWe now think of the manifold of all probability distributions of the form Eq. 1, spanned by the coordinates phi_k(x^k), k = 1, ..., m. For each k, phi_k(x^k) is a table of numbers, indexed by x^k. This manifold contains a submanifold of factorized probability distributions in which the potentials factorize: phi_k(x^k) = sum_{i in k} phi_{ki}(x_i). When in addition sum_{k: i in k} phi_{ki}(x_i) = log q_i(x_i), i in s, p(s|e) reduces to q(s). \n\nAssume now that p(s|e) is somehow close to the factorized submanifold. The difference Delta p(s_i|e) = p(s_i|e) - q_i(s_i) is then small, and we can expand this small difference in terms of changes in the parameters Delta phi_k(x^k) = phi_k(x^k) - log q(x^k), k = 1, ..., m: \n\nDelta log p(s_i|e) = sum_{k=1}^m sum_{x^k} ( d log p(s_i|e) / d phi_k(x^k) )_q Delta phi_k(x^k) + (1/2) sum_{k,l} sum_{x^k, y^l} ( d^2 log p(s_i|e) / d phi_k(x^k) d phi_l(y^l) )_q Delta phi_k(x^k) Delta phi_l(y^l) + higher order terms   (6) \n\nThe differentials are evaluated in the factorized distribution q. The left-hand side of Eq. 6 is zero because of Eq. 5, and we solve for q(s_i). This factorized distribution gives the desired marginals up to the order of the expansion of Delta log p(s_i|e). It is straightforward to compute the derivatives. Writing ⟨·⟩_{k∩l} for the expectation with respect to q conditioned on the variables in k ∩ l, we can precompute ⟨Delta phi_k⟩_{k∩l} for all pairs of overlapping cliques k, l, for all states in k ∩ l. Therefore, the worst-case complexity of the second order term is less than n_i n_overlap exp(c). 
Thus, we see that the second order method has the same exponential complexity as the first order method, but with a different polynomial prefactor. Therefore, the first or second order method can be applied to directed graphical models as long as the number of parents is reasonably small. \n\nThe fact that the second order term has a worse complexity than the first order term is in contrast to Boltzmann Machines, in which the TAP approximation has the same complexity as the standard mean field approximation. This phenomenon also occurs for a special class of DAGs, which we call single-overlap graphs. These are graphs in which the potentials k share at most one node. Figure 1 shows an example of a single-overlap graph. \n\nFor single-overlap graphs, we can use the first order result Eq. 9 to simplify the second order correction. The derivation is rather tedious and we just present the result \n\nq_i(s_i) ∝ exp( ⟨log p(x)⟩_{s_i} + (1/2) sum_{l: i in l} ( ⟨(Delta phi_l)^2⟩_{s_i} - ⟨Delta phi_l⟩_{s_i}^2 ) - sum_{l: i in l} ⟨⟨Delta phi_l⟩_{s_i} Delta phi_l⟩ )   (12) \n\nwhich has a complexity that is of order n_i (c - 1) exp(c). For probability distributions with many small potentials that share nodes with many other potentials, Eq. 12 is more efficient than Eq. 11. For instance, for Boltzmann Machines n_i = n_overlap = n - 1 and c = 2. In this case, Eq. 12 is identical to the TAP equations (Thouless et al., 1977). \n\n4 Sigmoid belief networks \n\nIn this section, we consider sigmoid belief networks as an interesting class of directed graphical models. The reason is that one can expand in terms of the couplings instead of the potentials, which is more efficient. The sigmoid belief network is defined as \n\np(x) = prod_{i=1}^n sigma(x_i h_i(x))   (13) \n\nFigure 2: Interpretation of different interaction terms appearing in Eq. 16. The open and shaded nodes are hidden and evidence nodes, respectively (except in (a), where k can be any node). Solid arrows indicate the graphical structure in the network. 
Dashed arrows indicate interaction terms that appear in Eq. 16. \n\nwhere sigma(x) = (1 + exp(-2x))^{-1}, x_i = ±1, and h_i is the local field: h_i(x) = sum_{j=1}^n w_ij x_j + theta_i. \n\nWe separate the variables in evidence variables e and hidden variables s: x = (s, e). When couplings from hidden nodes to either hidden or evidence nodes are zero, w_ij = 0 for i in e, s and j in s, the probability distributions p(s|e) and p(e) reduce to \n\np(s|e) -> q(s) = prod_{i in s} sigma(s_i theta-bar_i)   (14) \n\np(e) -> prod_{i in e} r(e_i), with r(e_i) = sigma(e_i theta-bar_i)   (15) \n\nwhere theta-bar_i = sum_{j in e} w_ij e_j + theta_i depends on the evidence. \n\nWe expand to second order around this tractable distribution and obtain \n\nm_i = tanh( sum_{k in s,e} w_ik m_k + theta_i + 2 sum_{k in e} r(-e_k) e_k w_ki - m_i sum_{k in s} (1 - m_k^2) w_ik^2 + 4 m_i sum_{k in e} r(e_k) r(-e_k) w_ki^2 - 4 sum_{k in e, l in s} r(e_k) r(-e_k) m_l w_kl w_ki + 2 sum_{k in s, l in e} (1 - m_k^2) r(-e_l) e_l w_lk w_ki )   (16) \n\nwith m_i = ⟨s_i⟩_q ≈ ⟨s_i⟩_p, m_k = e_k for k in e, and r given by Eq. 15. \n\nThe different terms that appear in this equation can be easily interpreted. The first term describes the lowest order forward influence on node i from its parents. Parents can be either evidence or hidden nodes (fig. 2a). The second term is the bias theta_i. The third term describes to lowest order the effect of Bayes' rule: it affects m_i such that the observed evidence on its children becomes most probable (fig. 2b). Note that this term is absent when the evidence is explained by the evidence nodes themselves: r(e_k) = 1. The fourth and fifth terms are the quadratic contributions to the first and third terms, respectively. The sixth term describes 'explaining away': the effect of hidden node l on node i, when both have a common observed child k (fig. 2c). The last term describes the effect on node i when its grandchild is observed (fig. 2d). \n\nNote that these equations are different from Eq. 10. When one applies Eq. 10 to sigmoid belief networks, one requires additional approximations to compute ⟨log sigma(x_i h_i)⟩ (Saul et al., 1996). 
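To make the structure of these equations concrete, here is a minimal sketch (our own, with hypothetical weights and biases; it implements only the lowest-order part of Eq. 16, m_i = tanh(sum_k w_ik m_k + theta_i), without the second order correction terms), solved by fixed-point iteration:

```python
import math

def mean_field(w, theta, n_iter=200):
    # First order mean field equations m_i = tanh(sum_k w[i][k]*m_k + theta_i),
    # iterated to a fixed point starting from m = 0.
    n = len(theta)
    m = [0.0] * n
    for _ in range(n_iter):
        m = [math.tanh(sum(w[i][k] * m[k] for k in range(n)) + theta[i])
             for i in range(n)]
    return m

# Hypothetical feed-forward weights: w[i][k] != 0 only for parent k < i,
# so the fixed point is reached after a few sweeps.
w = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.3, -0.4, 0.0]]
theta = [0.2, 0.0, 0.1]
m = mean_field(w, theta)
assert all(-1.0 < v < 1.0 for v in m)
```

In the paper's setting, the extra terms of Eq. 16 would be added inside the tanh, and the iteration would be started from this first order solution.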
\n\nFigure 3: Second order approximation for a fully connected sigmoid belief network of n nodes. a) nodes 1, ..., n_1 are hidden (white) and nodes n_1 + 1, ..., n are clamped (grey), n_1 = n/2; b) CPU time for exact inference (dashed) and second order approximation (solid) versus n_1 (J = 0.5); c) RMS of hidden node exact marginals (solid) and RMS error of second order approximation (dashed) versus coupling strength J (n_1 = 10). \n\nSince only feed-forward connections are present, one can order the nodes such that w_ij = 0 for i < j. Then the first order mean field equations can be solved in one single sweep, starting with node 1. The full second order equations can be solved by iteration, starting with the first order solution. \n\n5 Numerical results \n\nWe illustrate the theory with two toy problems. The first one is inference in Lauritzen's chest clinic model (ASIA), defined on 8 binary variables x = {A, T, S, L, B, E, X, D} (see figure 1, and (Lauritzen and Spiegelhalter, 1988) for more details about the model). We computed exact marginals and approximate marginals using the approximating methods up to first (Eq. 10) and second order (Eq. 11), respectively. The approximate marginals are determined by sequential iteration of (10) and (11), starting at q(x_i) = 0.5 for all variables i. The maximal error in the marginals using the first and second order method is 0.213 and 0.061, respectively. We verified that the single-overlap expression Eq. 12 gave similar results. \n\nIn fig. 3, we assess the accuracy and CPU time of the second order approximation Eq. 16 for sigmoid belief networks. 
We generate random fully connected sigmoid belief networks with w_ij drawn from a normal distribution with mean zero and variance J^2/n, and theta_i = 0. We observe in fig. 3b that the computation time is very fast: for n_1 = 500, we obtained convergence in 37 seconds on a 300 MHz Pentium processor. The accuracy of the method depends on the size of the weights and is computed for a network of n_1 = 10 (fig. 3c). In (Kappen and Wiegerinck, 2001), we compare this approach to Saul's variational approach (Saul et al., 1996) and show that our approach is much faster and slightly more accurate. \n\n6 Discussion \n\nIn this paper, we computed a second order mean field approximation for directed graphical models. We show that the second order approximation gives a significant improvement over the first order result. The method does not use explicitly that the graph is directed. Therefore, the result is equally valid for Markov graphs. \n\nThe complexity of the first and second order approximation is O(n_i exp(c)) and O(n_i n_overlap exp(c)), respectively, with c the number of variables in the largest potential. For single-overlap graphs, one can rewrite the second order equation such that the computational complexity reduces to O(n_i (c - 1) exp(c)). Boltzmann Machines and the ASIA network are examples of single-overlap graphs. \n\nFor large c, additional approximations are required, as was proposed by (Saul et al., 1996) for the first order mean field equations. It is evident that such additional approximations are then also required for the second order mean field equations. \n\nIt has been reported (Barber and Wiegerinck, 1999; Wiegerinck and Kappen, 1999) that similar numerical improvements can be obtained by using a very different approach, which is to use an approximating distribution q that is not factorized, but still tractable. 
A promising way to proceed is therefore to combine both approaches and to do a second order expansion around a manifold of non-factorized yet tractable distributions. In this approach, the sufficient statistics of the tractable structure are expanded, rather than the marginal probabilities. \n\nAcknowledgments \n\nThis research was supported in part by the Dutch Technology Foundation (STW). \n\nReferences \n\nBarber, D. and Wiegerinck, W. (1999). Tractable variational structures for approximating graphical models. In Kearns, M., Solla, S., and Cohn, D., editors, Advances in Neural Information Processing Systems, volume 11, pages 183-189. MIT Press. \n\nKappen, H. and Rodriguez, F. (1998). Efficient learning in Boltzmann Machines using linear response theory. Neural Computation, 10:1137-1156. \n\nKappen, H. and Spanjers, J. (1999). Mean field theory for asymmetric neural networks. Physical Review E, 61:5658-5663. \n\nKappen, H. and Wiegerinck, W. (2001). Mean field theory for graphical models. In Saad, D. and Opper, M., editors, Advanced Mean Field Theory. MIT Press. \n\nLauritzen, S. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50:157-224. \n\nPlefka, T. (1982). Convergence condition of the TAP equation for the infinite-range Ising spin glass model. Journal of Physics A, 15:1971-1978. \n\nSaul, L., Jaakkola, T., and Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76. \n\nThouless, D., Anderson, P., and Palmer, R. (1977). Solution of 'Solvable Model of a Spin Glass'. Phil. Mag., 35:593-601. \n\nWiegerinck, W. and Kappen, H. (1999). Approximations of Bayesian networks through KL minimisation. 
New Generation Computing, 18:167-175.", "award": [], "sourceid": 1890, "authors": [{"given_name": "Hilbert", "family_name": "Kappen", "institution": null}, {"given_name": "Wim", "family_name": "Wiegerinck", "institution": null}]}