{"title": "Efficient Approaches to Gaussian Process Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 257, "abstract": null, "full_text": "Efficient  Approaches to  Gaussian Process \n\nClassification \n\nLehel Csato,  Ernest  Fokoue,  Manfred  Opper,  Bernhard  Schottky \n\nNeural Computing Research Group \n\nSchool of Engineering and Applied Sciences \nAston  University Birmingham B4  7ET, UK. \n\n{opperm,csatol}~aston.ac.uk \n\nTheoretical Physics II,  Lund University, Solvegatan 14  A, \n\nOle Winther \n\nS-223  62  Lund,  Sweden \nwinther~thep.lu.se \n\nAbstract \n\nWe  present  three  simple  approximations  for  the  calculation  of \nthe  posterior  mean  in  Gaussian  Process  classification.  The  first \ntwo  methods  are  related  to mean field  ideas  known  in  Statistical \nPhysics.  The third approach is based on Bayesian online approach \nwhich was  motivated by recent results in the Statistical Mechanics \nof Neural Networks.  We present simulation results showing:  1. that \nthe mean field  Bayesian evidence may be used for  hyperparameter \ntuning and 2.  that the online approach may achieve a low  training \nerror fast. \n\n1 \n\nIntroduction \n\nGaussian  processes  provide  promising  non-parametric  Bayesian  approaches  to  re(cid:173)\ngression and classification [2,  1].  In these statistical models, it is  assumed that the \nlikelihood of an output or target variable y for  a given input x  E RN can be written \nas  P(Yla(x))  where  a  :  RN  --+  R  are functions  which  have  a  Gaussian prior distri(cid:173)\nbution, i.e.  a  is  (a priori) assumed to be a  Gaussian random field.  This means that \nany  finite  set  of field  variables  a(xi),  i  =  1, ... ,l are jointly Gaussian  distributed \nwith a given covariance E[a(xi)a(xj)] =  K(Xi' Xj)  (we will also assume a zero mean \nthroughout the paper). \nPredictions on a(x) for  novel inputs x, when a set D of m  training examples (Xi, Yi) \ni  =  1, . . . , m  , is given, can be computed from the posterior distribution of the m + 1 \nvariables  a(x)  and  a(xd, ... ,a(xm).  A  major  technical  problem of the  Gaussian \nprocess models is the difficulty of computing posterior averages as high dimensional \nintegrals, when the likelihood is not Gaussian.  This happens for  example in classifi(cid:173)\ncation problems.  So far,  a variety of approximation techniques have been discussed: \nMonte Carlo sampling [2],  the MAP approach  [4],  bounds on the likelihood [3]  and \na TAP mean field  approach [5].  In this paper, we will introduce three different novel \nmethods for  approximating the posterior mean of the random field  a(x),  which  we \nthink are simple enough to be used in practical applications.  Two of the techniques \n\n\f252 \n\nL.  Csato, E.  Fokoue,  M  Opper,  B.  Schottky and 0.  Winther \n\nare based on  mean field  ideas from  Statistical Mechanics,  which  in contrast to the \npreviously developed TAP approach are easier to implement.  They also yield simple \napproximations to the total likelihood of the data (the evidence)  which can be used \nto tune the hyperparameters in the covariance kernel K  (The Bayesian evidence (or \nMLII)  framework  aims at maximizing the likelihood of the data). \n\nWe  specialize  to  the  case  of a  binary  classification problem,  where  for  simplicity, \nthe class label Y = \u00b1 1 is  assumed to be noise free  and the likelihood is  chosen as \n(1) \nwhere  8(x)  is  the unit  step  function,  which  equals  1 for  x  >  0  and  zero else.  We \nare interested in  computing efficient  approximations to the posterior mean  (a(x)), \nwhich  we  will  use  for  a  prediction  of  the  labels  via  Y  =  sign(a(x)),  where  ( .. . ) \ndenotes the posterior expectation.  If the posterior distribution of a(x)  is symmetric \naround its mean,  this will give the Bayes optimal prediction. \n\nP(Yla)  =  8(ya)  , \n\nBefore  starting,  let  us  add  two  comments  on  the  likelihood  (1).  First,  the  MAP \napproach  (i.e.  predicting with the fields  a  that maximize the posterior)  would  not \nbe  applicable,  because  it  gives  the  trivial  result  a(x)  =  O.  Second,  noise  can  be \neasily introduced within a  probit model [2],  all subsequent calculations will only be \nslightly altered.  Moreover,  the  Gaussian  average  involved in  the  definition  of the \nprobit likelihood can always be shifted from the likelihood into the Gaussian process \nprior, by a redefinition of the fields a  (which does not change the prediction), leaving \nus  with the simple likelihood  (1)  and a  modified process covariance [5] . \n\n2  Exact  Results \n\nAt first  glance,  it may seem that in order to calculate (a(x))  we  have to deal with \nthe joint  posterior of the  fields  ai  = a(xi) '  i  = 1, ... , m  together with the field  at \nthe test point a(x) .  This would imply that for  any test point, a different new m + 1 \ndimensional average has to be performed.  Actually, we will show that this is not the \ncase.  As  above let E  denote the expectation over the Gaussian prior.  The posterior \nexpectation at any point, say x \n\nE [a (x) TI7=l  P(Yj laj)] \n(a(x))  = --=-~---~--\"-\n\nE [TI7=l P(Yj laj)] \n\ncan by  integration by  parts-for any likelihood-be written as \n\n))  ~  ) \n\n(a(x  = L...J K(x, Xj  ajYj \n\nj \n\nand  \",.=y . (CHnp(Yj1a j )) \n\n.... J \n\nJ \n\naa . \nJ \n\n(2) \n\n(3) \n\nshowing that aj  is  not dependent on the test point x.  It is  therefore not  necessary \nto compute a  m + 1 dimensional average for  every prediction. \nWe  have  chosen the  specific  definition  (3)  in  order  to stress  the similarity to  pre(cid:173)\ndictions  with  Support  Vector  Machines  (for  the  likelihood  (1),  the  aj  will  come \nout  nonnegative) .  In  the  next  sections  we  will  develop  three  approaches  for  an \napproximate computation of the aj. \n\n3  Mean Field Method I:  Ensemble  Learning \n\nOur first  goal is  to approximate the true posterior distribution \na  II P(Y \u00b7la\u00b7) \n\np(alD  ) -\n\n1 \n-\n\n1  T \n\ne-\"2 a  K \n\n1 \n\n-1 \n\nm \n\nm  - Z  J(27r)m detK \n\nj=l \n\nJ \n\nJ \n\n(4) \n\n\fEfficient Approaches to Gaussian Process Classification \n\n253 \n\nof a  ==  (al,\"\"  am)  by  a  simpler,  tractable  distribution  q.  Here,  K  denotes  the \ncovariance  matrix  with  elements  Kij  =  K (Xi, Xj).  In  the  variational  mean  field \napproach-known  as  ensemble  learning  in  the  Neural  Computation  Community,-\nthe relative entropy distance K L(q,p) =  J da q(a) In :~:~  is minimized in the family \nof  product  distributions  q(a)  =  TI,7=l qj(aj).  This  is  in  contrast  to  [3],  where  a \nvariational bound on the likelihood is  computed.  We  get \n\nKL(q,p) = \n\nqi(ai) \n\n! \n~ daiqi(ai) In P(Yilai)  + \n, \n~  L  [K-1Lj  (ai)O(aj)o + ~ L [K-1Li (a~)o \ni,j,i#j \n\ni \n\nwhere  ( .. ')0  denotes  expectation  w.r.t.  q.  By  setting  the  functional  derivative \nof  KL(q,p)  with  respect  to  qi(a)  equal  to  zero,  we  find  that  the  best  product \ndistribution is a  Gaussian prior times the original Likelihood: \n\nqi(a)  ex:  P(Yila) ~e- hi \n\n1 \n\n(o-\"'i)2 \n\n(5) \n\nwhere  mi  =  -Ai :Ej,#i(K-l)ij(aj)o and  Ai  =  [K-l]:l.  Using this  specific  form \nfor the approximated posterior q(a), replacing the average over the true posterior in \n(3)  by the approximation (5),  we  get (using the likelihood (I)) a set of m  nonlinear \nequations in the unknowns aj: \n\nwhere  D(z)  =  e- z2 / 2/..f2i  and  <I>(z)  =  J~oo dt  D(t).  As  a  useful  byproduct  of \nthe variational approximation, an upper  bound on the Bayesian evidence  P(D)  = \nJ da  7r(a)P{Dla)  can  be  derived. \n(71'  denotes  the  Gaussian  process  prior  and \nP{Dla)  =  TI,7=l  P(Yjlaj)).  The  bound  can  be  written  in  terms  of the mean  field \n'free energy' as \n\n-lnP(D)  <  Eqlnq(a) -Eqln[7r(a)P(Dla)] \n\n- '''In<l> (Y' m j\n~  , JXi \n, \n\n)  + ~ \"'y \u00b7a \u00b7(K\u00b7 \u00b7 -8\" A\u00b7)y \u00b7a\u00b7 \n'J'  3  3 \n\n2 ~ \" \n\n'3 \n\n~ \n\n(7) \n\nwhich can be used  as  a  yardstick for  selecting appropriate hyperparameters in the \ncovariance kernel. \n\nThe ensemble learning approach has the little drawback,  that it requires inversion \nof the covariance matrix K  and, for the free  energy (7)  one must compute a  deter(cid:173)\nminant.  A second,  simpler approximation avoids these computations. \n\n4  Mean Field Theory II:  A  'Naive'  Approach \n\nThe second mean field  theory aims at working directly with the variables aj.  As  a \nstarting point,  we  consider the partition function  (evidence), \n\nZ  =  P(D) = ! dze-tzTKz  IT P(Yjlzj)  , \n\n;=1 \n\n(8) \n\n\f254 \n\nL.  Csato.  E.  Fokoue.  M.  Opper.  B.  Schottky and O.  Winther \n\nwhich follows from  (4)  by  a standard Gaussian integration, introducing the Fourier \ntransform  of  the  Likelihood  .P(Ylz)  = I  g~ eiaz P(Yla)  with  i  being  the  imaginary \nunit.  It is  tempting to view  (8)  as  a  normalizing partition function  for  a  Gaussian \nprocess  Zi  having  covariance  matrix  K-l and  likelihood  P.  Unfortunately,  P is \nnot  a  real number  and  precludes  a  proper  probabilistic interpretation.  N everthe(cid:173)\nless,  dealing formally with the complex measure defined by  (8), integration by parts \nshows that one has YjCl!j  = -i(zj)., where the brackets  ( ... ). denote a average over \nthe  complex  measure.  This  suggests  a  simple  approximation for  calculating  the \nCl!j.  One  may  think  of trying  a  saddle-point  (or  steepest  descent)  approximation \nto  (8)  and  replace  (Zj).  by  the  value  of Zj  (in  the  complex Z plane)  which  makes \nthe integrand stationary thereby  neglecting the fluctuations  of the Zj.  Hence,  this \napproximation would treat expectations of products as  (ZiZj).  as  (Zi).(Zj)*,  which \nmay  be  reasonable for  i  i=  j,  but  definitely  not  for  the  self-correlation i  = j.  Ac(cid:173)\ncording  to  the  general  formalism of mean  field  theories  (outlined  e.g.  in  [6]),  one \ncan improve on that idea, by treating the  'self-interactions' z;  separately.  This can \nbe done by replacing all Zi  (except in the form zi) by a new variable J.Li  by inserting \na  Dirac 8 function  representation 8(z - J.L)  = J ~r;:e-im(z-J1.)  into (8)  and integrate \nover  the  Z  and  a  variables  exactly  (the  integral factorizes),  and  finally  perform  a \nsaddle-point integration over the m  and J.L  variables.  The details of this calculation \nwill be given elsewhere.  Within the saddle-point approximation, we  get the system \nof nonlinear equations \n\nwhich  is  of  the  same  form  as  (6)  with  Aj  replaced  by  the  simpler  K jj .  These \nequations have also been derived by us in [5]  using a Callen identity, but our present \nderivation allows also for an approximation to the evidence.  By plugging the saddle(cid:173)\npoint values back into the partition function,  we  get \n\ni,i=l=j \n\ni,i=l=j \n\n-lnP(D) ~ - ~ln~ (Y'~) + ~ ~Y'Cl!'(K\"  - 8\u00b7 \u00b7K .. )y\u00b7Cl!\u00b7 \n\n~  t  V Kii \n\n2 ~ t \n\ntJ \n\nt \n\ntJ \n\n11 \n\nJ \n\nJ \n\nt \n\n~ \n\nwhich  is  also simpler  to  compute than  (7)  but  does  not  give  a  bound  on  the  true \nevidence. \n\n5  A  sequential  Approach \n\nBoth previous algorithms do not give an explicit expression for  the posterior mean, \nbut  require  the solution of a  set  of nonlinear equations.  These  must  be  obtained \nby an iterative procedure.  We  now present a different approach for  an approximate \ncomputation  of  the  posterior  mean,  which  is  based  on  a  single  sequential  sweep \nthrough the whole  dataset giving an explicit update of the posterior. \n\nThe algorithm is based on a recently proposed Bayesian approach to online learning \n(see  [8]  and the articles of Opper and Winther& Solla in [9]).  Its basic idea applied \nto  the  Gaussian  process  scenario,  is  as  follows:  Suppose,  that  qt  is  a  Gaussian \napproximation to the posterior after  having seen  t  examples.  This means  that we \napproximate the  posterior  process  by  a  Gaussian  process  with  mean  (a(x))t  and \ncovariance  Kt(x, y),  starting  with  (a(x))o  = 0  and  Ko(x, y)  =  K(x, y).  After  a \nnew  data point Yt+l  is  observed,  the posterior is updated  according to Bayes rule. \nThe  new  non-Gaussian posterior qt  is  projected back into the  family  of Gaussians \nby  choosing the closest  Gaussian qt+l  minimizing the relative entropy K L(qt, qt+d \n\n\fEfficient Approaches to Gaussian Process Classification \n\n255 \n\nin  order to  keep  the  loss  of information small.  This  projection  is  equivalent  to  a \nmatching of the first  two moments of lit  and qt+1 '  E.g., for  the first  moment we  get \n\n(a(x))t+l  = \n\n(P( \n\n(a(x)  P(Yt+1la(Xt+d))t \n\nI ( \n\nYt+l  a  Xt+l \n\n))) \nt \n\n= (a(x))t  + Kl (t)Kt(x, Xt+l) \n\nwhere  the  second  line  follows  again  from  an  integration  by  parts  and  Kt{t) \n1!:!\u00b1!. ~ with  z  - Yt\u00b1l (a(Xt\u00b1l)t  and  ~2(t) - K  (x \nx \nthe corresponding one for  K t  can be solved by the ansatz \n\nt+l,  t+l  . \n\n)  This recursion and \n\nu \n\ntJ>(zt} \n\n-\n\nt \n\nu(t) \n\nt  -\n\n(a(x))t \n\nj=l \n\nt L K(x, xj)Yjaj(t) \nL K(x, Xi)Cij (t)K(x, Xj) + K(x, y) \ni,j \n\n(10) \n\n(11) \n\nwhere  the  vector  a(t)  =  (al,\"\"  at, 0, 0, ... )  and  the  matrix C(t)  (which  has  also \nonly txt nonzero elements)  are updated as \n\na(t + 1) \nC(t + 1) \n\na(t) + Kl(t) (C(t)kt+l + et+l) @y \nC(t) + K2(t)  (C(t)kt+l + et+1) (C(t)kt+1 + et+1f \n\n(12) \n\nwhere  K2 (t)  =  ~ { ~~~:\u00bb) - (~gt\\) r},  k t \n\nis  the  vector  with  elements  K tj , \n\nj  =  1 ... , t and  @  denotes the element-wise product between  vectors.  The sequen(cid:173)\ntial  algorithm defined  by  (10)-(12)  has  the advantage of not requiring any  matrix \ninversions.  There is also no need to solve a numerical optimization problem at each \ntime as in the approach of [11]  where a  different update of a Gaussian posterior ap(cid:173)\nproximation was proposed.  Since we do not require a linearization of the likelihood, \nthe method is  not equivalent to the extended Kalman Filter approach. \nSince  it  is  possible  to  compute  the  evidence  of  the  new  datapoint  P(Yt+1)  = \n(P(Yt+1lat+d)t  based on the old posterior, we can compute a further approximation \nto the log evidence for  m  data via In P(Dm)  =  2:~lln(P(Yt+llat+1)k \n\n6  Simulations \n\nWe present two sets of simulations for the mean field approaches.  In the first, we test \nthe Bayesian evidence framework for  tuning the hyperparameters of the covariance \nfunction  (kernel).  In  the  second,  we  test  the  ability of the sequential approach to \nachieve low  training error and a  stable test error for  fixed  hyperparameters. \n\nFor  the  evidence  framework,  we  give  simulation  results  for  both  mean  field  free \nenergies  (7)  and  (10)  on  a  single  data  set,  'Pima Indian  Diabetes  (with  200/332 \ntraining/test-examples and  input  dimensionality  d  =  7)  [7].  The  results  should \ntherefore not be taken as  a  conclusive evidence for  the merits of these  approaches, \nbut  simply  as  an  indication  that  they  may  give  reasonable  results.  We  use  the \nradial basis function covariance function  K(x,x') =  exp (-~ 2:~WI(XI - XD2) \n. A \ndiagonal term v is added to the covariance matrix corresponding to a Gaussian noise \nadded  to  the  fields  with  variance  v  [5].  The  free  energy,  -lnP(D)  is  minimized \nby  gradient  descent  with  respect  to  v  and  the  lengthscale  parameters  WI, \u2022 .. , Wd \nand the  mean field  equations for  aj  are solved by iteration before  each  update of \nthe hyperparameters  (further  details will  be given elsewhere).  Figure  1 shows  the \nevolution of the  naive  mean  free  energy  and  the  test  error  starting from  uniform \n\n\f256 \n\nL.  Csat6, E.  Fokoue,  M.  Opper,  B.  &hottky and 0.  Winther \n\nws. \nIt  typically  requires  of the  order  of  10  iteration  steps  of  the  a;-equations \nbetween each hyperparameter update.  We  also used hybrid approaches, where the \nfree  energy  was  minimized  by  one  mean field  algorithm and the  hyperparameters \nused in the other.  As it may be seen from  table 1,  the naive mean field  theory can \noverestimate the free  energy  (since the ensemble free  energy is  an upper bound to \nthe free  energy).  The overestimation is not nearly as severe at the minimum of the \nnaive mean field  free  energy.  Another interesting observation is that as long as the \nsame  hyperparameters are  used  the  actual performance  (as  measured  by  the test \nerror)  is  not  very  sensitive to the algorithm used.  This  also seems to be  the case \nfor  the TAP mean field  approach and Support Vector Machines [5]. \n\n115..-----~----~-------,-----__, \n\n74..-----~----~-------,-----__, \n\n0114  \\\\ \nIi:\" \n' \n.E \n1.113 \n>. \n~ \nQ) c: \n~112 \n~ u... \n\n111 \n\no \n\n, , \n\n\\ \n\n, \n\n'. \"'-----\n\n70 g \nW68 \n~ \n1-66 \n\n64 \n\n20 \n\n~ \n\n60 \n\n~ \n\n62 \n\n0~----2~0----~~~--~~~--~~ \n\nIterations \n\nIterations \n\nFigure  1:  Hyperparameter optimization for  the  Pima Indians  data  set  using  the \nnaive  mean  field  free  energy.  Left  figure:  The  free  energy  as  a  function  of  the \nnumber of hyperparameter updates.  Right figure:  The test error count (out of 332) \nas  a function  of the number of hyperparameter updates. \n\nTable  1:  Pima Indians  dataset.  Hyperparameters found  by  free  energy minimiza(cid:173)\ntion.  Left column gives the free energy -lnP(D) used in hyperparameter optimiza(cid:173)\ntion.  Test error counts in range 63- 75  have previously been reported  [5] \nNaive MF \n\nEnsemble MF \n\nFree Energy minimization \nEnsemble Mean Field, eq.  (7) \nNaive Mean Field, eq.  (10) \n\nError \n\n-lnP(D)  Error \n\n72 \n62 \n\n100.6 \n107.0 \n\n70 \n62 \n\n-lnP(D) \n\n183.2 \n110.9 \n\nFor  the sequential algorithm, we  have studied the  sonar [10]  and  crab  [7]  datasets. \nSince we have not computed an approximation to the evidence so far,  a simple fixed \npolynomial kernel was  used.  Although a probabilistic justification of the algorithm \nis only valid, when a  single sweep through the data is used (the independence of the \ndata is assumed), it is tempting to reuse the same data and iterate the procedure as a \nheuristic.  The two plots show that in this way, only a small improvement is obtained, \nand it seems that the method is  rather efficient in extracting the information from \nthe data in a  single  presentation.  For  the sonar dataset,  a  single sweep  is  enough \nto achieve zero training error. \nAcknowledgements:  BS  would  like  to  thank  the  Leverhulme  Trust  for  their  support \n(F /250/K).  The work  was  also  supported by EPSRC  Grant GR/L52093. \n\n\fEfficient Approaches to Gaussian Process Classification \n\n257 \n\nso \n\n45 \n\n40 \n\n35 \n\n30 \n\n25 \n\n20 \n\n15 \n\n10 \n\nI \n1,- ~ \n\nTraining Error \n\n-\n- - Test Error \n\n35 \n\n30 \n\n... \n\ng25 \nw \n\n00 \n\n20 \n\n40 \n\n60 \n\n80 \n\n-\" \n\n20 \n\n15 \n\n10 \n\n5 \n\n1'\\ .. \" \n\n\\  ___ \n\n\\  ,- - ..  -\n\n-\n\n-\n\n-\n\n-\n\n.. \n\n140 \n\n160 \n\n\u00b00L----5~0---F=::::lIQ.-A.1SO'::-----::-2~OO \n\nHeration. \n\nFigure  2:  Training  and  test  errors  during  learning  for  the  sonar  (left)  and  crab \ndataset  (right).  The  vertical  dash-dotted  line  marks  the  end  of the  training  set \nand  the  starting  point  of  reusing  of  it.  The  kernel  function  used  is  K(x, x') \n(1 + x  . x' jm)k  with order k  =  2  (m is  the dimension of inputs) . \n\nReferences \n\n[1]  Williams  C.K.I.  and Rasmussen  C.E.,  Gaussian  Processes  for  Regression,  in  Neural \nInformation  Processing  Systems  8,  Touretzky  D.s,  Mozer  M.C.  and Hasselmo  M.E. \n(eds.),  514-520,  MIT  Press  (1996) . \n\n[2]  Neal  R.M,  Monte  Carlo  Implementation  of  Gaussian  Process  Models  for  Bayesian \n\nRegression  and  Classification,  Technical  Report  9702,  Department of Statistics,  Uni(cid:173)\nversity of Toronto  (1997). \n\n[3]  Gibbs  M.N.  and  Mackay  D.J.C.,  Variational  Gaussian  Process  Classifiers,  Preprint \n\nCambridge University  (1997). \n\n[4]  Williams C.K.I. and Barber D, Bayesian Classification with Gaussian Processes,  IEEE \n\nTrans  Pattern Analysis  and Machine  Intelligence,  20 1342-1351  (1998). \n\n[5]  Opper  M.  and  Winther  O.  Gaussian  Processes \n\nfor  Classification:  Mean \nto  Neural  Computation,  http://www.thep.lu.se \n\nField  Algorithms,  Submitted \n/tf2/staff/winther/  (1999). \n\n[6]  Zinn-Justin J,  Quantum Field  Theory  and  Critical Phenomena,  Clarendon Press Ox(cid:173)\n\nford  (1990) . \n\n[7]  Ripley  B.D,  Pattern  Recognition  and  Neural  Networks,  Cambridge  University  Press \n\n(1996). \n\n[8]  Opper M., Online versus  Offline  Learning from  Random Examples:  General Results, \n\nPhys.  Rev.  Lett. 77,  4671  (1996). \n\n[9]  Online  Learning  in  Neural  Networks,  Cambridge  University  Press,  D.  Saad  (ed.) \n\n(1998). \n\n[10]  Gorman R.P and Sejnowski T .J,  Analysis of hidden units in a layered network trained \n\nto  classify  sonar targets,  Neural  Networks 1,  (1988). \n\n[11]  Jaakkola  T .  and  Haussler  D.  Probabilistic  kernel  regression,  In  Online  Proceedings \n(1999) , \n\nof \nhttp://uncertainty99.microsoft.com/proceedings.htm. \n\nWorkshop \n\n7-th \n\nand \n\nStatistics \n\non \n\nAI \n\nInt. \n\n\f", "award": [], "sourceid": 1694, "authors": [{"given_name": "Lehel", "family_name": "Csat\u00f3", "institution": null}, {"given_name": "Ernest", "family_name": "Fokou\u00e9", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Bernhard", "family_name": "Schottky", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}