{"title": "Online Learning of Dynamic Parameters in Social Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2013, "page_last": 2021, "abstract": "This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation(MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.", "full_text": "Online Learning of Dynamic Parameters\n\nin Social Networks\n\n1Department of Electrical and Systems Engineering, 2Department of Statistics\n\nAli Jadbabaie 1\n\nShahin Shahrampour 1\n\nAlexander Rakhlin 2\n\nUniversity of Pennsylvania\nPhiladelphia, PA 19104 USA\n\n1{shahin,jadbabai}@seas.upenn.edu 2rakhlin@wharton.upenn.edu\n\nAbstract\n\nThis paper addresses the problem of online learning in a dynamic setting. 
We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady-state mean-square deviation (MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

1 Introduction

In recent years, distributed estimation, learning, and prediction have attracted considerable attention in a wide variety of disciplines, with applications ranging from sensor networks to social and economic networks [1–6]. In this broad class of problems, agents aim to learn the true value of a parameter often called the underlying state of the world. The state could represent a product, an opinion, a vote, or a quantity of interest in a sensor network. Each agent observes a private signal about the underlying state at each time period, and communicates with her neighbors to augment her imperfect observations. Despite the wealth of research in this area when the underlying state is fixed (see e.g. 
[1–3, 7]), often the state is subject to some change over time (e.g., the price of stocks) [8–11]. Therefore, it is more realistic to study models which allow the parameter of interest to vary. In the non-distributed context, such models have been studied in the classical literature on time-series prediction and, more recently, in the literature on online learning under relaxed assumptions about the nature of sequences [12]. In this paper, we aim to study the sequential prediction problem in the context of a social network and noisy feedback to agents.
We consider a stochastic optimization framework to describe an online social learning problem where the underlying state of the world varies over time. Our motivation for the current study is the results of [8] and [9], where the authors propose a social learning scheme in which the underlying state follows a simple random walk. However, unlike [8] and [9], we assume a geometric random walk evolution with an associated rate of change. This enables us to investigate the interplay of social learning, network structure, and the rate of state change, especially in the interesting case where the rate is greater than unity. We then pose social learning as an optimization problem in which individuals aim to suffer the smallest possible loss as they observe the stream of signals. Of particular relevance to this work is the work of Duchi et al. [13], where the authors develop a distributed method based on dual averaging of subgradients to converge to the optimal solution. In this paper, we restrict our attention to quadratic loss functions regularized by a quadratic proximal function, but there is no fixed optimal solution as the underlying state is dynamic. In this direction, the key observation is the decomposition of the global loss function. 
We consider two decompositions for the global objective, each of which gives rise to a single-consensus-step belief update mechanism. The first method incorporates the averaged prior beliefs among neighbors with the new private observation, while the second one takes into account the observations in the neighborhood as well. In both scenarios, we establish that the estimates are eventually unbiased, and we characterize an explicit expression for the mean-square deviation (MSD) of the beliefs from the truth, per individual. Interestingly, this quantity depends on the whole spectrum of the communication matrix, which exhibits the prominent role of the network structure in asymptotic learning. We observe that the estimators outperform the upper bound provided for the MSD in previous work [8]. Furthermore, only one of the two proposed estimators can compete with the centralized optimal Kalman filter [14] in certain circumstances. This fact underscores the dependence of optimality on the decomposition of the global loss function. We further highlight the influence of connectivity on learning by quantifying the ratio of the MSD for a complete versus a disconnected network. We see that this ratio is always less than unity, and it can get arbitrarily close to zero under some constraints.
Our next contribution is to provide an upper bound on the regret of the proposed methods, defined as an average of errors in estimating the parameter up to a given time minus the long-run expected loss due to noise and dynamics alone. This finite-time regret analysis is based on recently developed concentration inequalities for matrices, and it complements the asymptotic statements about the behavior of the MSD.
Finally, we examine the trade-off between network sparsity and learning quality at a microscopic level. Under mild technical constraints, we see that losing each connection has a detrimental effect on learning, as it monotonically increases the MSD. 
On the other hand, capturing agents' communications with a graph, we introduce the notion of an optimal edge as the edge whose addition has the most effect on learning in the sense of MSD reduction. We prove that such a friendship is likely to occur between a pair of individuals with high self-reliance that have the fewest common neighbors.

2 Preliminaries

2.1 State and Observation Model

We consider a network consisting of a finite number of agents V = {1, 2, ..., N}. The agents indexed by i ∈ V seek the underlying state of the world, x_t ∈ R, which varies over time and evolves according to

x_{t+1} = a x_t + r_t,    (1)

where r_t is a zero-mean innovation, independent over time with finite variance E[r_t^2] = σ_r^2, and a ∈ R is the expected rate of change of the state of the world, assumed to be available to all agents, and could potentially be greater than unity. We assume the initial state x_0 is a finite random variable drawn independently by nature. At time period t, each agent i receives a private signal y_{i,t} ∈ R, which is a noisy version of x_t, and can be described by the linear equation

y_{i,t} = x_t + w_{i,t},    (2)

where w_{i,t} is a zero-mean observation noise with finite variance E[w_{i,t}^2] = σ_w^2, assumed to be independent over time and across agents, and uncorrelated with the innovation noise. Each agent i forms an estimate or belief about the true value of x_t at time t conforming to an update mechanism that will be discussed later. 
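As a quick illustration, the dynamics (1) and the observations (2) can be simulated in a few lines; the numbers below (network size, horizon, rate a, noise levels) are arbitrary choices for this sketch, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 20          # number of agents (arbitrary)
T = 100         # time horizon (arbitrary)
a = 1.05        # expected rate of change, deliberately greater than unity
sigma_r = 0.1   # innovation noise std
sigma_w = 1.0   # observation noise std

x = np.empty(T + 1)
x[0] = rng.normal()                 # finite random initial state x_0
y = np.empty((T, N))                # y[t, i] is agent i's private signal at time t

for t in range(T):
    y[t] = x[t] + sigma_w * rng.normal(size=N)    # (2): y_{i,t} = x_t + w_{i,t}
    x[t + 1] = a * x[t] + sigma_r * rng.normal()  # (1): x_{t+1} = a x_t + r_t
```

Averaging the N private signals at a fixed time already reduces the effective observation noise, which is the basic resource that communication will exploit.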
Much of the difficulty of this problem stems from the hardness of tracking a dynamic state with noisy observations, especially when |a| > 1, and communication mitigates the difficulty by virtue of reducing the effective noise.

2.2 Communication Structure

Agents communicate with each other to update their beliefs about the underlying state of the world. The interaction between agents is captured by an undirected graph G = (V, E), where V is the set of agents, and if there is a link between agent i and agent j, then {i, j} ∈ E. We let N̄_i = {j ∈ V : {i, j} ∈ E} be the set of neighbors of agent i, and N_i = N̄_i ∪ {i}. Each agent i can only communicate with her neighbors, and assigns a weight p_ij > 0 to any j ∈ N̄_i. We also let p_ii ≥ 0 denote the self-reliance of agent i.

Assumption 1. The communication matrix P = [p_ij] is symmetric and doubly stochastic, i.e., it satisfies

p_ij ≥ 0,   p_ij = p_ji,   and   Σ_{j∈N_i} p_ij = Σ_{j=1}^{N} p_ij = 1.

We further assume the eigenvalues of P are in descending order and satisfy

−1 < λ_N(P) ≤ ... ≤ λ_2(P) < λ_1(P) = 1.

2.3 Estimate Updates

The goal of agents is to learn x_t in a collaborative manner by making sequential predictions. From an optimization perspective, this can be cast as a quest for online minimization of the separable, global, time-varying cost function

min_{x̄∈R} f_t(x̄) = (1/N) Σ_{i=1}^{N} [ f̂_{i,t}(x̄) ≜ (1/2) E(y_{i,t} − x̄)^2 ] = (1/N) Σ_{i=1}^{N} [ f̃_{i,t}(x̄) ≜ Σ_{j=1}^{N} p_ij f̂_{j,t}(x̄) ],    (3)

at each time period t. One approach to tackle the stochastic learning problem formulated above is to employ distributed dual averaging regularized by a quadratic proximal function [13]. 
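Both decompositions in (3) describe the same global cost: since P is doubly stochastic, its columns sum to one, so the local losses f̃_{i,t} average to the same function as the f̂_{i,t}. A minimal numeric check (the specific P and the realized signals are illustrative; realized quadratic losses stand in for the expectations):

```python
import numpy as np

N = 5
# A symmetric doubly stochastic P: rows and columns sum to one.
P = np.full((N, N), 0.1)
np.fill_diagonal(P, 0.6)

rng = np.random.default_rng(4)
y = rng.normal(size=N)          # realized signals at some time period

def f_hat(xbar):
    # First decomposition: purely local quadratic losses.
    return 0.5 * (y - xbar) ** 2

def f_tilde(xbar):
    # Second decomposition: neighborhood-weighted combination of the local losses.
    return P @ f_hat(xbar)

# The two decompositions induce the same global objective at any candidate x̄.
for xbar in (-1.0, 0.0, 2.5):
    assert np.isclose(f_hat(xbar).mean(), f_tilde(xbar).mean())
```

The equality holds for any doubly stochastic P because (1/N)·1ᵀP v = (1/N)·1ᵀv; the decompositions differ only in how the common objective is split across agents, which is exactly what drives the two different updates below.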
To this end, if agent i exploits f̂_{i,t} as the local loss function, she updates her belief as

x̂_{i,t+1} = a( Σ_{j∈N_i} p_ij x̂_{j,t} [consensus update] + α (y_{i,t} − x̂_{i,t}) [innovation update] ),    (4)

while using f̃_{i,t} as the local loss function results in the following update

x̃_{i,t+1} = a( Σ_{j∈N_i} p_ij x̃_{j,t} [consensus update] + α ( Σ_{j∈N_i} p_ij y_{j,t} − x̃_{i,t} ) [innovation update] ),    (5)

where α ∈ (0, 1] is a constant step size that agents place on their innovation update, and we refer to it as the signal weight. Equations (4) and (5) are distinct, single-consensus-step estimators differing in the choice of the local loss function, with (4) using only private observations while (5) averages observations over the neighborhood. We analyze both classes of estimators, noting that one might expect (5) to perform better than (4) due to more information availability.
Note that the choice of constant step size provides insight into the interplay of persistent innovation and the learning abilities of the network. We remark that agents can easily learn the fixed rate of change a by taking ratios of observations, and we assume that this has already been performed by the agents in the past. The case of a changing a is beyond the scope of the present paper. 
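Stacking the beliefs into a vector, updates (4) and (5) take the matrix forms x̂_{t+1} = a(P x̂_t + α(y_t − x̂_t)) and x̃_{t+1} = a(P x̃_t + α(P y_t − x̃_t)). The sketch below runs both on a ring network; the mixing weights, noise levels, and rate a are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def make_ring_P(N, self_weight=0.5):
    """Symmetric doubly stochastic matrix for a cycle: each agent mixes
    its belief with its two ring neighbors (satisfies Assumption 1)."""
    P = np.eye(N) * self_weight
    off = (1.0 - self_weight) / 2.0
    for i in range(N):
        P[i, (i - 1) % N] += off
        P[i, (i + 1) % N] += off
    return P

def step_hat(P, a, alpha, x_hat, y):
    # Estimator (4): consensus on beliefs + innovation from the private signal.
    return a * (P @ x_hat + alpha * (y - x_hat))

def step_tilde(P, a, alpha, x_tilde, y):
    # Estimator (5): consensus on beliefs + innovation from neighborhood-averaged signals.
    return a * (P @ x_tilde + alpha * (P @ y - x_tilde))

rng = np.random.default_rng(1)
N, T, a, alpha = 30, 200, 1.02, 0.6
sigma_r, sigma_w = 0.05, 1.0
P = make_ring_P(N)

x = rng.normal()
x_hat = np.zeros(N)
x_tilde = np.zeros(N)
for _ in range(T):
    y = x + sigma_w * rng.normal(size=N)
    x_hat = step_hat(P, a, alpha, x_hat, y)
    x_tilde = step_tilde(P, a, alpha, x_tilde, y)
    x = a * x + sigma_r * rng.normal()
```

With these values ρ(a(P − αI)) < 1, so both belief vectors track the growing state with bounded error even though |a| > 1.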
We also point out that the real-valued (rather than vector-valued) nature of the state is a simplification that forms a clean playground for the study of the effects of social learning, effects of friendships, and other properties of the problem.

2.4 Error Process

Defining the local error processes ξ̂_{i,t} and ξ̃_{i,t} at time t for agent i as

ξ̂_{i,t} ≜ x̂_{i,t} − x_t   and   ξ̃_{i,t} ≜ x̃_{i,t} − x_t,

and stacking the local errors in vectors ξ̂_t, ξ̃_t ∈ R^N, respectively, such that

ξ̂_t ≜ [ξ̂_{1,t}, ..., ξ̂_{N,t}]^T   and   ξ̃_t ≜ [ξ̃_{1,t}, ..., ξ̃_{N,t}]^T,    (6)

one can show that the aforementioned collective error processes can be described as a linear dynamical system.

Lemma 2. Given Assumption 1, the collective error processes ξ̂_t and ξ̃_t defined in (6) satisfy

ξ̂_{t+1} = Q ξ̂_t + ŝ_t   and   ξ̃_{t+1} = Q ξ̃_t + s̃_t,    (7)

respectively, where

Q = a(P − α I_N),    (8)

and

ŝ_t = (αa)[w_{1,t}, ..., w_{N,t}]^T − r_t 1_N   and   s̃_t = (αa) P [w_{1,t}, ..., w_{N,t}]^T − r_t 1_N,    (9)

with 1_N being the vector of all ones.

Throughout the paper, we let ρ(Q) denote the spectral radius of Q, which is equal to the largest singular value of Q due to symmetry.

3 Social Learning: Convergence of Beliefs and Regret Analysis

In this section, we study the behavior of estimators (4) and (5) in the mean and mean-square sense, and we provide the regret analysis.
In the following proposition, we establish a tight bound on a, under which agents can achieve asymptotically unbiased estimates using a proper signal weight.

Proposition 3 (Unbiased Estimates). 
Given the network G with corresponding communication matrix P satisfying Assumption 1, the rate of change of the social network in (4) and (5) must respect the constraint

|a| < 2 / (1 − λ_N(P)),

to allow agents to form asymptotically unbiased estimates of the underlying state.

Proposition 3 determines the trade-off between the rate of change and the network structure. In other words, as long as the state changes no faster than the rate given in the statement of the proposition, individuals can always track x_t with bounded variance by selecting an appropriate signal weight. However, the proposition does not make any statement on the learning quality. To capture that, we define the steady-state mean-square deviation (MSD) of the network from the truth as follows.

Definition 4 ((Steady-State) Mean-Square Deviation). Given the network G with a rate of change which allows unbiased estimation, the steady state of the error processes in (7) is defined as

Σ̂ ≜ lim_{t→∞} E[ξ̂_t ξ̂_t^T]   and   Σ̃ ≜ lim_{t→∞} E[ξ̃_t ξ̃_t^T].

Hence, the (steady-state) mean-square deviation of the network is the deviation from the truth in the mean-square sense, per individual, and it is defined as

MSD̂ ≜ (1/N) Tr(Σ̂)   and   MSD̃ ≜ (1/N) Tr(Σ̃).

Theorem 5 (MSD). 
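Proposition 3 can be probed numerically. Stability of the error dynamics (7) requires ρ(a(P − αI)) < 1 for some admissible α, i.e. |a| · max_i |λ_i(P) − α| < 1, and minimizing the maximum over α shows that such a weight exists exactly when |a| < 2/(1 − λ_N(P)). A grid-search sketch (the test matrix P below is an arbitrary choice satisfying Assumption 1):

```python
import numpy as np

def can_track(P, a, alphas=np.linspace(0.01, 1.0, 200)):
    """True iff some signal weight alpha in (0, 1] makes rho(a(P - alpha*I)) < 1."""
    lams = np.linalg.eigvalsh(P)
    return any(abs(a) * np.max(np.abs(lams - al)) < 1 for al in alphas)

# An arbitrary symmetric doubly stochastic test matrix (path-like mixing).
P = np.array([[0.7, 0.3, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.3, 0.7]])

lam_min = np.linalg.eigvalsh(P)[0]
threshold = 2.0 / (1.0 - lam_min)       # Proposition 3's critical rate

print(can_track(P, 0.95 * threshold))   # just below the bound: trackable
print(can_track(P, 1.05 * threshold))   # just above the bound: not trackable
```

The best signal weight in this one-dimensional search sits near α = (1 + λ_N(P))/2, which balances the two extreme eigenvalues of P.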
Given the error processes (7) with ρ(Q) < 1, the steady-state MSD for (4) and (5) is a function of the communication matrix P and the signal weight α as follows:

MSD̂(P, α) = R_MSD(α) + Ŵ_MSD(P, α)   and   MSD̃(P, α) = R_MSD(α) + W̃_MSD(P, α),    (10)

where

R_MSD(α) ≜ σ_r^2 / (1 − a^2 (1 − α)^2),    (11)

and

Ŵ_MSD(P, α) ≜ (1/N) Σ_{i=1}^{N} a^2 α^2 σ_w^2 / (1 − a^2 (λ_i(P) − α)^2)   and   W̃_MSD(P, α) ≜ (1/N) Σ_{i=1}^{N} a^2 α^2 σ_w^2 λ_i^2(P) / (1 − a^2 (λ_i(P) − α)^2).    (12)

Theorem 5 shows that the steady-state MSD is governed by all eigenvalues of P, which contribute to W_MSD, the term pertaining to the observation noise, while R_MSD is the penalty incurred due to the innovation noise. Moreover, (5) outperforms (4) due to richer information diffusion, which stresses the importance of the global loss function decomposition.
One might advance a conjecture that a complete network, where all individuals can communicate with each other, achieves a lower steady-state MSD in the learning process, since it provides the most information diffusion among all networks. This intuitive idea is discussed in the following corollary beside a few examples.

Corollary 6. 
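The closed form (10)–(12) can be cross-checked against a direct computation: for estimator (4), the noise ŝ_t in (9) has covariance C = α²a²σ_w² I_N + σ_r² 1_N 1_Nᵀ, so the steady-state covariance is Σ̂ = Σ_{k≥0} Qᵏ C Qᵏ, and MSD̂ = Tr(Σ̂)/N. A numpy-only sketch with an arbitrary P (all parameter values are illustrative):

```python
import numpy as np

N, a, alpha = 4, 1.05, 0.6
sigma_r2, sigma_w2 = 0.04, 1.0

# Arbitrary symmetric doubly stochastic P (complete-graph mixing).
P = np.full((N, N), 0.4 / (N - 1))
np.fill_diagonal(P, 0.6)

Q = a * (P - alpha * np.eye(N))
lams = np.linalg.eigvalsh(P)
assert a * np.max(np.abs(lams - alpha)) < 1      # rho(Q) < 1, Theorem 5 applies

# Closed form (10)-(12) for estimator (4).
R = sigma_r2 / (1 - a**2 * (1 - alpha)**2)
W = np.mean(a**2 * alpha**2 * sigma_w2 / (1 - a**2 * (lams - alpha)**2))
msd_formula = R + W

# Direct steady state: Sigma = sum_k Q^k C (Q^T)^k, truncated.
C = alpha**2 * a**2 * sigma_w2 * np.eye(N) + sigma_r2 * np.ones((N, N))
Sigma = np.zeros((N, N))
M = C.copy()
for _ in range(2000):
    Sigma += M
    M = Q @ M @ Q.T
msd_direct = np.trace(Sigma) / N

print(msd_formula, msd_direct)   # the two values should agree
```

The agreement reflects the eigen-decomposition argument behind (10)–(12): in the eigenbasis of P, the rank-one σ_r² term only excites the top eigenvector 1_N/√N, producing R_MSD, while the isotropic observation-noise term spreads over all eigenvalues, producing W_MSD.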
Denoting the complete, star, and cycle graphs on N vertices by K_N, S_N, and C_N, respectively, and denoting their corresponding Laplacians by L_{K_N}, L_{S_N}, and L_{C_N}, under the conditions of Theorem 5,

(a) For P = I − ((1 − α)/N) L_{K_N}, we have

lim_{N→∞} MSD̂_{K_N} = R_MSD(α) + a^2 α^2 σ_w^2.    (13)

(b) For P = I − ((1 − α)/N) L_{S_N}, we have

lim_{N→∞} MSD̂_{S_N} = R_MSD(α) + a^2 α^2 σ_w^2 / (1 − a^2 (1 − α)^2).    (14)

(c) For P = I − β L_{C_N}, where β must preserve unbiasedness, we have

lim_{N→∞} MSD̂_{C_N} = R_MSD(α) + ∫_0^{2π} [ a^2 α^2 σ_w^2 / (1 − a^2 (1 − β(2 − 2 cos τ) − α)^2) ] dτ/2π.    (15)

(d) For P = I − (1/N) L_{K_N}, we have

lim_{N→∞} MSD̃_{K_N} = R_MSD(α).    (16)

Proof. Noting that the spectra of L_{K_N}, L_{S_N}, and L_{C_N} are, respectively [15], {λ_N = 0, λ_{N−1} = N, ..., λ_1 = N}, {λ_N = 0, λ_{N−1} = 1, ..., λ_2 = 1, λ_1 = N}, and {λ_i = 2 − 2 cos(2πi/N)}_{i=0}^{N−1}, substituting each case in (10), and taking the limit over N, the proof follows immediately.

To study the effect of communication, let us consider the estimator (4). In view of Theorem 5 and Corollary 6, the ratio of the steady-state MSD for a complete network (13) versus a fully disconnected network (P = I_N) can be computed as

lim_{N→∞} MSD̂_{K_N} / MSD̂_disconnected = [σ_r^2 + a^2 α^2 σ_w^2 (1 − a^2 (1 − α)^2)] / [σ_r^2 + a^2 α^2 σ_w^2] ≈ 1 − a^2 (1 − α)^2,

for σ_r^2 ≪ σ_w^2. The ratio above can get arbitrarily close to zero, which, indeed, highlights the influence of communication on the learning quality.
We now consider the Kalman filter (KF) [14] as the optimal centralized counterpart of (5). It is well-known that the steady-state KF satisfies a Riccati equation, and when the parameter of interest is scalar, the Riccati equation simplifies to a quadratic with the positive root

Σ_KF = [ a^2 σ_w^2 − σ_w^2 + N σ_r^2 + √( (a^2 σ_w^2 − σ_w^2 + N σ_r^2)^2 + 4 N σ_w^2 σ_r^2 ) ] / (2N).

Therefore, comparing with the complete graph (16), we have

lim_{N→∞} Σ_KF = σ_r^2 ≤ σ_r^2 / (1 − a^2 (1 − α)^2),

and the upper bound can be made tight by choosing α = 1 for |a| < 1/|λ_N(P) − 1|. If |a| ≥ 1/|λ_N(P) − 1|, we should choose an α < 1 to preserve unbiasedness as well.
On the other hand, to evaluate the performance of estimator (4), we consider the upper bound

MSD_Bound = (σ_r^2 + α^2 σ_w^2) / α,    (17)

derived in [8] for a = 1 via a distributed estimation scheme. For simplicity, we assume σ_r^2 = σ_w^2 = σ^2, and let β in (15) be any diminishing function of N. Optimizing (13), (14), (15), and (17) over α, we obtain

lim_{N→∞} MSD̂_{K_N} ≈ 1.55σ^2 < lim_{N→∞} MSD̂_{S_N} = lim_{N→∞} MSD̂_{C_N} ≈ 1.62σ^2 < MSD_Bound = 2σ^2,

which suggests a noticeable improvement in learning even in the star and cycle networks, where the numbers of individuals and connections are of the same order.

Regret Analysis

We now turn to the finite-time regret analysis of our methods. The average loss of all agents in predicting the state, up until time T, is

(1/T) Σ_{t=1}^{T} (1/N) Σ_{i=1}^{N} (x̂_{i,t} − x_t)^2 = (1/T) Σ_{t=1}^{T} (1/N) Tr(ξ̂_t ξ̂_t^T).

As motivated earlier, it is not possible, in general, to drive this average loss to zero, and we need to subtract off the limit. 
We thus define the regret as

R_T ≜ (1/T) Σ_{t=1}^{T} (1/N) Tr(ξ̂_t ξ̂_t^T) − (1/N) Tr(Σ̂) = (1/N) Tr( (1/T) Σ_{t=1}^{T} ξ̂_t ξ̂_t^T − Σ̂ ),    (18)

where Σ̂ is from Definition 4. We then have, for the spectral norm ‖·‖, that

R_T ≤ ‖ (1/T) Σ_{t=1}^{T} ξ_t ξ_t^T − Σ ‖,

where we dropped the distinguishing notation between the two estimators since the analysis works for both of them. We first state a technical lemma from [16] that we invoke later for bounding the quantity R_T. For simplicity, we assume that the magnitudes of both innovation and observation noise are bounded.

Lemma 7. Let {s_t}_{t=1}^{T} be an independent family of vector-valued random variables, and let H be a function that maps T variables to a self-adjoint matrix of dimension N. Consider a sequence {A_t}_{t=1}^{T} of fixed self-adjoint matrices that satisfy

( H(ω_1, ..., ω_t, ..., ω_T) − H(ω_1, ..., ω'_t, ..., ω_T) )^2 ≼ A_t^2,

where ω_i and ω'_i range over all possible values of s_i for each index i. Letting Var = ‖ Σ_{t=1}^{T} A_t^2 ‖, for all c ≥ 0, we have

P( ‖ H(s_1, ..., s_T) − E[H(s_1, ..., s_T)] ‖ ≥ c ) ≤ N e^{−c^2 / 8Var}.

We mention that results that are similar in spirit have been studied for general unbounded stationary ergodic time series in [17–19] by employing techniques from the online learning literature. On the other hand, our problem has the network structure and the specific evolution of the hidden state, not present in the above works.

Theorem 8. Under the conditions of Theorem 5, together with boundedness of the noise, max_{t≤T} ‖s_t‖ ≤ s for some s > 0, the regret function defined in (18) satisfies

R_T ≤ (1/√T) · 8s^2 √(2 log(N/δ)) / (1 − ρ(Q))^2 + (1/T) · [ 2s‖ξ_0‖ / (1 − ρ(Q))^2 + ‖ξ_0‖^2 / (1 − ρ^2(Q)) + s^2 / (1 − ρ^2(Q))^2 ],    (19)

with probability at least 1 − δ.

4 The Impact of New Friendships on Social Learning

In the social learning model we proposed, agents are cooperative, and they aim to accomplish a global objective. In this direction, the network structure contributes substantially to the learning process. In this section, we restrict our attention to estimator (5), and characterize the intuitive idea that making (losing) friendships can influence the quality of learning in the sense of decreasing (increasing) the steady-state MSD of the network.
To commence, letting e_i denote the i-th unit vector in the standard basis of R^N, we exploit the negative semi-definite, edge function matrix

ΔP(i, j) ≜ −(e_i − e_j)(e_i − e_j)^T,    (20)

for edge addition (removal) to (from) the graph. Essentially, if there is no connection between agents i and j,

P_ε ≜ P + ε ΔP(i, j),    (21)

for ε < min{p_ii, p_jj}, corresponds to a new communication matrix adding the edge {i, j} with a weight ε to the network G, and subtracting ε from the self-reliance of agents i and j.

Proposition 9. Let G_− be the network that results from removing the bidirectional edge {i, j} with weight ε from the network G, and let P_{−ε} and P denote the communication matrices associated with G_− and G, respectively. 
Given Assumption 1, for a fixed signal weight α, the following relationship holds:

MSD̃(P, α) ≤ MSD̃(P_{−ε}, α),    (22)

as long as P is positive semi-definite and |a| < 1/|α|.
Under a mild technical assumption, Proposition 9 suggests that losing connections monotonically increases the MSD, so individuals tend to maintain their friendships to obtain a lower MSD as a global objective. However, this does not elaborate on the existence of individuals with whom losing or making connections could have an immense impact on learning. We bring this concept to light in the following proposition by finding a so-called optimal edge, which provides the most MSD reduction in case it is added to the network graph.

Proposition 10. Given Assumption 1, a positive semi-definite P, and |a| < 1/|α|, to find the optimal edge with a pre-assigned weight ε ≪ 1 to add to the network G, we need to solve the following optimization problem:

min_{{i,j}∉E} Σ_{k=1}^{N} [ h_k(i, j) ≜ z_k(i, j) ( 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ) / ( 1 − a^2 (λ_k(P) − α)^2 )^2 ],    (23)

where

z_k(i, j) ≜ ( v_k^T ΔP(i, j) v_k ) ε,    (24)

and {v_k}_{k=1}^{N} are the right eigenvectors of P. In addition, letting ζ_max = max_{k>1} |λ_k(P) − α|,

min_{{i,j}∉E} Σ_{k=1}^{N} h_k(i, j) ≥ min_{{i,j}∉E} −2ε [ (1 − α^2 a^2)(p_ii + p_jj) + a^2 α ( [P^2]_ii + [P^2]_jj − 2[P^2]_ij ) ] / ( 1 − a^2 ζ_max^2 )^2.    (25)

Proof. Representing the first-order approximation of λ_k(P_ε) using the definition of z_k(i, j) in (24), we have λ_k(P_ε) ≈ λ_k(P) + z_k(i, j) for ε ≪ 1. 
Based on Theorem 5, we now derive

MSD̃(P_ε, α) − MSD̃(P, α)
∝ Σ_{k=1}^{N} (λ_k(P_ε) − λ_k(P)) [ (1 − α^2 a^2)(λ_k(P_ε) + λ_k(P)) + 2 a^2 α λ_k(P) λ_k(P_ε) ] / [ (1 − a^2 (λ_k(P) − α)^2)(1 − a^2 (λ_k(P_ε) − α)^2) ]
≈ Σ_{k=1}^{N} z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) + (1 − α^2 a^2 + 2 a^2 α λ_k(P)) z_k(i, j) ] / [ (1 − a^2 (λ_k(P) − α)^2)(1 − a^2 (λ_k(P) − α + z_k(i, j))^2) ]
= Σ_{k=1}^{N} z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ] / ( 1 − a^2 (λ_k(P) − α)^2 )^2 + O(ε^2),

noting that z_k(i, j) is O(ε) from the definition (24). Minimizing MSD̃(P_ε, α) − MSD̃(P, α) is, hence, equivalent to the optimization (23) when ε ≪ 1. Taking into account that P is positive semi-definite, that z_k(i, j) ≤ 0 for k ≥ 2, and that v_1 = 1_N/√N, which implies z_1(i, j) = 0, we proceed to the lower bound proof using the definitions of h_k(i, j) and ζ_max in the statement of the proposition, as follows:

Σ_{k=1}^{N} h_k(i, j) = Σ_{k=2}^{N} z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ] / ( 1 − a^2 (λ_k(P) − α)^2 )^2
≥ [ 2 / (1 − a^2 ζ_max^2)^2 ] Σ_{k=2}^{N} z_k(i, j) [ (1 − α^2 a^2) λ_k(P) + a^2 α λ_k^2(P) ].

Substituting z_k(i, j) from (24) into the above, we have

Σ_{k=1}^{N} h_k(i, j) ≥ [ 2ε / (1 − a^2 ζ_max^2)^2 ] Σ_{k=1}^{N} ( v_k^T ΔP(i, j) v_k ) [ (1 − α^2 a^2) λ_k(P) + a^2 α λ_k^2(P) ]
= [ 2ε / (1 − a^2 ζ_max^2)^2 ] Tr( ΔP(i, j) Σ_{k=1}^{N} [ (1 − α^2 a^2) λ_k(P) + a^2 α λ_k^2(P) ] v_k v_k^T )
= [ 2ε / (1 − a^2 ζ_max^2)^2 ] Tr( ΔP(i, j) [ (1 − α^2 a^2) P + a^2 α P^2 ] ).

Using the facts that Tr(ΔP(i, j) P) = −p_ii − p_jj + 2p_ij and Tr(ΔP(i, j) P^2) = −[P^2]_ii − [P^2]_jj + 2[P^2]_ij according to the definition of ΔP(i, j) in (20), and that p_ij = 0 since we are adding a non-existent edge {i, j}, the lower bound (25) is derived.
Besides posing the optimal edge problem as an optimization, Proposition 10 also provides an upper bound for the best improvement that making a friendship brings to the network. In view of (25), forming a connection between two agents with more self-reliance and fewer common neighbors minimizes the lower bound, which offers the most room for MSD reduction.

5 Conclusion

We studied a distributed online learning problem over a social network. 
The goal of agents is to estimate the underlying state of the world, which follows a geometric random walk. Each individual receives a noisy signal about the underlying state at each time period, so she communicates with her neighbors to recover the true state. We viewed the problem through an optimization lens where agents want to minimize a global loss function in a collaborative manner. To estimate the true state, we proposed two methodologies derived from different decompositions of the global objective. Given the structure of the network, we provided a tight upper bound on the rate of change of the parameter which allows agents to follow the state with a bounded variance. Moreover, we computed the averaged, steady-state, mean-square deviation of the estimates from the true state. The key observation was the optimality of one of the estimators, indicating the dependence of learning quality on the decomposition. Furthermore, defining the regret as the average of errors in the process of learning during a finite time T, we demonstrated that the regret function of the proposed algorithms decays with a rate O(1/√T). Finally, under mild technical assumptions, we characterized the influence of the network pattern on learning by observing that each connection brings a monotonic decrease in the MSD.

Acknowledgments

We gratefully acknowledge the support of AFOSR MURI CHASE, ONR BRC Program on Decentralized, Online Optimization, NSF under grants CAREER DMS-0954737 and CCF-1116928, as well as the Dean's Research Fund.

References

[1] M. H. DeGroot, "Reaching a consensus," Journal of the American Statistical Association, vol. 69, no. 345, pp. 118-121, 1974.
[2] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, "Non-Bayesian social learning," Games and Economic Behavior, vol. 76, no. 1, pp. 210-225, 2012.
[3] E. Mossel and O. Tamuz, "Efficient Bayesian learning in social networks with Gaussian estimators," arXiv preprint arXiv:1002.0747, 2010.
[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," The Journal of Machine Learning Research, vol. 13, pp. 165-202, 2012.
[5] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Fourth International Symposium on Information Processing in Sensor Networks. IEEE, 2005, pp. 63-70.
[6] S. Kar, J. M. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575-3605, 2012.
[7] S. Shahrampour and A. Jadbabaie, "Exponentially fast parameter estimation in networks using distributed dual averaging," arXiv preprint arXiv:1309.2350, 2013.
[8] D. Acemoglu, A. Nedic, and A. Ozdaglar, "Convergence of rule-of-thumb learning rules in social networks," in 47th IEEE Conference on Decision and Control, 2008, pp. 1714-1720.
[9] R. M. Frongillo, G. Schoenebeck, and O. Tamuz, "Social learning in a changing world," in Internet and Network Economics. Springer, 2011, pp. 146-157.
[10] U. A. Khan, S. Kar, A. Jadbabaie, and J. M. Moura, "On connectivity, observability, and stability in distributed estimation," in 49th IEEE Conference on Decision and Control, 2010, pp. 6639-6644.
[11] R. Olfati-Saber, "Distributed Kalman filtering for sensor networks," in 46th IEEE Conference on Decision and Control, 2007, pp. 5492-5498.
[12] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[13] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592-606, 2012.
[14] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35-45, 1960.
[15] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[16] J. A. Tropp, "User-friendly tail bounds for sums of random matrices," Foundations of Computational Mathematics, vol. 12, no. 4, pp. 389-434, 2012.
[17] G. Biau, K. Bleakley, L. Györfi, and G. Ottucsák, "Nonparametric sequential prediction of time series," Journal of Nonparametric Statistics, vol. 22, no. 3, pp. 297-317, 2010.
[18] L. Györfi and G. Ottucsák, "Sequential prediction of unbounded stationary time series," IEEE Transactions on Information Theory, vol. 53, no. 5, pp. 1866-1872, 2007.
[19] L. Györfi, G. Lugosi et al., Strategies for Sequential Prediction of Stationary Time Series. Springer, 2000.
", "award": [], "sourceid": 1020, "authors": [{"given_name": "Shahin", "family_name": "Shahrampour", "institution": "University of Pennsylvania"}, {"given_name": "Sasha", "family_name": "Rakhlin", "institution": "University of Pennsylvania"}, {"given_name": "Ali", "family_name": "Jadbabaie", "institution": "University of Pennsylvania"}]}