Networks Study Notes



Part 2: Fundamentals of Network Theory

1. Mathematical Models of Networks

1) Acyclic Networks

  • An acyclic network is a directed network that contains no cycles. (Citation networks are the classic example.)

  • Since a self-edge is itself a cycle, an acyclic network contains no self-edges.

  • Every acyclic network has at least one node with only incoming edges and no outgoing edges. (Provable by contradiction using a walk argument.)

  • All edges of an acyclic network can be drawn so that they point downward.

  • Method for visualizing an acyclic network:

First, we search through the network for a node with no outgoing edges. There could be more than one such node, in which case we choose whichever one we like. Let us call this node 1. We now remove node 1 from the network, along with any edges attached to it, then we repeat the process, finding another node with no outgoing edges in the remaining network. We call this node 2, remove it from the network along with its edges, and so forth. After all nodes have been numbered and removed, we put the network back together again and draw a picture of it by placing the nodes in numerical order from bottom to top of the page and then drawing the directed edges in the appropriate positions between them. Every node has outgoing edges only to lower numbered nodes—those drawn below it in the picture—because it had no outgoing edges at the time it was removed from the network, so all its original outgoing edges (if it ever had any) must have been connected to nodes that were removed earlier. Thus all edges in the final picture must be pointing downward.

  • Algorithm for detecting whether a network is acyclic or cyclic:
  1. Find a node with no outgoing edges.
  2. If no such node exists, the network is cyclic. Otherwise, if such a node does exist, remove it and all its ingoing edges from the network.
  3. If all nodes have been removed, the network is acyclic. Otherwise, go back to step 1.
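The three-step detection procedure above can be sketched in a few lines of Python. This is a minimal illustration of my own (the node-list and edge-pair input format is an assumption, not from the book):

```python
def is_acyclic(nodes, edges):
    """Repeatedly remove a node with no outgoing edges; if we get stuck
    before all nodes are removed, the network contains a cycle."""
    remaining = set(nodes)
    out_edges = {n: set() for n in nodes}
    for u, v in edges:                       # edge u -> v
        out_edges[u].add(v)
    while remaining:
        # Step 1: find a node with no outgoing edges among those remaining.
        sink = next((n for n in remaining if not (out_edges[n] & remaining)),
                    None)
        if sink is None:
            return False                     # step 2: no such node -> cyclic
        remaining.remove(sink)               # removing it removes its ingoing edges
    return True                              # step 3: everything removed -> acyclic
```

For example, `is_acyclic([1, 2, 3], [(1, 2), (2, 3), (1, 3)])` is true, while adding the edge `(3, 1)` in place of `(1, 3)` closes a cycle and makes it false.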

2) Hypergraphs

  • Hyperedges: edges that connect more than two nodes.
  • A hypergraph can equivalently be represented as a bipartite network.

3) Bipartite Networks

  • A bipartite network has two types of nodes, and edges run only between nodes of different types.
  • Directed bipartite networks are also possible in principle. (Metabolic networks are an example.)

4) Incidence Matrix and Network Projection

  • Incidence matrix: the equivalent of the adjacency matrix for a bipartite network, with 0/1 entries recording which nodes belong to which groups.

  • A bipartite network is two-mode, but we can consider each node type separately and turn it into two one-mode projections: two nodes of the same type are joined in the one-mode projection if they both connect to a common node of the other type, and are not joined otherwise. A one-mode projection therefore contains nodes of only a single type.
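The projection rule can be sketched as follows. This is a minimal illustration of my own (the list-of-pairs input format and the `side` parameter are assumptions, not from the book):

```python
from itertools import combinations

def one_mode_projection(bipartite_edges, side):
    """One-mode projection of a bipartite network: two same-type nodes are
    joined iff they share at least one neighbor of the other type.
    `bipartite_edges` is a list of (type_A_node, type_B_node) pairs;
    `side` is 0 to project onto type A, 1 to project onto type B."""
    neighbors = {}
    for edge in bipartite_edges:
        keep, other = edge[side], edge[1 - side]
        neighbors.setdefault(other, set()).add(keep)
    projected = set()
    for group in neighbors.values():          # nodes sharing this neighbor
        for u, v in combinations(sorted(group), 2):
            projected.add((u, v))
    return projected
```

For instance, with edges `[("a","x"), ("b","x"), ("b","y"), ("c","y")]`, projecting onto the letter side links `a`–`b` (both attached to `x`) and `b`–`c` (both attached to `y`).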

5) Multilayer Networks

  • A multilayer network consists of several separate networks, each layer representing a particular type of node and its connections, together with interconnecting edges between the layers.

  • A multiplex network is the special case of a multilayer network in which the nodes of every layer represent the same set of objects.

6) Trees

  • A tree is a connected undirected network that contains no loops. A network may also consist of two or more components with no connections between them; if an individual component contains no loops it too is called a tree, and if all components of a network are trees, the complete network is called a forest.
  • There is exactly one path between any pair of nodes in a tree.
  • A tree of n nodes always has exactly n − 1 edges. Conversely, any connected network with n nodes and n − 1 edges is a tree.
  • The connected network on n nodes with the smallest number of edges is always a tree.

7) Planar Graphs

  • A planar network is one that can be drawn on a plane without any edges crossing.
  • All trees are planar.
  • Examples of planar graphs: road networks and the adjacency graph of US states.
  • Four-color theorem: color the nodes of a planar network so that no two nodes joined by an edge have the same color. The total number of colors required is called the chromatic number, and the theorem proves that the chromatic number of a planar network is always four or less.
  • A method for deciding whether a network is planar: Kuratowski's theorem.

8) Degree

  • In an undirected network, the degree of a node is the number of edges connected to it.
  • The mean degree of an undirected network equals the sum of all entries of the adjacency matrix divided by the number of nodes.
  • A network in which all nodes have the same degree is called a regular network (regular graph).

9) Density and Sparsity

  • The density of a network is the ratio of the number of edges to the maximum possible number of edges. It can be viewed as the probability that a pair of nodes is connected, and takes a value between 0 and 1.
  • A network is dense if the density ρ remains non-zero as the number of nodes n becomes large. A network is sparse if ρ → 0 in the limit of large n, in which case the fraction of non-zero elements in the adjacency matrix tends to zero.
  • Mean degree = density × (n − 1) ≈ density × n.
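The relation between density and mean degree can be checked directly from an adjacency matrix. A small sketch of my own, assuming a simple undirected network stored as a list of 0/1 rows:

```python
def density_and_mean_degree(adjacency):
    """Compute density rho = m / (n choose 2) and mean degree c = 2m / n
    for a simple undirected network; note that c = rho * (n - 1)."""
    n = len(adjacency)
    # Count each undirected edge once (upper triangle only).
    m = sum(adjacency[i][j] for i in range(n) for j in range(i + 1, n))
    rho = m / (n * (n - 1) / 2)
    c = 2 * m / n
    return rho, c
```

For a triangle the density is exactly 1 and the mean degree is 2 = ρ(n − 1).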

10) Directed Networks

  • Degree splits into out-degree and in-degree.
  • The total number of edges in a directed network equals the sum of all entries of the adjacency matrix. The mean degree of a directed network (c = m/n) differs by a factor of two from the undirected formula (c = 2m/n).

11) Walks and Paths

  • A walk in a network is any sequence of nodes such that every consecutive pair of nodes in the sequence is connected by an edge.

  • Walks that do not intersect themselves are called paths or self-avoiding walks, and are important in many areas of network theory.

  • The length of a walk in a network is the number of edges traversed along the walk (not the number of nodes).

  • The number of walks of a given length between two nodes can be obtained from the corresponding power of the adjacency matrix. (p131)
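This fact is easy to verify numerically: the entry $[A^r]_{ij}$ counts walks of length $r$ between $j$ and $i$. A small check of my own on the path graph 0–1–2:

```python
import numpy as np

# Path graph 0-1-2: entries of A^r count walks of length r.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

walks2 = np.linalg.matrix_power(A, 2)
# walks2[1, 1] == 2: the two length-2 walks 1-0-1 and 1-2-1.
# walks2[0, 2] == 1: the single length-2 walk 0-1-2.
```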

12) Shortest Paths

  • A shortest path in a network, also sometimes called a geodesic path, is the shortest walk between a given pair of nodes.

  • Among all shortest paths between every pair of nodes in the network for which a path actually exists, the diameter is the length of the longest one.

13) Components

  • Technically, a component is a subset of the nodes of a network such that there exists at least one path from each member of that subset to each other member, and such that no other node in the network can be added to the subset while preserving this property. (Subsets like this, to which no other node can be added while preserving a given property, are called maximal subsets.)

  • A network in which all nodes belong to the same single component is said to be connected.

  • Two nodes are in the same weakly connected component if they are connected by one or more paths through the network, where paths are allowed to go either way along any edge.

  • We define A and B to be connected if and only if there exists a directed path both from A to B and from B to A. In that case, A and B are said to be strongly connected. We can define components for a directed network using this definition of connection, and these are called strongly connected components.

  • Out-components: an out-component is the set of nodes that are reachable via directed paths starting from a specified node A, and including A itself. (Out-component is a property of both the network structure and the starting node)

  • A node that is in both the in-component and the out-component of A is necessarily also in A's strongly connected component.

14) Independent Paths, Connectivity, and Cut Sets

  • There are two species of independent path: edge-independent and node-independent. Two paths connecting a given pair of nodes are edge-independent if they share no edges. Two paths are node-independent if they share no nodes, other than their starting and ending nodes.

  • The number of independent paths (either edge- or node-independent) from A to B cannot exceed A’s degree, since every path must leave node A along a different edge. Similarly, the number of paths cannot exceed B’s degree either.

  • The number of independent paths between a pair of nodes is called the connectivity of the nodes.

  • A cut set, or more properly a node cut set, is a set of nodes whose removal (along with the adjacent edges) will disconnect a specified pair of nodes. (p139)

  • Menger’s theorem says that the size of the minimum cut set between any pair of nodes in a network is equal to the number of independent paths between the same nodes. (p139)

  • Menger’s theorem and the max-flow/min-cut theorem tell us that for a pair of nodes in an undirected network three quantities are all numerically equal to each other: the edge connectivity of the pair (i.e., the number of edge-independent paths connecting them), the size of the minimum edge cut set (i.e., the number of edges that must be removed to disconnect them), and the maximum flow between the nodes expressed as a multiple of the maximum flow along each individual edge. (p140)

15) Maximum Flows and Cut Sets on Weighted Networks

  • A minimum edge cut set is defined as being a cut set such that the sum of the weights on the edges of the set has the minimum possible value.

  • Maximum flow between a given pair of nodes in a network is equal to the sum of the weights on the edges of the minimum edge cut set that separates the same pair of nodes. (p141)

16) The Graph Laplacian

  • The graph Laplacian for a simple undirected, unweighted network is an n × n symmetric matrix L with elements

    $L_{ij} = \begin{cases} k_i & \text{if } i = j, \\ -1 & \text{if } i \neq j \text{ and there is an edge } (i,j), \\ 0 & \text{otherwise}. \end{cases}$

  • Another way to write the same thing is $L_{ij}=k_i\delta_{ij}-A_{ij}$, where $\delta_{ij}$ is the Kronecker delta.
  • We can write $L$ in matrix form as $L=D-A$, where $D$ is the diagonal matrix with the node degrees along its diagonal.
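The construction L = D − A takes one line with numpy. A small sketch of my own, using a star on three nodes as the example network:

```python
import numpy as np

def graph_laplacian(A):
    """L = D - A: node degrees on the diagonal, minus the adjacency matrix."""
    D = np.diag(A.sum(axis=1))
    return D - A

# Node 0 connected to nodes 1 and 2:
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
L = graph_laplacian(A)
```

Note that every row of the resulting matrix sums to zero, a property used repeatedly below.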

  • One can also write a graph Laplacian for weighted networks.

  • One can also treat multigraphs in the same way.
  • There is no natural extension of the graph Laplacian to networks with self-edges or, more importantly, to directed networks; the Laplacian is only useful in the undirected case.

17) Laplacian Application: Graph Partitioning

  • Graph partitioning is the task of dividing the nodes of a network into a set of groups of given sizes so as to minimize the number of edges running between the groups. (p143)

  • The number of edges $R$ running between the two groups, also called the cut size, is $R = \frac{1}{2} \sum_{i,j\ \text{in different groups}} A_{ij}$. We define a set of quantities $s_i$:

    $s_i = \begin{cases} +1 & \text{if node } i \text{ belongs to group 1,} \\ -1 & \text{if node } i \text{ belongs to group 2.} \end{cases}$

    Then

    $\frac{1}{2}(1 - s_i s_j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are in different groups,} \\ 0 & \text{if they are in the same group,} \end{cases}$

    which allows us to rewrite $R$ as

    $R = \frac{1}{4} \sum_{ij} A_{ij} (1 - s_i s_j).$

    The first term in the sum is

    $\sum_{ij} A_{ij} = \sum_i k_i = \sum_{ij} k_i \delta_{ij} s_i s_j,$

    where we have used $s_i^2 = 1$. Substituting back, we then find that

    $R = \frac{1}{4} \sum_{ij} (k_i \delta_{ij} - A_{ij})\, s_i s_j = \frac{1}{4} \sum_{ij} L_{ij}\, s_i s_j,$

    which can be written in matrix form as

    $R = \frac{1}{4}\, \mathbf{s}^\mathsf{T} L\, \mathbf{s}.$

  • The matrix $L$ specifies the structure of our network, the vector $\mathbf{s}$ defines a division of that network into groups, and our goal is to find the vector $\mathbf{s}$ that minimizes the cut size for given $L$. We can make use of the eigenvectors of the graph Laplacian to rapidly find good divisions of the network.
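One standard way to exploit the Laplacian's eigenvectors is spectral bisection: relax the constraint that $s_i = \pm 1$, take the eigenvector of the second-smallest eigenvalue (the Fiedler vector), and split nodes by its sign. A minimal sketch of my own, using two triangles joined by a single bridge edge as a test network:

```python
import numpy as np

def spectral_bisection(A):
    """Split a network into two groups by the sign of the Fiedler vector,
    a common heuristic for finding a small cut size."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
    fiedler = eigvecs[:, 1]                # eigenvector of 2nd-smallest eigenvalue
    return np.where(fiedler >= 0, 1, -1)

# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3:
A = np.zeros((6, 6), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
s = spectral_bisection(A)
```

Here the method recovers the obvious minimum cut: the two triangles end up in different groups, with only the bridge edge running between them.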

18) Laplacian Application: Network Visualization

  • The distance between nodes $i$ and $j$ in our simple one-dimensional model is $|x_i−x_j|$ and the squared distance is $(x_i−x_j)^2$. The sum $\Delta^2$ of the squared distances over all node pairs connected by an edge is then

    $\Delta^2 = \frac{1}{2} \sum_{ij} A_{ij} (x_i - x_j)^2.$

    Expanding this expression, we have

    $\Delta^2 = \frac{1}{2} \sum_{ij} A_{ij}(x_i^2 - 2x_ix_j + x_j^2) = \sum_i k_i x_i^2 - \sum_{ij} A_{ij}\, x_i x_j = \sum_{ij}(k_i\delta_{ij} - A_{ij})\, x_i x_j,$

    which can be written in matrix notation as

    $\Delta^2 = \mathbf{x}^\mathsf{T} L\, \mathbf{x}.$

19) Laplacian Application: Random Walks

  • A random walk is a walk across a network created by taking repeated random steps. Starting at any initial node, we choose uniformly at random among the edges attached to that node, move along the chosen edge to the node at its other end, and repeat the process.

  • Let $p_i(t)$ be the probability that the walk is at node $i$ at time $t$. If the walk is at node $j$ at time $t−1$, the probability of taking a step along any particular one of the $k_j$ edges attached to $j$ is $\frac{1}{k_j}$, so on an undirected network the probability of being at node $i$ on the next step is given by

    $p_i(t) = \sum_j \frac{A_{ij}}{k_j}\, p_j(t-1),$

    or $\mathbf{p}(t)=AD^{−1}\mathbf{p}(t − 1)$ in matrix form, where $\mathbf{p}$ is the vector with elements $p_i$ and, as before, $D$ is the diagonal matrix with the degrees of the nodes down its diagonal.

    In the limit of long time the probability distribution over nodes satisfies $p_i(\infty)=\sum_j A_{ij}\, p_j(\infty)/k_j$, or in matrix form:

    $\mathbf{p} = AD^{-1}\mathbf{p},$

    where $\mathbf{p}$ is shorthand for $\mathbf{p}(\infty)$. Rearranging, this can also be written as

    $(D - A)\,D^{-1}\mathbf{p} = L\,D^{-1}\mathbf{p} = 0.$

    Thus $D^{-1}\mathbf{p}$ is (any multiple of) an eigenvector of the Laplacian with eigenvalue 0.

    On any given step a random walk is equally likely to traverse every edge.
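On a connected non-bipartite network, iterating $\mathbf{p}(t)=AD^{-1}\mathbf{p}(t-1)$ converges to the degree-proportional distribution $p_i = k_i/2m$ (the eigenvector-of-eigenvalue-0 statement above, with $D^{-1}\mathbf{p} \propto \mathbf{1}$). A small numerical check of my own, on a triangle with a pendant node:

```python
import numpy as np

# Triangle on nodes 0-2 with pendant node 3 attached to node 0.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
k = A.sum(axis=1)                  # degrees: 3, 2, 2, 1

p = np.full(4, 0.25)               # start from the uniform distribution
for _ in range(200):
    p = A @ (p / k)                # one step: p <- A D^{-1} p

stationary = k / k.sum()           # k_i / 2m
```

After enough steps `p` agrees with `stationary` to machine precision, confirming that the walk visits each node with probability proportional to its degree.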

20) Laplacian Application: Resistor Networks

  • Let $V_i$ be the voltage at node $i$, measured relative to any convenient reference potential. Then Kirchhoff's current law says that

    $\sum_j A_{ij}\,\frac{V_i - V_j}{R} = I_i,$

    where $R$ is the resistance of each edge and $I_i$ represents any current injected into node $i$ by an external current source. In our case this external current is non-zero only for the two nodes $s$ and $t$ connected to the external voltage: $I_s = +I$, $I_t = -I$, and $I_i = 0$ otherwise. The equation can be written as $k_iV_i-\sum_jA_{ij}V_j=RI_i$, or

    $\sum_j (k_i\delta_{ij} - A_{ij})\,V_j = R\,I_i,$

    which in matrix form is $L\mathbf{V} = R\,\mathbf{I}$, where $L$ is once again the graph Laplacian. This equation is a kind of matrix version of the standard Ohm's law $V=RI$ for a single resistor, and by solving it for $\mathbf{V}$ we can calculate the voltages at every node in the network.

21) Properties of the Graph Laplacian

  • It has the property that every row of the matrix sums to zero:

    $\sum_j L_{ij} = k_i - \sum_j A_{ij} = k_i - k_i = 0.$

    Similarly, every column of the matrix also sums to zero.

  • Since the Laplacian is a real symmetric matrix, it necessarily has real eigenvalues. But we can say more than this: all the eigenvalues of the Laplacian are also non-negative.

    Let $\lambda$ be any eigenvalue of the graph Laplacian and let $\mathbf{v}$ be the corresponding eigenvector, unit normalized so that $\mathbf{v}^\mathsf{T}\mathbf{v}=1$. Then $L\mathbf{v}=\lambda\mathbf{v}$ and

    $\lambda = \mathbf{v}^\mathsf{T} L\, \mathbf{v}.$

    We can write

    $L = B^\mathsf{T} B,$

    where $B$ is the $m \times n$ edge incidence matrix with one row per edge: for the edge $(u,w)$ the row has $+1$ in column $u$, $-1$ in column $w$, and 0 elsewhere. We then get

    $\lambda = \mathbf{v}^\mathsf{T} B^\mathsf{T} B\, \mathbf{v} = (B\mathbf{v})^\mathsf{T}(B\mathbf{v}) = |B\mathbf{v}|^2 \ge 0.$

    Thus all eigenvalues of the Laplacian are non-negative.

  • In fact the Laplacian always has at least one zero eigenvalue. As we have seen, every row of the matrix sums to zero, which means that the vector $1$$=(1, 1, 1, . . .)$ is always an eigenvector of the Laplacian with eigenvalue zero: $L1=0.$

  • The determinant of a matrix is the product of its eigenvalues, and hence the determinant of the Laplacian is always zero, so the matrix is singular.

  • If a network has only one component then the second smallest eigenvalue will be non-zero.

  • The second smallest eigenvalue of the Laplacian is called the algebraic connectivity of the network or the spectral gap.

  • The second smallest eigenvalue is non-zero if and only if the network is connected.

  • It is a straightforward extension of the same arguments to show that the number of zero eigenvalues of the Laplacian is in fact always exactly equal to the number of components in the network.

2. Network Measures

1) Centrality

  • This research addresses the question, “Which are the most important or central nodes in a network?”
[1.1] Degree Centrality
  • Degree is sometimes called degree centrality in the social networks literature, to emphasize its use as a centrality measure. (social network & citation network)
[1.2] Eigenvector Centrality
  • In many circumstances a node’s importance in a network is increased by having connections to other nodes that are themselves important. Eigenvector centrality is an extension of degree centrality that takes this factor into account.

  • Consider an undirected network of n nodes. The eigenvector centrality $x_i$ of node $i$ is defined to be proportional to the sum of the centralities of $i$'s neighbors, so that

    $x_i = \kappa^{-1} \sum_{j \in \text{neighbors of } i} x_j,$

    where we have called the constant of proportionality $\kappa^{-1}$ for reasons that will become clear. For the moment we will leave the value of $\kappa$ arbitrary; we will choose a value shortly.

    An alternative way to write this is to make use of the adjacency matrix:

    $x_i = \kappa^{-1} \sum_j A_{ij}\, x_j.$

    This formula can also be written in matrix notation as $\mathbf{x}=\kappa^{-1}A\mathbf{x}$, or equivalently

    $A\mathbf{x} = \kappa\,\mathbf{x},$

    where $\mathbf{x}$ is the vector with elements equal to the centrality scores $x_i$. In other words, $\mathbf{x}$ is an eigenvector of the adjacency matrix.

  • With eigenvector centrality defined in this way, a node can achieve high centrality either by having a lot of neighbors with modest centrality, or by having a few neighbors with high centrality.

  • Assuming we want our centrality scores to be non-negative, there is only one choice: $x$ must be the leading eigenvector of the adjacency matrix, i.e., the eigenvector corresponding to the largest (most positive) eigenvalue.

  • Perron–Frobenius theorem: for a matrix with all elements non-negative, like the adjacency matrix, there is only one eigenvector that also has all elements non-negative, and that is the leading eigenvector. Every other eigenvector must have at least one negative element.
  • This also fixes the value of the constant $\kappa$—it must be equal to the largest eigenvalue.
  • A directed network has an adjacency matrix that is, in general, asymmetric. This means it has two sets of eigenvectors, the left eigenvectors and the right eigenvectors, and hence two leading eigenvectors. Which of the two should we use to define the centrality? In most cases the correct answer is to use the right eigenvector. (The reason is that centrality in directed networks is usually bestowed by other nodes that point towards you, rather than by you pointing to others.)
  • In mathematical terms, only nodes that are in a strongly connected component of two or more nodes, or the out-component of such a strongly connected component, can have non-zero eigenvector centrality.
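The leading eigenvector can be found by power iteration: repeatedly multiply any positive starting vector by $A$ and renormalize. A minimal sketch of my own, on a triangle with a pendant node (where the triangle node carrying the pendant should score highest):

```python
import numpy as np

def eigenvector_centrality(A, iters=200):
    """Power iteration converges to the leading eigenvector of A, which
    the Perron-Frobenius theorem guarantees is the non-negative one."""
    x = np.ones(len(A))
    for _ in range(iters):
        x = A @ x
        x = x / np.linalg.norm(x)
    return x

# Triangle on nodes 0-2 with pendant node 3 attached to node 0:
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
x = eigenvector_centrality(A)
```

Node 0 scores highest (well-connected neighbors plus an extra edge), the symmetric nodes 1 and 2 tie, and the pendant node 3 scores lowest.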
[1.3] Katz Centrality
  • We simply give each node a small amount of centrality "for free," regardless of its position in the network or the centrality of its neighbors. In other words, we define

    $x_i = \alpha \sum_j A_{ij}\, x_j + \beta,$

    where $\alpha$ and $\beta$ are positive constants. The first term is the normal eigenvector centrality term in which the centralities of the nodes pointing to $i$ are summed, and the second term is the "free" part, the constant extra amount that all nodes receive.

    In matrix terms, it can be written

    $\mathbf{x} = \alpha A \mathbf{x} + \beta\,\mathbf{1}.$

    Rearranging for $\mathbf{x}$, we then find that

    $\mathbf{x} = \beta\,(I - \alpha A)^{-1}\,\mathbf{1}.$

    For convenience we usually set $\beta=1$, giving

    $\mathbf{x} = (I - \alpha A)^{-1}\,\mathbf{1}.$

    The definition of the Katz centrality contains the parameter $\alpha$, which governs the balance between the eigenvector centrality term and the constant term.

  • Most researchers have employed values of $\alpha$ close to the maximum of $1/\kappa_1$, which places the maximum amount of weight on the eigenvector term and the smallest amount on the constant term.

  • It allows a node that has many neighbors to have high centrality regardless of whether those neighbors themselves have high centrality, and this could be useful in some applications.
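The closed form $\mathbf{x} = \beta(I - \alpha A)^{-1}\mathbf{1}$ translates directly into a linear solve. A sketch of my own, using a star plus an isolated node to show the "free" centrality at work (the choice $\alpha = 0.2$ is an arbitrary value below $1/\kappa_1 = 1/\sqrt{3}$ for this network):

```python
import numpy as np

def katz_centrality(A, alpha, beta=1.0):
    """Solve x = alpha*A*x + beta*1 directly, i.e. x = beta*(I - alpha*A)^{-1} 1.
    Valid only for alpha below 1/kappa_1 (inverse leading eigenvalue of A)."""
    n = len(A)
    return np.linalg.solve(np.eye(n) - alpha * A, beta * np.ones(n))

# Star with center 0 and leaves 1-3, plus an isolated node 4.
A = np.zeros((5, 5))
for leaf in (1, 2, 3):
    A[0, leaf] = A[leaf, 0] = 1
x = katz_centrality(A, alpha=0.2)
```

The isolated node still receives exactly the free amount $\beta = 1$, while the center outranks the leaves; this is the sense in which every node gets some centrality regardless of position.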
[1.4] PageRank
  • In many cases it means less if a node is only one among many that are pointed to. The centrality gained by virtue of receiving an edge from a prestigious node is diluted by being shared with so many others.

  • We can allow for this by defining a variant of the Katz centrality in which the centrality I derive from my network neighbors is proportional to their centrality divided by their out-degree. Then nodes that point to many others pass only a small amount of centrality on to each of those others, even if their own centrality is high. In mathematical terms this centrality is defined by

    $x_i = \alpha \sum_j A_{ij}\,\frac{x_j}{k_j^{\text{out}}} + \beta.$

    In matrix terms, it is then

    $\mathbf{x} = \alpha A D^{-1} \mathbf{x} + \beta\,\mathbf{1},$

    with $\mathbf{1}$ being again the vector $(1, 1, 1, \ldots)$ and $D$ being the diagonal matrix with elements $D_{ii}=\max(k^{\text{out}}_i, 1)$. Rearranging, we find that $\mathbf{x}=\beta (I-\alpha AD^{-1})^{-1}\mathbf{1}$, and thus, as before, $\beta$ plays the role only of an unimportant overall multiplier for the centrality. Conventionally, we set $\beta=1$, giving

    $\mathbf{x} = (I - \alpha A D^{-1})^{-1}\,\mathbf{1}.$

    This centrality measure is commonly known as PageRank, which is a name given to it by the Google web search corporation. The value of $\alpha$ should be less than the inverse of the largest eigenvalue of $AD^{−1}$. For an undirected network this largest eigenvalue turns out to be one, and thus $\alpha$ should be less than one.
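The fixed point $\mathbf{x} = \alpha A D^{-1}\mathbf{x} + \mathbf{1}$ can be found by simple iteration. A minimal sketch of my own (the three-page example network is an invention for illustration; $\alpha = 0.85$ is the conventionally quoted damping value):

```python
import numpy as np

def pagerank(A, alpha=0.85, iters=200):
    """Iterate x <- alpha * A D^{-1} x + 1 (beta = 1), where A[i][j] = 1
    means an edge from j to i, and D_jj = max(k_j_out, 1)."""
    k_out = np.maximum(A.sum(axis=0), 1)    # column sums = out-degrees
    x = np.ones(len(A))
    for _ in range(iters):
        x = alpha * (A @ (x / k_out)) + 1
    return x

# Three pages: 1 -> 0, 2 -> 0, and 0 -> 1, so page 0 is most pointed-to.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
x = pagerank(A)
```

Page 0 ranks highest, page 1 benefits from being pointed to by the high-ranking page 0, and page 2, which nothing points to, keeps only the free contribution of 1.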

[1.5] Hubs and Authorities
  • Authorities are nodes that contain useful information on a topic of interest and hubs are nodes that tell us where the best authorities are to be found.

  • The concept of hubs and authorities in networks was first put forward by Kleinberg and developed by him into a centrality algorithm called hyperlink-induced topic search or HITS. The HITS algorithm gives each node $i$ in a directed network two different centrality scores, the authority centrality $x_i$ and the hub centrality $y_i$ , which quantify nodes’ prominence in the two roles. The defining characteristic of a node with high authority centrality is that it is pointed to by many nodes with high hub centrality. Conversely, the defining characteristic of a node with high hub centrality is that it points to many nodes with high authority centrality.

  • In Kleinberg’s approach the authority centrality of a node is defined to be proportional to the sum of the hub centralities of the nodes that point to it:

    $x_i = \alpha \sum_j A_{ij}\, y_j,$

    where $\alpha$ is a constant. Similarly, the hub centrality of a node is proportional to the sum of the authority centralities of the nodes it points to:

    $y_i = \beta \sum_j A_{ji}\, x_j,$

    with $\beta$ another constant. Note that the indices on the matrix element $A_{ji}$ are swapped around in this second equation: it is the nodes that $i$ points to that define its hub centrality.

    In matrix terms these equations can be written as

    $\mathbf{x} = \alpha A \mathbf{y}, \qquad \mathbf{y} = \beta A^\mathsf{T} \mathbf{x},$

    or, combining the two,

    $AA^\mathsf{T}\mathbf{x} = \lambda\,\mathbf{x},$

    and

    $A^\mathsf{T}A\,\mathbf{y} = \lambda\,\mathbf{y},$

    where $\lambda=(\alpha \beta)^{-1}$. Thus the authority and hub centralities are respectively given by eigenvectors of $AA^\mathsf{T}$ and $A^\mathsf{T}A$ with the same eigenvalue. It is easily proved, however, that this is the case, and in fact that all eigenvalues are the same for the two matrices. (p169)

  • In the hub and authority approach nodes not cited by any others have authority centrality zero (which is reasonable), but they can still have non-zero hub centrality.

[1.6] Closeness Centrality
  • The mean shortest distance from $i$ to every node in the network is

    $\ell_i = \frac{1}{n} \sum_j d_{ij}.$

    The inverse of $\ell_i$ is called the closeness centrality:

    $C_i = \frac{1}{\ell_i} = \frac{n}{\sum_j d_{ij}}.$

  • One potential problem with this definition of closeness centrality concerns networks that have more than one component, where some distances are infinite. Perhaps a better solution is to redefine closeness in terms of the harmonic mean distance between nodes, i.e., the average of the inverse distances:

    $C_i' = \frac{1}{n-1} \sum_{j\,(\neq i)} \frac{1}{d_{ij}}.$

    First, if $d_{ij}=\infty$ because $i$ and $j$ are in different components, then the corresponding term in the sum is simply zero and drops out. Second, the measure naturally gives more weight to nodes that are close to $i$ than to those far away.
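The harmonic-mean version is easy to compute with a breadth-first search, since unreachable nodes simply never appear in the distance table. A minimal sketch of my own (the dict-of-neighbors input format is an assumption):

```python
from collections import deque

def harmonic_closeness(adj, i):
    """Harmonic closeness: average of inverse shortest-path distances from i,
    with nodes in other components contributing zero."""
    dist = {i: 0}
    queue = deque([i])
    while queue:                          # breadth-first search from i
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    n = len(adj)
    return sum(1 / d for node, d in dist.items() if node != i) / (n - 1)
```

On the path 0–1–2 plus an isolated node 3, node 0 gets $(1/1 + 1/2 + 0)/3 = 0.5$: the unreachable node drops out of the sum exactly as described above.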

[1.7] Betweenness Centrality
  • Betweenness centrality may still be a reasonable guide to the influence nodes have over the flow of information between others.

  • Let $n^i_{st}$ be 1 if node $i$ lies on the shortest path from $s$ to $t$ and 0 if it does not or if there is no such path. Then the betweenness centrality $x_i$ is given by

    $x_i = \sum_{st} n^i_{st}.$

    Note that this definition counts separately the shortest paths in either direction between each node pair.

  • Extension:if there are two shortest paths between a given pair of nodes, each of them gets weight $\frac{1}{2}$. Then the betweenness of a node is defined to be the sum of the weights of all shortest paths passing through that node.

  • If two or more paths pass through the same node then the betweenness sum includes contributions from each of them.

  • We redefine $n^i_{st}$ to be the number of shortest paths from $s$ to $t$ that pass through $i$ and we define $g_{st}$ to be the total number of shortest paths from $s$ to $t$. Then the betweenness centrality of node $i$ on a general network is

    $x_i = \sum_{st} \frac{n^i_{st}}{g_{st}},$

    where we adopt the convention that $n^i_{st}/g_{st} = 0$ if both $n^i_{st}$ and $g_{st}$ are zero.

  • Betweenness centrality differs from the other centrality measures we have considered in being not principally a measure of how well-connected a node is. Instead it measures how much a node falls “between” others.

  • One natural choice is to normalize the path count by dividing by the total number of (ordered) node pairs, which is $n^2$, so that betweenness becomes the fraction (rather than the number) of paths that run through a given node:

    $x_i = \frac{1}{n^2} \sum_{st} \frac{n^i_{st}}{g_{st}}.$

    With this definition, the values of the betweenness lie strictly between zero and one.

  • Flow betweenness is a variant of betweenness centrality that uses edge-independent paths between node pairs rather than shortest paths.

  • Another variant is random-walk betweenness, which imagines messages performing random walks across the network between every possible starting point and destination, and the betweenness is defined as the average number of such messages that pass through each node.

2) Groups of Nodes

  • In this section we discuss some simpler concepts of network groups that can be useful for probing and describing the local structure of networks. The primary constructs we look at are cliques, k-cores, and k-components.
[2.1] Cliques
  • A clique is a set of nodes within an undirected network such that every member of the set is connected by an edge to every other.
  • The one-mode projection creates a network that is naturally composed of cliques.
[2.2] Cores
  • k-core is a connected set of nodes where each is joined to at least k of the others.
  • A simple way to find them is to start with a given network and remove from it any nodes that have degree less than k, along with their attached edges, since clearly such nodes cannot under any circumstances be members of a k-core. In so doing, one will normally reduce the degrees of some other nodes in the network—those that were connected to the nodes just removed. So we then go through the network again to see if there are any additional nodes that now have degree less than k and remove those too. And so we proceed, repeatedly pruning the network to remove nodes with degree less than k until no such nodes remain. What is left over will, by definition, be a k-core or a set of k-cores, since each node is connected to at least k others. Note that we are not necessarily left with a single k-core—there’s no guarantee that the network will be connected once we are done pruning it, even if it was connected to start with.
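The pruning procedure described above can be sketched compactly. A minimal illustration of my own (the dict-of-neighbor-sets input format is an assumption):

```python
def k_core(adj, k):
    """Repeatedly remove nodes of degree < k, together with their edges,
    until none remain; the surviving nodes form the k-core(s), possibly
    disconnected and possibly empty."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for u in [n for n in adj if len(adj[n]) < k]:
            for v in adj.pop(u):                  # delete u and its edges
                if v in adj:
                    adj[v].discard(u)             # neighbor degrees drop
            changed = True
    return set(adj)
```

On a triangle with a pendant node, the 2-core is the triangle (the pendant is pruned), while asking for a 3-core triggers the cascade described above and empties the network.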
[2.3] Components and k-Components
  • A k-component (sometimes also called a k-connected component) is a set of nodes such that each is reachable from each of the others by at least k node-independent paths.
  • k-components are nested within each other.
  • So another way of defining a k-component would be to say that it is a subset of a network in which no pair of nodes can be disconnected from each other by removing fewer than k other nodes.
  • One disadvantage of k-components as a definition of node groups is that for $k\ge3$ they can be non-contiguous.
  • k-components are sometimes defined slightly differently, to be a set of nodes such that every pair in the set is connected by at least k node-independent paths that themselves are contained entirely within the subset.

3) Transitivity and the Clustering Coefficient


Author: Peyton
Copyright: Unless otherwise noted, all articles on this blog are licensed under CC BY 4.0. Please credit Peyton when reposting!