# Graph Neural Networks

## Limitations of Shallow Embedding Methods

The embedding matrix ${\bf Z}$ has shape (embedding dimension $\times$ number of nodes).
- $O(|V|d)$ parameters are needed:
  - No sharing of parameters between nodes
  - Every node has its own unique embedding
- Inherently "transductive":
  - Cannot generate embeddings for nodes that are not seen during training
- Do not incorporate node features:
  - Nodes in many graphs have features that we can and should leverage
## Graph Encoder

$\text{ENC}$: multiple layers of non-linear transformations based on the graph structure.
Tasks we will be able to solve:

- Node classification: predict the type of a given node
- Link prediction: predict whether two nodes are linked
- Community detection: identify densely linked clusters of nodes
- Network similarity: how similar are two (sub)networks?
Graphs are more complex than images or text: they have arbitrary size and a complex topological structure.
Fundamental training step (gradient descent): $\Theta^* \leftarrow \Theta - \eta\nabla_{\Theta}\mathcal{L}$
If we feed the adjacency matrix directly into a linear layer, the layer's weights are sensitive to node order, so permuting the nodes produces very different outputs. The model also needs $O(|V|)$ parameters, and the fixed feature dimension means it cannot handle graphs of different sizes.
A graph does not have a canonical ordering of its nodes!
A graph representation should therefore be permutation invariant.
${\bf X} \in {\Bbb R}^{|V|\times d}$ is the matrix of node features.
## Permutation Invariance

If $f({\bf A}_i, {\bf X}_i) = f({\bf A}_j, {\bf X}_j)$ for any two order plans $i$ and $j$, we formally say that $f$ is a permutation-invariant function.

Definition: for any graph function $f: {\Bbb R}^{|V|\times m}\times {\Bbb R}^{|V|\times|V|} \rightarrow {\Bbb R}^d$, $f$ is permutation-invariant if $f({\bf A},{\bf X}) = f({\bf P A P^\top}, {\bf P X})$ for any permutation matrix $\bf P$.
Permute the input, and the output stays the same (maps a graph to a vector).

E.g. $f$: sum, average, pooling.
## Permutation Equivariance

If the output vector of a node at the same position in the graph remains unchanged for any order plan, we say $f$ is permutation equivariant.

Definition: for any node function $f: {\Bbb R}^{|V|\times m}\times {\Bbb R}^{|V|\times|V|} \rightarrow {\Bbb R}^{|V|\times m}$, $f$ is permutation-equivariant if ${\bf P}f({\bf A},{\bf X}) = f({\bf P A P^\top}, {\bf P X})$ for any permutation matrix $\bf P$.

Permute the input, and the output permutes accordingly (maps a graph to a matrix).

E.g. $f$: RNN, self-attention.
Examples:

- $f({\bf A},{\bf X}) = {\bf 1}^\top {\bf X}$: permutation-invariant, since $f({\bf P A P^\top}, {\bf P X}) = {\bf 1}^\top {\bf P X} = {\bf 1}^\top {\bf X} = f({\bf A}, {\bf X})$
- $f({\bf A},{\bf X}) = {\bf X}$: permutation-equivariant, since $f({\bf P A P^\top}, {\bf P X}) = {\bf P X} = {\bf P}f({\bf A}, {\bf X})$
- $f({\bf A},{\bf X}) = {\bf A X}$: permutation-equivariant, since $f({\bf P A P^\top}, {\bf P X}) = {\bf P A P^\top P X} = {\bf P A X} = {\bf P}f({\bf A}, {\bf X})$
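As a quick sanity check, here is a minimal NumPy sketch (the graph, features, and permutation are chosen arbitrarily for illustration) verifying the invariant and equivariant examples above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
A = (rng.random((n, n)) < 0.5).astype(float)   # arbitrary adjacency matrix
X = rng.normal(size=(n, d))                    # arbitrary node features
P = np.eye(n)[rng.permutation(n)]              # random permutation matrix

# Invariant: f(A, X) = 1^T X gives the same output for the permuted graph.
assert np.allclose(np.ones(n) @ X, np.ones(n) @ (P @ X))

# Equivariant: f(A, X) = A X permutes along with the input, P f(A, X) = f(P A P^T, P X).
assert np.allclose(P @ (A @ X), (P @ A @ P.T) @ (P @ X))

print("permutation checks passed")
```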
## Graph Convolutional Networks

Idea: a node's neighborhood defines a computation graph.
*(Figure from CS224W Lecture 4 slides)*
Intuition: Nodes aggregate information from their neighbors using neural networks
Model can be of arbitrary depth.
## Neighborhood Aggregation

Neighborhood aggregation: the key distinction between approaches is how they aggregate information across the layers.

Basic approach:
$$
\begin{aligned}
&{\rm h}_v^{(0)}={\rm x}_v\\
&{\rm h}_v^{(k+1)}=\sigma\left({\rm W}_k\sum_{u\in N(v)}\frac{{\rm h}_u^{(k)}}{|N(v)|}+{\rm B}_k{\rm h}_v^{(k)}\right),\ \forall k\in\{0,\ \dots,\ K-1\}\\
&{\rm z}_v={\rm h}_v^{(K)}
\end{aligned}
$$

- ${\rm h}_v^{(0)}$: initial (0th-layer) embedding, equal to the node features
- $\sigma$: non-linear activation function
- $\sum_{u\in N(v)}\frac{{\rm h}_u^{(k)}}{|N(v)|}$: average of the neighbors' previous-layer embeddings
- ${\rm h}_v^{(k)}$: embedding of $v$ at layer $k$
- ${\rm z}_v$: embedding after $K$ layers of neighborhood aggregation
- ${\rm W}_k$: weight matrix for neighborhood aggregation
- ${\rm B}_k$: weight matrix for transforming the node's own hidden vector
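A minimal per-node sketch of this update rule in PyTorch (the dictionary-based graph representation and the names `W`, `B` are illustrative, not a prescribed implementation):

```python
import torch

def gcn_node_update(v, h_prev, neighbors, W, B, sigma=torch.relu):
    """Update the embedding of node v following the formula above.

    h_prev: dict node -> (d_in,) embedding from layer k
    neighbors: dict node -> list of neighbor node ids (assumed non-empty)
    W, B: (d_out, d_in) weight matrices for neighbors / self
    """
    neigh_avg = torch.stack([h_prev[u] for u in neighbors[v]]).mean(dim=0)
    return sigma(W @ neigh_avg + B @ h_prev[v])
```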
## GCN: Invariance and Equivariance

Given a node, the GCN that computes its embedding is permutation invariant.
The average of the neighbors' previous-layer embeddings is permutation invariant (because of the averaging, the order of the neighbors does not matter).
Considering all nodes in a graph, GCN computation is permutation equivariant.
*(Figure from CS224W Lecture 4 slides)*
Let ${\rm H}^{(k)} = [{\rm h}_1^{(k)},\ \dots,\ {\rm h}_{|V|}^{(k)}]^\top$; then $\sum_{u\in N(v)} {\rm h}_u^{(k)}={\rm A}_v{\rm H}^{(k)}$, where ${\rm A}_v$ is the $v$-th row of the adjacency matrix.

Let ${\rm D}$ be the diagonal degree matrix with ${\rm D}_{v,v} = \deg(v)= |N(v)|$, so that ${\rm D}_{v,v}^{-1} = \frac{1}{|N(v)|}$.

Therefore, the neighborhood average $\sum_{u\in N(v)}\frac{{\rm h}_u^{(k)}}{|N(v)|}$ for all nodes can be written in matrix form as ${\rm D}^{-1}{\rm A}{\rm H}^{(k)}$, and the full layer update becomes

$$
{\rm H}^{(k+1)}=\sigma\left(\tilde{\rm A}{\rm H}^{(k)}{\rm W}_k^\top + {\rm H}^{(k)}{\rm B}_k^\top \right), \qquad \tilde{\rm A} = {\rm D}^{-1}{\rm A}
$$
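A minimal sketch of this matrix-form update (dense tensors; the tensor and parameter names are illustrative):

```python
import torch

def gcn_layer_matrix(H, A, W_k, B_k, sigma=torch.relu):
    """Matrix form of the GCN layer: H' = sigma(A_tilde H W_k^T + H B_k^T)."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees |N(v)| as a column
    A_tilde = A / deg                               # D^{-1} A (row-normalized adjacency)
    return sigma(A_tilde @ H @ W_k.T + H @ B_k.T)
```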
## How to Train a GNN

### Supervised setting

Node labels are available.

$$
{\cal L}=\sum_{v\in V}\text{CE}\left(y_{v},\ \sigma({\rm z}_v^\top\theta)\right)
$$

- ${\rm z}_v$: encoder output (node embedding)
- $y_v$: node class label
- $\theta$: classification weights
### Unsupervised setting

No node labels are available, so use the graph structure itself as the supervision!

$$
{\cal L}=\sum_{{\rm z}_u, {\rm z}_v}\text{CE}\left(y_{u,v},\ \text{DEC}({\rm z}_u,{\rm z}_v)\right)
$$

"Similar" nodes should have similar embeddings; $\text{DEC}$ is a decoder such as the inner product.
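A minimal sketch of this unsupervised objective with an inner-product decoder (how the "similar" and "dissimilar" pairs are chosen, e.g. edges vs. random non-edges, is left open here):

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(z, pos_pairs, neg_pairs):
    """Cross-entropy on an inner-product decoder DEC(z_u, z_v) = z_u . z_v.

    z: (|V|, d) node embeddings from the GNN encoder
    pos_pairs / neg_pairs: (P, 2) long tensors of similar / dissimilar node pairs
    """
    def scores(pairs):
        return (z[pairs[:, 0]] * z[pairs[:, 1]]).sum(dim=-1)

    pos = F.binary_cross_entropy_with_logits(scores(pos_pairs), torch.ones(len(pos_pairs)))
    neg = F.binary_cross_entropy_with_logits(scores(neg_pairs), torch.zeros(len(neg_pairs)))
    return pos + neg
```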
## Model Process

1. Define a neighborhood aggregation function.
2. Define a loss function on the embeddings.
3. Train on a set of nodes, i.e., a batch of computation graphs.
4. Generate embeddings for nodes as needed (only the nodes relevant to the batch are computed).
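Putting these steps together, here is a minimal end-to-end sketch for supervised node classification (the toy graph, layer sizes, and hyperparameters are purely illustrative):

```python
import torch
import torch.nn.functional as F

class TwoLayerGCN(torch.nn.Module):
    """Two rounds of neighborhood aggregation (mean of neighbors + self transform)."""
    def __init__(self, d_in, d_hid, n_classes):
        super().__init__()
        self.W1, self.B1 = torch.nn.Linear(d_in, d_hid), torch.nn.Linear(d_in, d_hid)
        self.W2, self.B2 = torch.nn.Linear(d_hid, n_classes), torch.nn.Linear(d_hid, n_classes)

    def forward(self, X, A):
        A_tilde = A / A.sum(dim=1, keepdim=True).clamp(min=1)   # D^{-1} A
        H = F.relu(self.W1(A_tilde @ X) + self.B1(X))
        return self.W2(A_tilde @ H) + self.B2(H)                # class logits

# Toy data: a 4-node path graph with random features and labels (illustrative only).
A = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
X = torch.randn(4, 8)
y = torch.tensor([0, 0, 1, 1])

model = TwoLayerGCN(d_in=8, d_hid=16, n_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):                         # Theta <- Theta - eta * grad(L)
    opt.zero_grad()
    loss = F.cross_entropy(model(X, A), y)   # supervised CE loss on node labels
    loss.backward()
    opt.step()
```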
## Inductive Capability: New Nodes and New Graphs

The same aggregation parameters are shared across all nodes:

- The number of model parameters is sublinear in $|V|$, so we can generalize to unseen nodes!
- Inductive node embedding also generalizes to entirely unseen graphs.
# A General Perspective on GNNs

## A General GNN Framework
*(Figure from CS224W Lecture 5 slides)*
1. Message
2. Aggregation
3. Layer connectivity
4. Graph augmentation (idea: raw input graph $\neq$ computational graph)
   - Graph feature augmentation
   - Graph structure augmentation
5. Learning objective
## A Single Layer of a GNN

Idea of a GNN layer: compress a set of vectors into a single vector.
### Message

Message function: each node creates a message, which will be sent to other nodes later.

$$
{\rm m}_u^{(l)} = \text{MSG}^{(l)}\left({\rm h}_u^{(l-1)}\right)
$$

If the message function $\text{MSG}$ is a linear layer, then ${\rm m}_u^{(l)} = {\rm W}^{(l)}{\rm h}_u^{(l-1)}$.
### Aggregation

Intuition: node $v$ aggregates the messages from its neighbors:

$$
{\rm h}_v^{(l)} = \text{AGG}^{(l)}\left(\{{\rm m}_u^{(l)},\ u\in N(v)\}\right)
$$

Possible aggregation functions: $\text{sum}(\cdot)$, $\text{mean}(\cdot)$, $\text{max}(\cdot)$.

Issue: information from node $v$ itself could get lost, because the aggregated messages come only from $v$'s neighbors, not from $v$ itself.

Solution:

- Compute a message from node $v$ itself: ${\rm m}_v^{(l)} = {\rm B}^{(l)}{\rm h}_v^{(l-1)}$
- Then include it after aggregating the neighbors, e.g. via concatenation: ${\rm h}_v^{(l)} = \text{CONCAT}\left(\text{AGG}^{(l)}\left(\{{\rm m}_u^{(l)},\ u\in N(v)\}\right),\ {\rm m}_v^{(l)}\right)$
Putting it together:

- Message: ${\rm m}_u^{(l)} = \text{MSG}^{(l)}\left({\rm h}_u^{(l-1)}\right),\ u \in N(v)\cup \{v\}$
- Aggregation: ${\rm h}_v^{(l)} = \text{AGG}^{(l)}\left(\{{\rm m}_u^{(l)},\ u\in N(v)\},\ {\rm m}_v^{(l)}\right)$
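A minimal sketch of one such layer (dictionary-based graph, linear message functions; adding rather than concatenating the self message is just one possible choice):

```python
import torch

def gnn_layer(h_prev, neighbors, W, B, agg=lambda m: torch.stack(m).sum(dim=0)):
    """One generic GNN layer in message / aggregation form.

    h_prev: dict node -> (d_in,) embedding from layer l-1
    neighbors: dict node -> list of neighbor ids (assumed non-empty)
    W, B: (d_out, d_in) message weights for neighbors / self
    agg: permutation-invariant aggregator over the neighbor messages
    """
    h_new = {}
    for v, h_v in h_prev.items():
        neigh_msgs = [W @ h_prev[u] for u in neighbors[v]]   # MSG for each u in N(v)
        self_msg = B @ h_v                                   # message from v itself
        h_new[v] = torch.relu(agg(neigh_msgs) + self_msg)    # AGG, then nonlinearity
    return h_new
```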
## Classical GNN Layers

Recap: a single GNN layer consists of a message computation and an aggregation step.
### Graph Convolutional Networks (GCN)

$$
{\rm h}_v^{(l)}=\sigma\left({\rm W}^{(l)}\sum_{u\in N(v)}\frac{{\rm h}_u^{(l-1)}}{|N(v)|}\right)
$$

In message / aggregation form:

- Message: ${\rm m}_u^{(l)} = \frac{1}{|N(v)|}{\rm W}^{(l)}{\rm h}_u^{(l-1)}$
- Aggregation: ${\rm h}_v^{(l)} = \sigma\left(\text{sum}\left(\{{\rm m}_u^{(l)},\ u\in N(v)\}\right)\right)$
### GraphSAGE

$$
{\rm h}_v^{(l)} = \sigma\left({\rm W}^{(l)}\cdot\text{CONCAT}\left(\text{AGG}\left(\{{\rm h}_u^{(l-1)},\ \forall u\in N(v)\}\right),\ {\rm h}_v^{(l-1)}\right)\right)
$$

The message is computed inside $\text{AGG}(\cdot)$.

Two-stage aggregation:

1. Aggregate over the neighbors: ${\rm h}_{N(v)}^{(l)}\leftarrow \text{AGG}\left(\{{\rm h}_u^{(l-1)},\ \forall u\in N(v)\}\right)$
2. Further aggregate with the node itself: ${\rm h}_{v}^{(l)}\leftarrow \sigma\left({\rm W}^{(l)}\cdot\text{CONCAT}\left({\rm h}_{N(v)}^{(l)},\ {\rm h}_v^{(l-1)}\right)\right)$
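A minimal sketch of a GraphSAGE layer with mean aggregation over a dense adjacency matrix (class and parameter names are illustrative):

```python
import torch

class SAGELayer(torch.nn.Module):
    """GraphSAGE layer: h_v = relu(W . CONCAT(mean of neighbors, h_v))."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = torch.nn.Linear(2 * d_in, d_out)   # W acts on the concatenated vector

    def forward(self, H, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        h_neigh = (A @ H) / deg                                     # stage 1: mean over N(v)
        return torch.relu(self.W(torch.cat([h_neigh, H], dim=-1)))  # stage 2: concat + transform
```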
#### Neighbor aggregation methods

- Mean: take a weighted average of the neighbors
  $$\text{AGG} = \sum_{u\in N(v)}\frac{{\rm h}_u^{(l-1)}}{|N(v)|}$$
- Pool: transform the neighbor vectors and apply a symmetric vector function such as $\text{mean}(\cdot)$ or $\text{max}(\cdot)$
  $$\text{AGG} = \text{mean}\left(\{\text{MLP}({\rm h}_u^{(l-1)}),\ \forall u\in N(v)\}\right)$$
- LSTM: apply an LSTM to a reshuffled (randomly permuted, $\pi$) sequence of neighbors
  $$\text{AGG} = \text{LSTM}\left(\left[{\rm h}_u^{(l-1)},\ \forall u\in \pi(N(v))\right]\right)$$
#### $\ell_2$ Normalization

$$
{\rm h}_v^{(l)} \leftarrow \frac{{\rm h}_v^{(l)}}{\left\|{\rm h}_v^{(l)}\right\|_2},\ \forall v\in V, \qquad \text{where } \|u\|_2 = \sqrt{\textstyle\sum_i u_i^2}
$$

💡 Without $\ell_2$ normalization, the embedding vectors have different scales ($\ell_2$-norms); after normalization, all embedding vectors have unit norm.
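In code, this is a one-line row-wise normalization of the embedding matrix (tensor shapes illustrative):

```python
import torch

H = torch.randn(10, 16)                                     # |V| x d node embeddings
H = H / H.norm(p=2, dim=-1, keepdim=True).clamp(min=1e-12)  # each row now has unit l2-norm
```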
### Graph Attention Networks

$$
{\rm h}_v^{(l)} = \sigma\left(\sum_{u\in N(v)}\alpha_{vu}{\rm W}^{(l)}{\rm h}_u^{(l-1)}\right)
$$

In GCN and GraphSAGE:

- $\alpha_{vu} = \frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$.
- $\alpha_{vu}$ is defined explicitly, based on the structural properties of the graph (node degree).
- All neighbors $u \in N(v)$ are equally important to node $v$.

Goal: assign arbitrary importance to the different neighbors of each node in the graph.

Idea: compute the embedding ${\rm h}_v^{(l)}$ of each node following an attention strategy:

- Nodes attend over their neighbors' messages.
- This implicitly assigns different weights to different nodes in a neighborhood.
#### Attention Mechanism

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:

1. Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u$, $v$ based on their messages:
   $$e_{vu} = a\left({\rm W}^{(l)}{\rm h}_u^{(l-1)},\ {\rm W}^{(l)}{\rm h}_v^{(l-1)}\right)$$
2. Normalize the coefficients with a softmax:
   $$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k\in N(v)}\exp(e_{vk})}$$
3. Compute the weighted sum based on the final attention weights $\alpha_{vu}$.
*(Figure from CS224W Lecture 5 slides)*
$$
\begin{aligned}
{\rm h}_A^{(l)} = \sigma\big(
&\alpha_{AB}{\rm W}^{(l)}{\rm h}_B^{(l-1)}\\
&+\alpha_{AC}{\rm W}^{(l)}{\rm h}_C^{(l-1)}\\
&+\alpha_{AD}{\rm W}^{(l)}{\rm h}_D^{(l-1)}\big)
\end{aligned}
$$
What form does the attention mechanism $a$ take? The approach is agnostic to the choice of $a$. A simple example is a single-layer neural network applied to the concatenated messages ${\rm W}^{(l)}{\rm h}_u^{(l-1)}$ and ${\rm W}^{(l)}{\rm h}_v^{(l-1)}$.

*(Figure from CS224W Lecture 5 slides)*
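A minimal sketch of this attention computation for a single node $v$, using the concat + linear form of $a$ mentioned above (all names are illustrative; the original GAT additionally applies a LeakyReLU to $e_{vu}$ before the softmax):

```python
import torch

def attention_weights(h_prev, v, neighbors, W, a_lin):
    """Compute alpha_{vu} for all neighbors u of node v.

    h_prev: (|V|, d_in) previous-layer embeddings
    W: torch.nn.Linear(d_in, d_out) shared message transform
    a_lin: torch.nn.Linear(2 * d_out, 1) scoring network, one simple choice for a
    """
    msg_v = W(h_prev[v])
    e = torch.stack([a_lin(torch.cat([W(h_prev[u]), msg_v])).squeeze()
                     for u in neighbors[v]])      # e_{vu} for each u in N(v)
    return torch.softmax(e, dim=0)                # alpha_{vu}, sums to 1 over N(v)
```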
#### Multi-head attention

Multi-head attention stabilizes the learning process of the attention mechanism (a single head can easily end up biased). Run several attention heads in parallel, each with its own attention weights:

$$
{\rm h}_v^{(l)}[i] = \sigma\left(\sum_{u\in N(v)}\alpha_{vu}^i{\rm W}^{(l)}{\rm h}_u^{(l-1)}\right), \qquad i = 1, 2, 3
$$

Then aggregate the outputs of the heads (e.g. by concatenation or summation):

$$
{\rm h}_v^{(l)} = \text{AGG}\left({\rm h}_v^{(l)}[1],\ {\rm h}_v^{(l)}[2],\ {\rm h}_v^{(l)}[3]\right)
$$
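Building on the `attention_weights` sketch above, a multi-head version just runs independent heads and aggregates their outputs (concatenation and mean shown as two possible choices):

```python
import torch

def multi_head_embedding(h_prev, v, neighbors, heads, combine="concat"):
    """heads: list of (W, a_lin) pairs, one per attention head (as in attention_weights)."""
    outs = []
    for W, a_lin in heads:
        alpha = attention_weights(h_prev, v, neighbors, W, a_lin)   # per-head alpha_{vu}
        msgs = torch.stack([W(h_prev[u]) for u in neighbors[v]])
        outs.append(torch.relu((alpha.unsqueeze(-1) * msgs).sum(dim=0)))
    return torch.cat(outs) if combine == "concat" else torch.stack(outs).mean(dim=0)
```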
Key benefit: allows for (implicitly) specifying different importance values ($\alpha_{vu}$) for different neighbors.

- Computationally efficient:
  - Computation of the attention coefficients can be parallelized across all edges of the graph
  - Aggregation can be parallelized across all nodes
- Storage efficient:
  - Sparse matrix operations require no more than $O(V+E)$ entries to be stored
  - Fixed number of parameters, irrespective of graph size
- Localized:
  - Only attends over local network neighborhoods
- Inductive capability:
  - It is a shared edge-wise mechanism
  - It does not depend on the global graph structure
### Batch Normalization

Given a batch of $N$ node embeddings ${\rm X} \in {\Bbb R}^{N \times d}$, normalize each feature dimension $j$:

$$
\mu_j = \frac{1}{N} \sum_{i=1}^N{\rm X}_{i,j},\qquad \sigma_j^2 = \frac{1}{N} \sum_{i=1}^N\left({\rm X}_{i,j}-\mu_j\right)^2
$$

$$
\widehat{\rm X}_{i,j}=\frac{{\rm X}_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}},\qquad {\rm Y}_{i,j}= \gamma_j \widehat{\rm X}_{i,j} + \beta_j
$$

- $\gamma, \beta$: trainable re-scaling and re-centering parameters
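A minimal functional sketch of these two steps (in practice one would simply apply `torch.nn.BatchNorm1d(d)` to the embeddings):

```python
import torch

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch-normalize node embeddings X of shape (N, d), feature-wise."""
    mu = X.mean(dim=0)                      # mu_j over the batch
    var = X.var(dim=0, unbiased=False)      # sigma_j^2 over the batch
    X_hat = (X - mu) / torch.sqrt(var + eps)
    return gamma * X_hat + beta             # trainable re-scale / re-center
```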
## Stacking Layers of a GNN

How do we construct a graph neural network? The standard way is to stack GNN layers sequentially.

The issue with stacking many GNN layers is over-smoothing: the receptive fields of different nodes overlap almost entirely, so their embeddings converge to very similar values.

What do we learn from the over-smoothing problem? Be cautious when adding GNN layers:

- Step 1: Analyze the receptive field necessary to solve your problem.
- Step 2: Set the number of GNN layers $L$ to be a bit more than the receptive field we like. Do not set $L$ to be unnecessarily large!

How can we enhance the expressive power of a GNN if the number of GNN layers is small?

- Make the aggregation / transformation a deep neural network.
- Add layers that do not pass messages (a GNN does not have to contain only GNN layers):
  - Pre-processing layers: important when encoding node features is necessary, e.g., when nodes represent images or text.
  - Post-processing layers: important when reasoning / transformation over node embeddings is needed, e.g., graph classification, knowledge graphs.
What if my problem still requires many GNN layers? Add skip connections in the GNN.

- Key observation: node embeddings in earlier GNN layers can sometimes better differentiate nodes.
- Solution: increase the impact of earlier layers on the final node embeddings by adding shortcuts: $F({\rm x}) + {\rm x}$.
- Skip connections create a mixture of models: $N$ skip connections give $2^N$ possible paths, so we automatically get a mixture of shallow and deep GNNs.
- Another option: skip directly to the last layer.
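A minimal sketch of a GCN layer wrapped with such a shortcut (assumes equal input and output dimensions so the residual addition is shape-compatible):

```python
import torch
import torch.nn.functional as F

class GCNWithSkip(torch.nn.Module):
    """GCN layer with a residual (skip) connection: h' = F(h) + h."""
    def __init__(self, d):
        super().__init__()
        self.W = torch.nn.Linear(d, d)

    def forward(self, H, A_tilde):
        return F.relu(self.W(A_tilde @ H)) + H   # F(x) + x
```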
## Graph Augmentation

Why do we need to manipulate graphs? Our assumption so far has been: raw input graph = computational graph. Reasons for breaking this assumption:

- Feature level:
  - The input graph lacks features $\rightarrow$ feature augmentation
  - Certain structures are hard for a GNN to learn (e.g., cycles)
- Structure level:
  - The graph is too sparse (inefficient message passing) $\rightarrow$ add virtual nodes / edges (see the sketch after this list). Suppose two nodes in a sparse graph have a shortest-path distance of 10; after adding a virtual node connected to every node, all node pairs are at distance 2 (Node A - virtual node - Node B), which greatly improves message passing in sparse graphs.
  - The graph is too dense (message passing is too costly) $\rightarrow$ sample neighbors when doing message passing
  - The graph is too large (the computational graph cannot fit into a GPU) $\rightarrow$ sample subgraphs to compute embeddings
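As one concrete illustration of the virtual-node trick mentioned above, a minimal sketch that augments a dense adjacency matrix with a single virtual node connected to every real node (the function name and the zero-initialized virtual-node features are assumptions):

```python
import torch

def add_virtual_node(A, X):
    """Return (A, X) augmented with one virtual node linked to all real nodes.

    A: (|V|, |V|) adjacency, X: (|V|, d) features. After augmentation, any pair of
    real nodes is at most 2 hops apart via the virtual node.
    """
    n = A.shape[0]
    A_aug = torch.zeros(n + 1, n + 1)
    A_aug[:n, :n] = A
    A_aug[n, :n] = 1.0                    # virtual node -> all real nodes
    A_aug[:n, n] = 1.0                    # all real nodes -> virtual node
    X_aug = torch.cat([X, torch.zeros(1, X.shape[1])], dim=0)
    return A_aug, X_aug
```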