Graphs - Embedding

Summary of GNN lectures

Node Embeddings#

Graph Representation Learning alleviates the need to do feature engineering every single time.

Goal: Efficient task-independent feature learning for machine learning with graphs!

$f: u \rightarrow \mathbb{R}^d$

Task: Map nodes into an embedding space

  • Similarity of embeddings between nodes indicates their similarity in the network.
    • For example: Both nodes are close to each other (connected by an edge)
  • Encode network information
  • Potentially used for many downstream predictions
$$\text{similarity}(u,v) \approx {\bf z}_v^{\intercal}{\bf z}_u$$

Encoder maps from nodes to embeddings

Decoder $\text{DEC}$ maps from embeddings to the similarity score.

Shallow encoders (embedding lookup): DeepWalk, node2vec
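To make the encoder/decoder picture concrete, here is a minimal numpy sketch (not lecture code) of a shallow encoder, i.e., a plain embedding-lookup table, paired with a dot-product decoder; the graph size, dimension, and random initialization are assumed for illustration:

```python
import numpy as np

num_nodes, dim = 5, 8                   # toy sizes (assumed)
rng = np.random.default_rng(0)

# Shallow encoder: one learnable vector per node, stored in a lookup table.
Z = rng.normal(size=(num_nodes, dim))   # row u is z_u

def encode(u):
    """ENC(u): look up the embedding of node u."""
    return Z[u]

def decode(z_u, z_v):
    """DEC: dot-product similarity score between two embeddings."""
    return float(z_u @ z_v)

# Similarity score of nodes 0 and 1 in the embedding space.
print(decode(encode(0), encode(1)))
```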

This is an unsupervised/self-supervised way of learning node embeddings.

  • We are not utilizing node labels and features.
  • The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by $\text{DEC}$) is preserved.

These embeddings are task-independent

  • They are not trained for a specific task but can be used for any task.

Random Walk Approaches for Node Embeddings#

${\bf z}_u$: the embedding of node $u$ (what we aim to find).

${\rm P}(v \mid {\bf z}_u)$: the (predicted) probability of visiting node $v$ on random walks starting from node $u$.

Non-linear functions (softmax, sigmoid) are used to produce the predicted probabilities.

Estimate the probability of visiting node $v$ on a random walk starting from node $u$ using some random walk strategy $R$.

(Figure: CS224W Lecture 3 slide)

Optimize embeddings to encode these random walk statistics:

(Figure: CS224W Lecture 3 slide)

Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.

Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks.

$N_{\rm R}(u)$ … neighbourhood of $u$ obtained by some random walk strategy $R$

$G = (V, E)$

$$\max_f \sum_{u\in V}\log {\rm P}(N_{\rm R}(u) \mid {\bf z}_u) \quad \small\text{(maximum likelihood objective)}$$
  1. Run short fixed-length random walks starting from each node $u$ in the graph using some random walk strategy $\rm R$.
  2. For each node $u$ collect $N_{\rm R}(u)$, the multiset of nodes visited on random walks starting from $u$ (a node can appear multiple times).
  3. Optimize embeddings according to: given node $u$, predict its neighbors $N_{\rm R}(u)$.

Equivalently,

$${\cal L}=\sum_{u\in V}\sum_{v\in N_{\rm R}(u)} -\log {\rm P}(v \mid {\bf z}_u)$$

Intuition: Optimize embeddings ${\bf z}_u$ to maximize the likelihood of random walk co-occurrences.
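As a rough sketch of steps 1-2 above (running the walks and collecting the co-occurrences that the loss sums over), assuming an adjacency-list toy graph and unbiased DeepWalk-style walks; the helper names and walk hyperparameters are illustrative, not from the lecture:

```python
import random
from collections import defaultdict

# Toy undirected graph as an adjacency list (assumed example).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def random_walk(graph, start, walk_length=5):
    """Unbiased fixed-length random walk (the simplest strategy R)."""
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Steps 1-2: run a few walks per node and collect N_R(u) as a multiset.
neighborhoods = defaultdict(list)
for u in graph:
    for _ in range(3):                       # walks per node (assumed)
        for v in random_walk(graph, u)[1:]:  # nodes co-occurring with u on the walk
            neighborhoods[u].append(v)

print(neighborhoods[0])  # multiset N_R(0); repeated visits are kept
```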

Parameterize the probability using a softmax, because we want node $v$ to be most similar to node $u$ (out of all nodes $n$).

$${\rm P}(v \mid {\bf z}_u) = \frac{\exp({\bf z}_u^\intercal {\bf z}_v)}{\sum_{n \in V} \exp({\bf z}_u^\intercal {\bf z}_n)}$$

But doing this naively is too expensive: summing over all nodes in the softmax denominator for every training pair gives $O(|V|^2)$ complexity!
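For concreteness, a naive numpy evaluation of this softmax (a sketch with assumed toy embeddings, not lecture code); note that the normalizer already touches every node, which is where the quadratic cost comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))   # toy embeddings, one row per node (assumed)

def p_v_given_u(Z, u, v):
    """Softmax-parameterized P(v | z_u); the normalizer sums over ALL nodes."""
    scores = Z @ Z[u]                 # z_u . z_n for every node n
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[v]

print(p_v_given_u(Z, u=0, v=1))
```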

Solution: Negative sampling#

  • Why is the approximation valid?

    Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using logistic regression (a sigmoid function) to distinguish the target node $v$ from nodes $n_i$ sampled from the background distribution $P_V$.

    More at word2vec Explained, arXiv

$$\log\left(\frac{\exp({\bf z}_u^\intercal {\bf z}_v)}{\sum_{n \in V} \exp({\bf z}_u^\intercal {\bf z}_n)}\right) \approx \log\sigma({\bf z}_u^\intercal {\bf z}_v) - \sum_{i=1}^{k}\log\sigma({\bf z}_u^\intercal {\bf z}_{n_i}), \quad n_i\sim {\rm P}_V$$

w.r.t.: with respect to

Instead of normalizing w.r.t. all nodes, just normalize against $k$ random "negative samples" $n_i$.

Two considerations for $k$ (the number of negative samples):

  1. Higher $k$ gives more robust estimates
  2. Higher $k$ corresponds to higher bias on negative events

In practice $k$ = 5-20.
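A minimal numpy sketch of the per-pair negative-sampling loss, following the approximation exactly as written above (the original word2vec formulation uses $\sigma(-{\bf z}_u^\intercal{\bf z}_{n_i})$ for the negative terms); uniform negative sampling and the toy embedding matrix are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 16))   # toy embedding matrix, one row per node (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, k=5):
    """-[log sigma(z_u.z_v) - sum_i log sigma(z_u.z_{n_i})] for one pair (u, v)."""
    negatives = rng.choice(len(Z), size=k, replace=False)   # k random negative nodes
    pos_term = np.log(sigmoid(Z[u] @ Z[v]))
    neg_term = np.sum(np.log(sigmoid(Z[u] @ Z[negatives].T)))
    return -(pos_term - neg_term)

print(neg_sampling_loss(Z, u=0, v=3))
```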

Can a negative sample be any node, or only nodes not on the walk?
People often use any node (for efficiency). However, the most "correct" way is to use nodes not on the walk.

Summary#

  1. Run short fixed-length random walks starting from each node on the graph
  2. For each node $u$ collect $N_{\rm R}(u)$, the multiset of nodes visited on random walks starting from $u$.
  3. Optimize embeddings using Stochastic Gradient Descent

What strategies should we use to run these random walks?

  • Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014); the issue is that such a notion of similarity is too constrained.

node2vec#

Grover and Leskovec, node2vec: Scalable Feature Learning for Networks. KDD 2016. (Paper PDF)

Goal: Embed nodes with similar network neighborhoods close in the feature space.

Key observation: Flexible notion of network neighborhood $N_{\rm R}(u)$ of node $u$ leads to rich node embeddings.

Develop a biased 2nd-order random walk $\rm R$ to generate the network neighborhood $N_{\rm R}(u)$ of node $u$.

Idea: use flexible, biased random walks that can trade off between local (BFS) and global (DFS) views of the network.

Biased fixed-length random walk $\rm R$ that, given a node $u$, generates the neighborhood $N_{\rm R}(u)$.

Two parameters:

  • Return parameter $p$: return back to the previous node
  • In-out parameter $q$: moving outwards (DFS) vs. inwards (BFS)

Intuitively, $q$ is the "ratio" of BFS vs. DFS.
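A rough sketch of a single biased walk step with the unnormalized 2nd-order weights controlled by $p$ and $q$, assuming an unweighted adjacency-list graph; this is an illustrative reimplementation, not the reference node2vec code:

```python
import random

def node2vec_step(graph, prev, curr, p=1.0, q=1.0):
    """Sample the next node of a biased 2nd-order walk standing at edge (prev -> curr)."""
    neighbors = graph[curr]
    weights = []
    for nxt in neighbors:
        if nxt == prev:               # return to the previous node
            weights.append(1.0 / p)
        elif nxt in graph[prev]:      # stays at distance 1 from prev (BFS-like move)
            weights.append(1.0)
        else:                         # moves away from prev (DFS-like move)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

# Toy graph (assumed); a low q biases the walk outwards, a low p biases it back.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(node2vec_step(graph, prev=0, curr=1, p=1.0, q=0.5))
```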

Process:

  1. Compute random walk probabilities
  2. Simulate $r$ random walks of length $l$ starting from each node $u$.
  3. Optimize the node2vec objective using Stochastic Gradient Descent

Other Random Walk ideas#

  • Different kinds of biased random walks:
    • Based on node attributes (Dong et al., 2017)
    • Based on learned weights (Abu-El-Haija et al., 2017)
  • Alternative optimization schemes:
    • Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al. 2015).
  • Network preprocessing techniques:
    • Run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017’s struc2vec, Chen et al. 2016’s HARP).

Embedding Entire Graphs#

Goal: Want to embed a subgraph or an entire graph $G$. Graph embedding: ${\bf z}_G$.

Tasks: Classifying toxic vs. non-toxic molecules, Identifying anomalous graphs

  • Approach 1: Simple (but effective)
    • Run a standard graph embedding technique on the (sub)graph $G$.
    • Then just sum (or average) the node embeddings in the (sub)graph $G$ (see the sketch after this list).
      https://arxiv.org/abs/1509.09292
  • Approach 2: Super-node
  • Approach 3: Hierarchical Embeddings
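A minimal sketch of Approach 1, assuming a node-embedding matrix `Z` for the (sub)graph has already been computed (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))   # embeddings of the 6 nodes in the (sub)graph (assumed)

z_G_sum = Z.sum(axis=0)        # Approach 1: sum of the node embeddings
z_G_avg = Z.mean(axis=0)       # ... or their average

print(z_G_sum.shape, z_G_avg.shape)
```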

Node Embeddings and Matrix Factorization#

Embedding matrix $\bf Z$ with shape (embedding dimension, number of nodes); each column is a node embedding ${\bf z}_u$.

Objective: maximize ${\bf z}_v^\intercal {\bf z}_u$ for node pairs $(u, v)$ that are similar

${\bf z}_v^\intercal {\bf z}_u = A_{u,v}$, which is the $(u, v)$ entry of the graph adjacency matrix $A$.

Exact factorization ${\bf A = Z^{\intercal}Z}$ is generally not possible.

However, we can learn $\bf Z$ approximately!

Objective: $\displaystyle\min_{\bf Z} \|{\bf A-Z^{\intercal}Z}\|_2$

  • We optimize $\bf Z$ such that it minimizes the L2 norm (Frobenius norm) of ${\bf A-Z^{\intercal}Z}$
  • Note that today we used softmax instead of L2, but the goal of approximating $\bf A$ with $\bf Z^{\intercal}Z$ is the same.

Conclusion: Inner product decoder with node similarity defined by edge connectivity is equivalent to matrix factorization of $\bf A$.
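To make the objective concrete, a small numpy sketch that fits $\bf Z$ by gradient descent on $\|{\bf A - Z^{\intercal}Z}\|_F^2$ for a toy adjacency matrix; the graph, dimension, learning rate, and iteration count are all assumptions, and the error only shrinks toward the best approximation achievable with a rank-$d$, positive semi-definite ${\bf Z^{\intercal}Z}$:

```python
import numpy as np

# Toy undirected graph: adjacency matrix A (assumed example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = 2                                        # embedding dimension (assumed)
rng = np.random.default_rng(0)
Z = 0.1 * rng.normal(size=(d, A.shape[0]))   # columns of Z are node embeddings

print("initial error:", np.linalg.norm(A - Z.T @ Z))
lr = 0.01
for _ in range(2000):
    R = A - Z.T @ Z      # residual
    Z += lr * (Z @ R)    # gradient of ||A - Z^T Z||_F^2 w.r.t. Z is -4 Z R for symmetric A (constant folded into lr)
print("final error:  ", np.linalg.norm(A - Z.T @ Z))
```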

Random Walk-based Similarity#

DeepWalk and node2vec have a more complex node similarity definition based on random walks. DeepWalk, for example, is equivalent to matrix factorization of the following matrix (Qiu et al., WSDM 2018):

$$\log\left({\rm vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}(D^{-1}A)^r\right)D^{-1} \right)-\log b$$

${\rm vol}(G) = \sum_i\sum_j A_{i,j}$: volume of the graph

$T = |N_{\rm R}(u)|$: context window size

$r$: power of the normalized adjacency matrix

$D_{u,u} = {\rm deg}(u)$: diagonal degree matrix

$b$: number of negative samples
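As a sanity-check sketch, the matrix being factorized can be computed directly with numpy for a small connected toy graph; $T$, $b$, the graph, and the clipping inside the log (to guard against zero entries for node pairs never reached within $T$ steps) are assumptions:

```python
import numpy as np

# Toy connected graph (assumed); A is its adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

T, b = 10, 5                              # context window size and # negative samples (assumed)
vol = A.sum()                             # vol(G) = sum_ij A_ij
D_inv = np.diag(1.0 / A.sum(axis=1))      # D^{-1}, where D_uu = deg(u)
P = D_inv @ A                             # normalized adjacency (random-walk transition matrix)

walk_sum = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / T
M = vol * walk_sum @ D_inv                # matrix inside the log
target = np.log(np.maximum(M, 1e-12)) - np.log(b)   # clip to avoid log(0)

print(target.round(2))
```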

Examples of Embeddings#

  • Clustering/community detection: Cluster points $z_i$

  • Node classification: Predict the label of node $i$ based on $z_i$

  • Link prediction: Predict edge $(i, j)$ based on $(z_i, z_j)$

    • Where we can concatenate, average, take the product, or take a difference between the embeddings (a small sketch of these operators follows this list):
      • Concatenate: $f(z_i, z_j) = g([z_i, z_j])$
      • Hadamard: $f(z_i, z_j) = g(z_i * z_j)$ (per-coordinate product)
      • Sum/Avg: $f(z_i, z_j) = g(z_i + z_j)$
      • Distance: $f(z_i, z_j) = g(\|z_i - z_j\|_2)$
  • Graph classification: Graph embedding $z_G$ via aggregating node embeddings or anonymous random walks. Predict the label based on the graph embedding $z_G$.
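A minimal numpy sketch of the four edge-feature operators for link prediction (here $g$ is left as the identity; in practice it would be a downstream classifier), with toy embeddings assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=16), rng.normal(size=16)   # embeddings of nodes i and j (assumed)

concat   = np.concatenate([z_i, z_j])   # Concatenate: [z_i, z_j]
hadamard = z_i * z_j                    # Hadamard: per-coordinate product
average  = (z_i + z_j) / 2              # Sum/Avg
distance = np.linalg.norm(z_i - z_j)    # Distance: ||z_i - z_j||_2

print(concat.shape, hadamard.shape, average.shape, float(distance))
```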

Limitations#

  1. Cannot obtain embeddings for nodes not in the training set, i.e., newly added nodes (e.g., a new user in a social network).
  2. DeepWalk and node2vec do not capture structural similarity.
  3. Cannot utilize node, edge and graph features.

Solution to these limitations: Deep Representation Learning and Graph Neural Networks