Graphs - Embedding

Summary of GNN lectures

Node Embeddings#

Graph Representation Learning alleviates the need to do feature engineering every single time.

Goal: Efficient task-independent feature learning for machine learning with graphs!

$f: u \rightarrow \mathbb{R}^d$

Task: Map nodes into an embedding space

  • Similarity of embeddings between nodes indicates their similarity in the network.
    • For example: Both nodes are close to each other (connected by an edge)
  • Encode network information
  • Potentially used for many downstream predictions
$$\text{similarity}(u,v) \approx {\bf z}_v^{\intercal}{\bf z}_u$$

Encoder maps from nodes to embeddings

Decoder $\text{DEC}$ maps from embeddings to the similarity score.

Shallow encoders (embedding lookup): DeepWalk, node2vec
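To make the encoder/decoder picture concrete, here is a minimal numpy sketch (not lecture code) of a shallow encoder, i.e., a plain embedding-lookup table, paired with a dot-product decoder; the graph size, dimension, and random initialization are assumed for illustration:

```python
import numpy as np

num_nodes, dim = 5, 8                   # toy sizes (assumed)
rng = np.random.default_rng(0)

# Shallow encoder: one learnable vector per node, stored in a lookup table.
Z = rng.normal(size=(num_nodes, dim))   # row u is z_u

def encode(u):
    """ENC(u): look up the embedding of node u."""
    return Z[u]

def decode(z_u, z_v):
    """DEC: dot-product similarity score between two embeddings."""
    return float(z_u @ z_v)

# Similarity score of nodes 0 and 1 in the embedding space.
print(decode(encode(0), encode(1)))
```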

This is an unsupervised/self-supervised way of learning node embeddings.

  • We are not utilizing node labels and features.
  • The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by $\text{DEC}$) is preserved.

These embeddings are task-independent

  • They are not trained for a specific task but can be used for any task.

Random Walk Approaches for Node Embeddings#

${\bf z}_u$: the embedding of node $u$ (what we aim to find).

${\rm P}(v \mid {\bf z}_u)$: the (predicted) probability of visiting node $v$ on random walks starting from node $u$.

Non-linear functions (softmax, sigmoid) are used to produce the predicted probabilities.

Estimate the probability of visiting node $v$ on a random walk starting from node $u$ using some random walk strategy $R$.

(Figure: CS224W Lecture 3 slide)

Optimize embeddings to encode these random walk statistics:

(Figure: CS224W Lecture 3 slide)

Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.

Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks.

$N_{\rm R}(u)$ … neighbourhood of $u$ obtained by some random walk strategy $R$

$G = (V, E)$

$$\max_f \sum_{u\in V}\log {\rm P}(N_{\rm R}(u) \mid {\bf z}_u) \quad \small\text{(maximum likelihood objective)}$$
  1. Run short fixed-length random walks starting from each node $u$ in the graph using some random walk strategy $\rm R$.
  2. For each node $u$ collect $N_{\rm R}(u)$, the multiset of nodes visited on random walks starting from $u$ (a node can appear multiple times).
  3. Optimize embeddings according to: given node $u$, predict its neighbors $N_{\rm R}(u)$.

Equivalently,

$${\cal L}=\sum_{u\in V}\sum_{v\in N_{\rm R}(u)} -\log {\rm P}(v \mid {\bf z}_u)$$

Intuition: Optimize embeddings ${\bf z}_u$ to maximize the likelihood of random walk co-occurrences.
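As a rough sketch of steps 1-2 above (running the walks and collecting the co-occurrences that the loss sums over), assuming an adjacency-list toy graph and unbiased DeepWalk-style walks; the helper names and walk hyperparameters are illustrative, not from the lecture:

```python
import random
from collections import defaultdict

# Toy undirected graph as an adjacency list (assumed example).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def random_walk(graph, start, walk_length=5):
    """Unbiased fixed-length random walk (the simplest strategy R)."""
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Steps 1-2: run a few walks per node and collect N_R(u) as a multiset.
neighborhoods = defaultdict(list)
for u in graph:
    for _ in range(3):                       # walks per node (assumed)
        for v in random_walk(graph, u)[1:]:  # nodes co-occurring with u on the walk
            neighborhoods[u].append(v)

print(neighborhoods[0])  # multiset N_R(0); repeated visits are kept
```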

Parameterize the probability using a softmax, because we want node $v$ to be most similar to node $u$ (out of all nodes $n$).

$${\rm P}(v \mid {\bf z}_u) = \frac{\exp({\bf z}_u^\intercal {\bf z}_v)}{\sum_{n \in V} \exp({\bf z}_u^\intercal {\bf z}_n)}$$

But doing this naively is too expensive: summing over all nodes in the softmax denominator for every training pair gives $O(|V|^2)$ complexity!
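For concreteness, a naive numpy evaluation of this softmax (a sketch with assumed toy embeddings, not lecture code); note that the normalizer already touches every node, which is where the quadratic cost comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))   # toy embeddings, one row per node (assumed)

def p_v_given_u(Z, u, v):
    """Softmax-parameterized P(v | z_u); the normalizer sums over ALL nodes."""
    scores = Z @ Z[u]                 # z_u . z_n for every node n
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[v]

print(p_v_given_u(Z, u=0, v=1))
```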

Solution: Negative sampling#

  • Why is the approximation valid?

    Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using logistic regression (a sigmoid function) to distinguish the target node $v$ from nodes $n_i$ sampled from the background distribution $P_V$.

    More at word2vec Explained, arXiv

$$\log\left(\frac{\exp({\bf z}_u^\intercal {\bf z}_v)}{\sum_{n \in V} \exp({\bf z}_u^\intercal {\bf z}_n)}\right) \approx \log\sigma({\bf z}_u^\intercal {\bf z}_v) - \sum_{i=1}^{k}\log\sigma({\bf z}_u^\intercal {\bf z}_{n_i}), \quad n_i\sim {\rm P}_V$$

w.r.t.: with respect to

Instead of normalizing w.r.t. all nodes, just normalize against $k$ random "negative samples" $n_i$.

Two considerations for $k$ (the number of negative samples):

  1. Higher $k$ gives more robust estimates
  2. Higher $k$ corresponds to higher bias on negative events

In practice $k$ = 5-20.
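A minimal numpy sketch of the per-pair negative-sampling loss, following the approximation exactly as written above (the original word2vec formulation uses $\sigma(-{\bf z}_u^\intercal{\bf z}_{n_i})$ for the negative terms); uniform negative sampling and the toy embedding matrix are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 16))   # toy embedding matrix, one row per node (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, k=5):
    """-[log sigma(z_u.z_v) - sum_i log sigma(z_u.z_{n_i})] for one pair (u, v)."""
    negatives = rng.choice(len(Z), size=k, replace=False)   # k random negative nodes
    pos_term = np.log(sigmoid(Z[u] @ Z[v]))
    neg_term = np.sum(np.log(sigmoid(Z[u] @ Z[negatives].T)))
    return -(pos_term - neg_term)

print(neg_sampling_loss(Z, u=0, v=3))
```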

Can a negative sample be any node, or only nodes not on the walk?
People often use any node (for efficiency). However, the most "correct" way is to use nodes not on the walk.

Summary#

  1. Run short fixed-length random walks starting from each node on the graph
  2. For each node $u$ collect $N_{\rm R}(u)$, the multiset of nodes visited on random walks starting from $u$.
  3. Optimize embeddings using Stochastic Gradient Descent

What strategies should we use to run these random walks?

  • Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014); the issue is that such a notion of similarity is too constrained.

node2vec#

Grover and Leskovec, node2vec: Scalable Feature Learning for Networks. KDD 2016. (Paper PDF)

Goal: Embed nodes with similar network neighborhoods close in the feature space.

Key observation: Flexible notion of network neighborhood $N_{\rm R}(u)$ of node $u$ leads to rich node embeddings.

Develop a biased 2nd-order random walk $\rm R$ to generate the network neighborhood $N_{\rm R}(u)$ of node $u$.

Idea: use flexible, biased random walks that can trade off between local (BFS) and global (DFS) views of the network.

Biased fixed-length random walk $\rm R$ that, given a node $u$, generates the neighborhood $N_{\rm R}(u)$.

Two parameters:

  • Return parameter $p$: return back to the previous node
  • In-out parameter $q$: moving outwards (DFS) vs. inwards (BFS)

Intuitively, $q$ is the "ratio" of BFS vs. DFS.
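A rough sketch of a single biased walk step with the unnormalized 2nd-order weights controlled by $p$ and $q$, assuming an unweighted adjacency-list graph; this is an illustrative reimplementation, not the reference node2vec code:

```python
import random

def node2vec_step(graph, prev, curr, p=1.0, q=1.0):
    """Sample the next node of a biased 2nd-order walk standing at edge (prev -> curr)."""
    neighbors = graph[curr]
    weights = []
    for nxt in neighbors:
        if nxt == prev:               # return to the previous node
            weights.append(1.0 / p)
        elif nxt in graph[prev]:      # stays at distance 1 from prev (BFS-like move)
            weights.append(1.0)
        else:                         # moves away from prev (DFS-like move)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

# Toy graph (assumed); a low q biases the walk outwards, a low p biases it back.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(node2vec_step(graph, prev=0, curr=1, p=1.0, q=0.5))
```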

Process:

  1. Compute random walk probabilities
  2. Simulate $r$ random walks of length $l$ starting from each node $u$.
  3. Optimize the node2vec objective using Stochastic Gradient Descent

Other Random Walk ideas#

  • Different kinds of biased random walks:
    • Based on node attributes (Dong et al., 2017)
    • Based on learned weights (Abu-El-Haija et al., 2017)
  • Alternative optimization schemes:
    • Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al. 2015).
  • Network preprocessing techniques:
    • Run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017’s struc2vec, Chen et al. 2016’s HARP).

Embedding Entire Graphs#

Goal: Want to embed a subgraph or an entire graph $G$. Graph embedding: ${\bf z}_G$.

Tasks: Classifying toxic vs. non-toxic molecules, Identifying anomalous graphs

  • Approach 1: Simple (but effective)
    • Run a standard graph embedding technique on the (sub)graph $G$.
    • Then just sum (or average) the node embeddings in the (sub)graph $G$ (see the sketch after this list).
      https://arxiv.org/abs/1509.09292
  • Approach 2: Super-node
  • Approach 3: Hierarchical Embeddings
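A minimal sketch of Approach 1, assuming a node-embedding matrix `Z` for the (sub)graph has already been computed (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))   # embeddings of the 6 nodes in the (sub)graph (assumed)

z_G_sum = Z.sum(axis=0)        # Approach 1: sum of the node embeddings
z_G_avg = Z.mean(axis=0)       # ... or their average

print(z_G_sum.shape, z_G_avg.shape)
```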

Node Embeddings and Matrix Factorization#

Embedding matrix $\bf Z$ with shape (embedding dimension, number of nodes); each column is a node embedding ${\bf z}_u$.

Objective: maximize ${\bf z}_v^\intercal {\bf z}_u$ for node pairs $(u, v)$ that are similar

${\bf z}_v^\intercal {\bf z}_u = A_{u,v}$, which is the $(u, v)$ entry of the graph adjacency matrix $A$.

Exact factorization ${\bf A = Z^{\intercal}Z}$ is generally not possible.

However, we can learn $\bf Z$ approximately!

Objective: $\displaystyle\min_{\bf Z} \|{\bf A-Z^{\intercal}Z}\|_2$

  • We optimize $\bf Z$ such that it minimizes the L2 norm (Frobenius norm) of ${\bf A-Z^{\intercal}Z}$
  • Note that today we used softmax instead of L2, but the goal of approximating $\bf A$ with $\bf Z^{\intercal}Z$ is the same.

Conclusion: Inner product decoder with node similarity defined by edge connectivity is equivalent to matrix factorization of $\bf A$.
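To make the objective concrete, a small numpy sketch that fits $\bf Z$ by gradient descent on $\|{\bf A - Z^{\intercal}Z}\|_F^2$ for a toy adjacency matrix; the graph, dimension, learning rate, and iteration count are all assumptions, and the error only shrinks toward the best approximation achievable with a rank-$d$, positive semi-definite ${\bf Z^{\intercal}Z}$:

```python
import numpy as np

# Toy undirected graph: adjacency matrix A (assumed example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = 2                                        # embedding dimension (assumed)
rng = np.random.default_rng(0)
Z = 0.1 * rng.normal(size=(d, A.shape[0]))   # columns of Z are node embeddings

print("initial error:", np.linalg.norm(A - Z.T @ Z))
lr = 0.01
for _ in range(2000):
    R = A - Z.T @ Z      # residual
    Z += lr * (Z @ R)    # gradient of ||A - Z^T Z||_F^2 w.r.t. Z is -4 Z R for symmetric A (constant folded into lr)
print("final error:  ", np.linalg.norm(A - Z.T @ Z))
```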

Random Walk-based Similarity#

DeepWalk and node2vec have a more complex node similarity definition based on random walks. DeepWalk, for example, is equivalent to matrix factorization of the following matrix (Qiu et al., WSDM 2018):

$$\log\left({\rm vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}(D^{-1}A)^r\right)D^{-1} \right)-\log b$$

${\rm vol}(G) = \sum_i\sum_j A_{i,j}$: volume of the graph

$T = |N_{\rm R}(u)|$: context window size

$r$: power of the normalized adjacency matrix

$D_{u,u} = {\rm deg}(u)$: diagonal degree matrix

$b$: number of negative samples
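As a sanity-check sketch, the matrix being factorized can be computed directly with numpy for a small connected toy graph; $T$, $b$, the graph, and the clipping inside the log (to guard against zero entries for node pairs never reached within $T$ steps) are assumptions:

```python
import numpy as np

# Toy connected graph (assumed); A is its adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

T, b = 10, 5                              # context window size and # negative samples (assumed)
vol = A.sum()                             # vol(G) = sum_ij A_ij
D_inv = np.diag(1.0 / A.sum(axis=1))      # D^{-1}, where D_uu = deg(u)
P = D_inv @ A                             # normalized adjacency (random-walk transition matrix)

walk_sum = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / T
M = vol * walk_sum @ D_inv                # matrix inside the log
target = np.log(np.maximum(M, 1e-12)) - np.log(b)   # clip to avoid log(0)

print(target.round(2))
```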

Examples of Embeddings#

  • Clustering/community detection: Cluster points $z_i$

  • Node classification: Predict the label of node $i$ based on $z_i$

  • Link prediction: Predict edge $(i, j)$ based on $(z_i, z_j)$

    • Where we can concatenate, average, take the product, or take a difference between the embeddings (a small sketch of these operators follows this list):
      • Concatenate: $f(z_i, z_j) = g([z_i, z_j])$
      • Hadamard: $f(z_i, z_j) = g(z_i * z_j)$ (per-coordinate product)
      • Sum/Avg: $f(z_i, z_j) = g(z_i + z_j)$
      • Distance: $f(z_i, z_j) = g(\|z_i - z_j\|_2)$
  • Graph classification: Graph embedding $z_G$ via aggregating node embeddings or anonymous random walks. Predict the label based on the graph embedding $z_G$.
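A minimal numpy sketch of the four edge-feature operators for link prediction (here $g$ is left as the identity; in practice it would be a downstream classifier), with toy embeddings assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=16), rng.normal(size=16)   # embeddings of nodes i and j (assumed)

concat   = np.concatenate([z_i, z_j])   # Concatenate: [z_i, z_j]
hadamard = z_i * z_j                    # Hadamard: per-coordinate product
average  = (z_i + z_j) / 2              # Sum/Avg
distance = np.linalg.norm(z_i - z_j)    # Distance: ||z_i - z_j||_2

print(concat.shape, hadamard.shape, average.shape, float(distance))
```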

Limitations#

  1. Cannot obtain embeddings for nodes not in the training set, i.e., newly added nodes (e.g., a new user in a social network).
  2. DeepWalk and node2vec do not capture structural similarity.
  3. Cannot utilize node, edge and graph features.

Solution to these limitations: Deep Representation Learning and Graph Neural Networks