Node Embeddings#
Graph Representation Learning alleviates the need to do feature engineering every single time.
Goal: Efficient task-independent feature learning for machine learning with graphs!
Task: Map nodes into an embedding space, $f: u \rightarrow \mathbb{R}^d$
- Similarity of embeddings between nodes indicates their similarity in the network.
- For example, two nodes that are close to each other in the network (e.g., connected by an edge) should have similar embeddings.
- Encode network information
- Potentially used for many downstream predictions
$\text{similarity}(u, v) \approx \mathbf{z}_v^\top \mathbf{z}_u$
Encoder ENC maps from nodes to embeddings.
Decoder DEC maps from embeddings to the similarity score.
Shallow encoder (just an embedding lookup): DeepWalk, node2vec.
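To make the encoder/decoder picture concrete, here is a minimal NumPy sketch of a shallow encoder (a plain embedding-matrix lookup, as in DeepWalk/node2vec) paired with a dot-product decoder. The graph size, embedding dimension, and node indices are arbitrary toy values:

```python
import numpy as np

num_nodes, d = 5, 8          # toy graph size and embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Shallow encoder: an embedding matrix Z with one learnable column per node.
Z = rng.normal(scale=0.1, size=(d, num_nodes))

def encode(u):
    """ENC(u): simply look up column u of the embedding matrix."""
    return Z[:, u]

def decode(z_u, z_v):
    """DEC: dot-product similarity between two embeddings."""
    return float(z_u @ z_v)

# similarity(u, v) ≈ z_v^T z_u
print(decode(encode(0), encode(3)))
```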
This is an unsupervised/self-supervised way of learning node embeddings.
- We do not utilize node labels or node features.
- The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by DEC) is preserved.
These embeddings are task independent
- They are not trained for a specific task but can be used for any task.
Random Walk Approaches for Node Embeddings#
zu: the embedding of node u (what we aim to find).
P(v∣zu): the (predicted) probability of visiting node v on random walks starting from node u.
Non-linear functions (e.g., softmax or sigmoid) are used to produce these predicted probabilities.
Estimate probability of visiting node v on a random walk starting from node u using some random walk strategy R.
cs224w - lecture 3. slide
Optimize embeddings to encode these random walk statistics:
cs224w - lecture 3. slide
Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.
Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks.
NR(u) … neighbourhood of u obtained by some random walk strategy R
G=(V,E)
$\max_f \sum_{u \in V} \log P(N_R(u) \mid \mathbf{z}_u)$ (maximum likelihood objective)
- Run short fixed-length random walks starting from each node u in the graph using some random walk strategy R.
- For each node u collect NR(u), the multiset of nodes visited on random walks starting from u (nodes can appear multiple times).
- Optimize embeddings according to: Given node u, predict its neighbors NR(u).
Equivalently,
$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log P(v \mid \mathbf{z}_u)$
Intuition: Optimize embeddings zu to maximize the likelihood of random walk co-occurrences.
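A small sketch of the walk-collection step described above: run short, fixed-length, unbiased random walks from every node and collect the multiset NR(u) of visited nodes. The toy adjacency list, walk length, and number of walks per node are illustrative choices:

```python
import random
from collections import defaultdict

# Toy undirected graph as an adjacency list (illustrative).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def random_walk(start, length):
    """Unbiased, fixed-length random walk (DeepWalk-style strategy R)."""
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(adj[walk[-1]]))
    return walk

def collect_neighborhoods(walks_per_node=10, walk_length=5):
    """N_R(u): multiset of nodes visited on walks started at u."""
    N_R = defaultdict(list)
    for u in adj:
        for _ in range(walks_per_node):
            N_R[u].extend(random_walk(u, walk_length)[1:])  # exclude the start node itself
    return N_R

N_R = collect_neighborhoods()
print(N_R[0])  # multiset of nodes co-occurring with node 0 on its walks
```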
Parameterize the probability using softmax, because we want node v to be the most similar to node u (out of all nodes n).
$P(v \mid \mathbf{z}_u) = \dfrac{\exp(\mathbf{z}_u^\top \mathbf{z}_v)}{\sum_{n \in V} \exp(\mathbf{z}_u^\top \mathbf{z}_n)}$
But doing this naively is too expensive: the softmax normalization sums over all nodes, for every node, giving $O(|V|^2)$ complexity!
Solution: Negative sampling#
Why is the approximation valid?
Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using logistic regression (the sigmoid function) to distinguish the target node v from nodes ni sampled from the background distribution PV.
More at word2vec Explained, arXiv
$\log\!\left(\dfrac{\exp(\mathbf{z}_u^\top \mathbf{z}_v)}{\sum_{n \in V} \exp(\mathbf{z}_u^\top \mathbf{z}_n)}\right) \approx \log\!\left(\sigma(\mathbf{z}_u^\top \mathbf{z}_v)\right) - \sum_{i=1}^{k} \log\!\left(\sigma(\mathbf{z}_u^\top \mathbf{z}_{n_i})\right), \quad n_i \sim P_V$
w.r.t.: with respect to
Instead of normalizing w.r.t. all nodes, just normalize against k random “negative samples” ni.
Two considerations for k (# negative samples):
- Higher k gives more robust estimates
- Higher k corresponds to higher bias on negative events
In practice k = 5-20.
Can the negative sample be any node, or only nodes not on the walk?
People often use any node (for efficiency). However, the most “correct” way is to use nodes not on the walk.
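A sketch of the negative-sampling loss for a single co-occurring pair (u, v), following the sign convention of the formula above and assuming the k negatives are drawn from a degree-proportional background distribution P_V (the embeddings, degrees, and k are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d, k = 5, 8, 3                        # toy sizes; k = number of negative samples
Z = rng.normal(scale=0.1, size=(d, num_nodes))   # shallow node embeddings
degrees = np.array([2, 2, 3, 2, 1])              # node degrees of a toy graph
P_V = degrees / degrees.sum()                    # background (noise) distribution, here ∝ degree

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u, v):
    """Negative of the approximated log-probability:
    -( log σ(z_u·z_v) - Σ_i log σ(z_u·z_{n_i}) ),  n_i ~ P_V."""
    negatives = rng.choice(num_nodes, size=k, p=P_V)
    pos = np.log(sigmoid(Z[:, u] @ Z[:, v]))
    neg = sum(np.log(sigmoid(Z[:, u] @ Z[:, n])) for n in negatives)
    return -(pos - neg)

print(neg_sampling_loss(0, 2))
```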
Summary#
- Run short fixed-length random walks starting from each node on the graph
- For each node u collect NR(u), the multiset of nodes visited on random walks starting from u.
- Optimize embeddings using Stochastic Gradient Descent
What strategies should we use to run these random walks?
- Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014); the issue is that such a notion of similarity is too constrained.
node2vec#
Grover and Leskovec. node2vec: Scalable Feature Learning for Networks. KDD 2016. Paper_PDF
Goal: Embed nodes with similar network neighborhoods close in the feature space.
Key observation: Flexible notion of network neighborhood NR(u) of node u leads to rich node embeddings.
Develop biased 2nd order random walk R to generate network neighborhood NR(u) of node u.
Idea: use flexible, biased random walks that can trade off between local(BFS) and global(DFS) views of the network.
Biased fixed-length random walk R that, given a node u, generates the neighborhood NR(u)
Two parameters:
- Return parameter p: likelihood of returning back to the previous node (unnormalized transition weight 1/p)
- In-out parameter q: moving outwards (DFS) vs. inwards (BFS); moving farther away from the previous node gets unnormalized weight 1/q
Intuitively, q is the “ratio” of BFS vs. DFS
Process:
- Compute random walk probabilities
- Simulate r random walks of length l starting from each node u.
- Optimize the node2vec objective using Stochastic Gradient Descent
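A minimal sketch of the biased second-order walk, using the return parameter p and in-out parameter q described above (unnormalized transition weights: 1/p to return to the previous node, 1 to a node at the same distance from it, 1/q to move farther out). The toy graph and parameter values are illustrative:

```python
import random

# Toy undirected graph as an adjacency list (illustrative).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def node2vec_walk(start, length, p=1.0, q=1.0):
    """Biased 2nd-order walk: weight 1/p to return, 1 to stay at the same distance, 1/q to move outward."""
    walk = [start]
    while len(walk) < length + 1:
        cur = walk[-1]
        if len(walk) == 1:                      # first step has no previous node, so it is unbiased
            walk.append(random.choice(adj[cur]))
            continue
        prev = walk[-2]
        weights = []
        for x in adj[cur]:
            if x == prev:
                weights.append(1.0 / p)         # return to the previous node
            elif x in adj[prev]:
                weights.append(1.0)             # same distance from prev (BFS-like)
            else:
                weights.append(1.0 / q)         # move farther away (DFS-like)
        walk.append(random.choices(adj[cur], weights=weights, k=1)[0])
    return walk

# Low q encourages outward (DFS-like) exploration; high q keeps the walk local (BFS-like).
print(node2vec_walk(0, length=5, p=1.0, q=0.5))
```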
Other Random Walk ideas#
- Different kinds of biased random walks:
- Based on node attributes (Dong et al., 2017)
- Based on learned weights (Abu-El-Haija et al., 2017)
- Alternative optimization schemes:
- Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al. 2015).
- Network preprocessing techniques:
- Run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017’s struc2vec, Chen et al. 2016’s HARP).
Embedding Entire Graphs#
Goal: Want to embed a subgraph or an entire graph G. Graph embedding: zG.
Tasks: Classifying toxic vs. non-toxic molecules, Identifying anomalous graphs
- Approach 1: Simple (but effective)
- Run a standard node embedding technique on the (sub)graph G.
- Then just sum (or average) the node embeddings in the (sub)graph G (see the sketch after this list).
- Used, e.g., by Duvenaud et al., 2015 (https://arxiv.org/abs/1509.09292) to classify molecules by their graph structure.
- Approach 2: Super-node. Introduce a virtual “super-node” that represents the (sub)graph, embed it with a standard technique, and use its embedding as zG.
- Approach 3: Hierarchical embeddings. Hierarchically cluster nodes and sum/average the node embeddings cluster by cluster.
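A sketch of Approach 1 referenced above: take the node embeddings of the (sub)graph (random placeholders here) and sum or average them to obtain the graph embedding zG:

```python
import numpy as np

rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(5, 8))   # toy z_i for 5 nodes, d = 8

z_G_sum = node_embeddings.sum(axis=0)       # z_G as the sum of node embeddings
z_G_avg = node_embeddings.mean(axis=0)      # or the average
print(z_G_sum.shape, z_G_avg.shape)
```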
Node Embeddings and Matrix Factorization#
Embedding matrix Z has shape (embedding dimension, number of nodes); column u of Z is the embedding of node u.
Objective: maximize zv⊺zu for node pairs (u,v) that are similar
$\mathbf{z}_v^\top \mathbf{z}_u = A_{u,v}$, which is the (u, v) entry of the graph adjacency matrix A.
Exact factorization A=Z⊺Z is generally not possible, since the embedding dimension d is much smaller than the number of nodes.
However, we can learn Z approximately!
Objective: $\min_{Z} \lVert A - Z^\top Z \rVert_2$
- We optimize Z such that it minimizes the L2 norm (Frobenius norm) of A−Z⊺Z.
- Note: above we used a softmax-based loss instead of the L2 norm, but the goal of approximating A with Z⊺Z is the same.
Conclusion: Inner product decoder with node similarity defined by edge connectivity is equivalent to matrix factorization of A.
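A small gradient-descent sketch of the factorization objective above on a toy adjacency matrix; the embedding dimension, learning rate, and number of steps are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy adjacency matrix of a 5-node undirected graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

d, lr = 3, 0.01
Z = rng.normal(scale=0.1, size=(d, A.shape[0]))  # embedding matrix: (d, #nodes)

for step in range(2000):
    R = A - Z.T @ Z                  # residual A - Z^T Z
    grad = -2 * Z @ (R + R.T)        # gradient of ||A - Z^T Z||_F^2 w.r.t. Z
    Z -= lr * grad
    if step % 500 == 0:
        print(step, np.linalg.norm(R))   # Frobenius norm of the residual shrinks
```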
Random Walk-based Similarity#
DeepWalk and node2vec have a more complex node similarity definition based on random walks.
DeepWalk is equivalent to matrix factorization of the following, more complex matrix expression:
$\log\!\left(\operatorname{vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}\left(D^{-1}A\right)^{r}\right)D^{-1}\right) - \log b$
vol(G)=∑i∑jAi,j: Volume of graph
T=∣NR(u)∣: context window size
r: Power of normalized adjacency matrix
Du,u=deg(u): Diagonal matrix
b: Number of negative samples
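A sketch that builds this matrix for a toy graph. T and b are arbitrary choices here, and the elementwise log is clipped below to avoid log(0) for node pairs that never co-occur (a practical workaround, not part of the expression above):

```python
import numpy as np

# Toy adjacency matrix of a 5-node undirected graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

T, b = 5, 1                              # context window size and #negative samples (illustrative)
vol = A.sum()                            # vol(G) = sum_i sum_j A_ij
D_inv = np.diag(1.0 / A.sum(axis=1))     # inverse of the diagonal degree matrix
P = D_inv @ A                            # normalized adjacency (random-walk transition matrix)

walk_sum = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / T
M = vol * walk_sum @ D_inv
# Elementwise log, clipped below so zero entries do not produce -inf.
M_log = np.log(np.maximum(M, 1e-12)) - np.log(b)
print(M_log.round(2))
```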
How to Use Embeddings#
Clustering/community detection: Cluster points zi
Node classification: Predict label of node i based on zi
Link prediction: Predict edge (i, j) based on (zi, zj)
- where we can concatenate, average, take a product, or take a difference of the embeddings (see the sketch after this list):
- Concatenate: f(zi,zj)=g([zi,zj])
- Hadamard: f(zi,zj)=g(zi∗zj) (per coordinate product)
- Sum/Avg: f(zi,zj)=g(zi+zj)
- Distance: f(zi,zj)=g(∥zi−zj∥2)
Graph classification: Graph embedding zG via aggregating node embeddings or anonymous random walks. Predict label based on graph embedding zG.
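For link prediction, here is a quick sketch of the four edge-feature operators listed above, applied to two toy node embeddings (g, the downstream classifier, is omitted):

```python
import numpy as np

z_i = np.array([0.1, -0.4, 0.3])   # toy embeddings of nodes i and j
z_j = np.array([0.2, 0.5, -0.1])

edge_features = {
    "concatenate": np.concatenate([z_i, z_j]),
    "hadamard":    z_i * z_j,                           # per-coordinate product
    "average":     (z_i + z_j) / 2,                     # sum works the same way
    "distance":    np.array([np.linalg.norm(z_i - z_j)]),
}
for name, feat in edge_features.items():
    print(name, feat)
```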
Limitations#
- Cannot obtain embeddings for nodes not in the training set (e.g., a new user who just joined a social network): the embedding matrix has no entry for unseen nodes, so embeddings must be recomputed from scratch.
- DeepWalk and node2vec do not capture structural similarity: two nodes far apart in the graph can play structurally similar roles, but their random walks never overlap, so their embeddings end up dissimilar.
- Cannot utilize node, edge and graph features.
Solution to these limitations: Deep Representation Learning and Graph Neural Networks