This post is based on Lecture 11 of CS224W.
Reasoning over Knowledge Graphs#
Goal:
- How to perform multi-hop reasoning over KGs?
Reasoning over Knowledge Graphs
- Answering multi-hop queries
- Path Queries
- Conjunctive Queries
- Query2Box
Predictive Queries on KG#
Can we do multi-hop reasoning, i.e., answer complex queries on an incomplete, massive KG?
KG completion: Is link (h,r,t) in the KG?
One-hop Queries#
One-hop query: Is t an answer to query (h,r)?
- For example:
- Q: What side effects are caused by drug Fulvestrant?
- A: [Headache, Brain Bleeding, Kidney Infection, Short of Breath]
Path Queries#
Generalize one-hop queries to path queries by adding more relations on the path.
An n-hop path query $q$ can be represented by $q = (v_a, (r_1, \dots, r_n))$
- $v_a$ is an “anchor” entity,
- Let the answers to $q$ in graph $G$ be denoted by $[\![q]\!]_G$.
Query Plan of q:
Query plan of path queries is a chain.
- For example:
- Q: “What proteins are associated with adverse events caused by Fulvestrant?”
- A: [BIRC2, CASP8, PIM1]
$v_a$ is e:Fulvestrant
$(r_1, r_2)$ is (r:Causes, r:Assoc)
Query: (e:Fulvestrant, (r:Causes, r:Assoc))
- Answering queries seems easy: Just traverse the graph.
- But KGs are incomplete and unknown:
- For example, much of the relevant biomedical knowledge is missing
- Enumerating all facts takes non-trivial time and cost, so we cannot hope that KGs will ever be fully complete
- Due to KG incompleteness, one is not able to identify all the answer entities
Compared with the graph above, the Causes relation between Fulvestrant and Short of Breath is missing. Because of this, it is hard to reach the protein BIRC2 through the expected chain of relations.
Can we first do KG completion and then traverse the completed (probabilistic) KG?
- No! The “completed” KG is a dense graph!
- Most (h,r,t) triples (edge on KG) will have some non-zero probability.
- Time complexity of traversing a dense KG is exponential in the path length $L$: $O(d_{\max}^L)$
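As a quick sanity check on that bound (numbers chosen arbitrarily for illustration): with a maximum out-degree of $d_{\max} = 100$, a 3-hop query on a dense KG already requires exploring on the order of $100^3 = 10^6$ paths.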
Answering Predictive Queries on Knowledge Graphs#
Predictive Queries#
We need a way to answer path-based queries over an incomplete knowledge graph.
We want our approach to implicitly impute and account for the incomplete KG.
- Want to be able to answer arbitrary queries while implicitly imputing the missing information
- Generalization of the link prediction task
Given entity embeddings, how do we answer an arbitrary query?
- Path queries: Using a generalization of TransE
- Conjunctive queries: Using Query2Box
- And-Or Queries: Using Query2Box and query rewriting
(We will assume entity embeddings and relation embeddings are given)
How do we train the embeddings?
- The process of learning entity and relation embeddings that allow us to embed a query (covered at the end of this post).
General Idea#
Map queries into embedding space. Learn to reason in that space.
- Embed query into a single point in the Euclidean space: answer nodes are close to the query.
- Query2Box: Embed query into a hyper-rectangle (box) in the Euclidean space: answer nodes are enclosed in the box.
Guu, et al., Traversing knowledge graphs in vector space, EMNLP 2015.
Key idea: Embed queries!
- Generalize TransE to multi-hop reasoning.
Recap: TransE translates $\mathbf{h}$ to $\mathbf{t}$ using $\mathbf{r}$, with score function $f_r(h, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$.
Another way to interpret this is that:
Query embedding: $\mathbf{q} = \mathbf{h} + \mathbf{r}$
Goal: the query embedding $\mathbf{q}$ is close to the answer embedding $\mathbf{t}$
$$f_q(t) = -\|\mathbf{q} - \mathbf{t}\|$$
Given a path query $q = (v_a, (r_1, \dots, r_n))$,
$$\mathbf{q} = \mathbf{v}_a + \mathbf{r}_1 + \cdots + \mathbf{r}_n$$
The embedding process only involves vector addition, independent of # entities in the KG!
- Q: “What proteins are associated with adverse events caused by Fulvestrant?”
- A: [BIRC2, CASP8, PIM1]
Query: (e:Fulvestrant, (r:Causes, r:Assoc))
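A minimal PyTorch sketch of this idea (the table sizes and the entity/relation ids below are placeholders, not the lecture's actual data): the query embedding is just the anchor embedding plus the relation embeddings along the path, and every entity is scored by its negative distance to the query.

```python
import torch

d = 16  # embedding dimensionality (arbitrary for this sketch)
num_entities, num_relations = 100, 10

# Toy embedding tables; in practice these are learned with TransE.
entity_emb = torch.nn.Embedding(num_entities, d)
relation_emb = torch.nn.Embedding(num_relations, d)

def embed_path_query(anchor_id, relation_ids):
    """q = v_a + r_1 + ... + r_n -- pure vector addition."""
    q = entity_emb(torch.tensor(anchor_id))
    for rid in relation_ids:
        q = q + relation_emb(torch.tensor(rid))
    return q

def score_all_entities(q):
    """f_q(t) = -||q - t||: higher score = more likely an answer."""
    return -torch.norm(q - entity_emb.weight, dim=-1)

# e.g., (e:Fulvestrant, (r:Causes, r:Assoc)) with hypothetical ids 0, 1, 2
q = embed_path_query(anchor_id=0, relation_ids=[1, 2])
top_answers = score_all_entities(q).topk(k=3).indices
```

Note how the cost of building $\mathbf{q}$ depends only on the path length, not on the number of entities, which is exactly the point made above.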
TransE can handle path queries because it can model compositional relations.
TransR / DistMult / ComplEx have difficulty handling path queries.
TransR can compose relations in principle, but each relation acts through its own projection matrix rather than directly on the entity embedding space, so relation embeddings cannot simply be added along a path; this makes path queries hard to handle, which is why TransR is not used here.
Conjunctive Queries#
Q: “What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?”
A: [Fulvestrant]
((e: ESR2, (r: Assoc, r: TreatedBy)), (e: Short of Breath, (r: CausedBy)))
In the graph, the Assoc edge out of ESR2 is missing, so traversal cannot find Fulvestrant.
How can we use embeddings to implicitly impute the missing (ESR2, Assoc, Breast Cancer)?
Intuition: ESR2 interacts with both BRCA1 and ESR1. Both proteins are associated with breast cancer.
How can we answer more complex queries with logical conjunction operation?
- Each intermediate node represents a set of entities, how do we represent it?
- How do we define the intersection operation in the latent space?
Query2Box#
Ren et al., Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings, ICLR 2020
Embed queries with hyper-rectangles (boxes)
- $q = (\text{Center}(q), \text{Offset}(q))$
Boxes are a powerful abstraction, as we can project the center and control the offset to model the set of entities enclosed in the box
$d$: embedding dimensionality
$|V|$: # entities
$|R|$: # relations
Things to figure out:
- Entity embeddings (# params: $d|V|$):
- Entities are seen as zero-volume boxes
- Relation embeddings (# params: $2d|R|$):
- Each relation takes a box and produces a new box
- Intersection operator f:
- New operator, inputs are boxes and output is a box
- Intuitively models intersection of boxes
Q: “What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?”
Projection Operator P#
Intuition:
- Take the current box as input and use the relation embedding to project and expand the box!
$\mathcal{P}: \text{Box} \times \text{Relation} \to \text{Box}$
$$
\begin{aligned}
q' &= (\text{Cen}(q'), \text{Off}(q')) \\
\text{Cen}(q') &= \text{Cen}(q) + \text{Cen}(r) \\
\text{Off}(q') &= \text{Off}(q) + \text{Off}(r)
\end{aligned}
$$
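A hedged sketch of this operator, assuming a box is stored as a (center, offset) pair of tensors and each relation likewise carries a learned center and offset (the function name `project` is my own label):

```python
import torch

def project(box_center: torch.Tensor, box_offset: torch.Tensor,
            rel_center: torch.Tensor, rel_offset: torch.Tensor):
    """P: Box x Relation -> Box, per the equations above.
    The center is translated by the relation's center, and the
    offset grows by the relation's offset."""
    return box_center + rel_center, box_offset + rel_offset
```

Note that offsets only ever grow under projection; shrinking happens only through the intersection operator below.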
Geometric Intersection Operator J#
Take multiple boxes as input and produce the intersection box
Intuition:
- The center of the new box should be “close” to the centers of the input boxes
- The offset (box size) should shrink (since the size of the intersected set is smaller than the sizes of all the input sets)
$\mathcal{J}: \text{Box} \times \cdots \times \text{Box} \to \text{Box}$
$$
\text{Cen}(q_{\text{inter}}) = \sum_i w_i \odot \text{Cen}(q_i), \qquad
w_i = \frac{\exp(f_{\text{cen}}(\text{Cen}(q_i)))}{\sum_j \exp(f_{\text{cen}}(\text{Cen}(q_j)))}
$$
$\text{Cen}(q_i) \in \mathbb{R}^d$
$w_i \in \mathbb{R}^d$: trainable weights calculated by a neural network $f_{\text{cen}}$; they represent a “self-attention” score for the center of each input $\text{Cen}(q_i)$.
Intuition: The center should be in the red region!
Implementation: The center is a weighted sum of the input box centers
$$
\text{Off}(q_{\text{inter}}) = \min(\text{Off}(q_1), \dots, \text{Off}(q_n)) \odot \sigma(f_{\text{off}}(\text{Off}(q_1), \dots, \text{Off}(q_n)))
$$
$\min(\text{Off}(q_1), \dots, \text{Off}(q_n))$: guarantees shrinking
$f_{\text{off}}$: a neural network that extracts a representation of the input boxes to increase expressiveness
$\sigma$: the sigmoid function, $\sigma(x) = \frac{1}{1 + \exp(-x)}$, which maps values into $(0, 1)$
Intuition: The offset should be smaller than the offset of the input box
Implementation: we first take the elementwise minimum of the input box offsets, and then make the model more expressive by introducing a function $f_{\text{off}}$ that extracts a representation of the input boxes; gating its output with a sigmoid guarantees shrinking.
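Below is a sketch of the intersection operator, assuming the input boxes arrive as stacked `(n_boxes, d)` tensors. The single linear layers standing in for $f_{\text{cen}}$ and $f_{\text{off}}$, and the mean pooling before the sigmoid gate, are illustrative choices of mine, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

d = 16  # embedding dimensionality (illustrative)
f_cen = nn.Linear(d, d)  # stand-in for the center network f_cen
f_off = nn.Linear(d, d)  # stand-in for the offset network f_off

def intersect(centers: torch.Tensor, offsets: torch.Tensor):
    """J: Box x ... x Box -> Box.  centers, offsets: (n_boxes, d)."""
    # Per-dimension self-attention weights over the input centers.
    w = torch.softmax(f_cen(centers), dim=0)          # (n_boxes, d)
    new_center = (w * centers).sum(dim=0)             # weighted sum of centers

    # Elementwise min guarantees shrinking; the sigmoid gate in (0, 1)
    # lets the model shrink the box even further.
    min_off = offsets.min(dim=0).values
    gate = torch.sigmoid(f_off(offsets).mean(dim=0))  # pooled representation
    return new_center, min_off * gate
```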
Entity-to-Box Distance#
How do we define the score function $f_q(v)$ (negative distance)?
($f_q(v)$ captures the inverse distance of a node $v$ as an answer to $q$)
Given a query box $q$ and an entity embedding (box) $\mathbf{v}$,
$$
\begin{aligned}
q_{\max} &= \text{Cen}(q) + \text{Off}(q) \in \mathbb{R}^d \\
q_{\min} &= \text{Cen}(q) - \text{Off}(q) \in \mathbb{R}^d \\
d_{\text{out}}(q, v) &= \|\max(\mathbf{v} - q_{\max}, 0) + \max(q_{\min} - \mathbf{v}, 0)\|_1 \\
d_{\text{in}}(q, v) &= \|\text{Cen}(q) - \min(q_{\max}, \max(q_{\min}, \mathbf{v}))\|_1 \\
d_{\text{box}}(q, v) &= d_{\text{out}}(q, v) + \alpha \cdot d_{\text{in}}(q, v) \\
f_q(v) &= -d_{\text{box}}(q, v)
\end{aligned}
$$
where $0 < \alpha < 1$.
$q \in \mathbb{R}^{2d}$: a query box
$\mathbf{v} \in \mathbb{R}^d$: an entity vector
$d$: the embedding dimensionality
Intuition: if the point is enclosed in the box, the distance should be downweighted.
- Examples, with $\alpha = 0.5$:
- $1 + 0.5 \cdot 3 = 2.5$; $v$ outside the box $\to f_q(v) = -2.5$
- $0 + 0.5 \cdot 3 = 1.5$; $v$ on the boundary of the box $\to f_q(v) = -1.5$
- $0 + 0.5 \cdot 1 = 0.5$; $v$ inside the box $\to f_q(v) = -0.5$
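A sketch of the scoring function following the equations above (the name `box_score` and the default $\alpha = 0.5$ are mine; $\alpha$ is a hyperparameter). The final lines check it against the first worked example:

```python
import torch

def box_score(center, offset, v, alpha=0.5):
    """f_q(v) = -d_box(q, v), where d_box = d_out + alpha * d_in."""
    q_max, q_min = center + offset, center - offset
    # d_out: L1 distance from v to the nearest point of the box (0 inside).
    d_out = (torch.clamp(v - q_max, min=0) + torch.clamp(q_min - v, min=0)).sum()
    # d_in: L1 distance from the center to v, clipped to the box boundary.
    clipped = torch.minimum(q_max, torch.maximum(q_min, v))
    d_in = (center - clipped).abs().sum()
    return -(d_out + alpha * d_in)

# 1-D check against the first example: box [-3, 3], v = 4 lies outside
# -> d_out = 1, d_in = 3, f_q(v) = -(1 + 0.5 * 3) = -2.5
print(box_score(torch.tensor([0.]), torch.tensor([3.]), torch.tensor([4.])))
```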
Union Operation#
Can we embed complex queries with union?
(e.g.: “What drug can treat breast cancer or lung cancer?”)
Conjunctive queries + disjunction is called Existential Positive First-Order (EPFO) queries.
We’ll refer to them as AND-OR queries.
Can we also design a disjunction operator and embed AND-OR queries in low-dimensional vector space?
- No! Intuition: Allowing union over arbitrary queries requires high-dimensional embeddings!
Given 3 queries $q_1, q_2, q_3$ with answer sets:
- $[\![q_1]\!] = \{v_1\}$, $[\![q_2]\!] = \{v_2\}$, $[\![q_3]\!] = \{v_3\}$
- If we allow the union operation, can we embed them in a two-dimensional plane?
Given 4 queries $q_1, q_2, q_3, q_4$ with answer sets:
- We cannot design a box embedding for $q_2 \vee q_4$ such that only $v_2$ and $v_4$ are in the box but $v_3$ is not.
Conclusion: given any $M$ conjunctive queries $q_1, \dots, q_M$ with non-overlapping answers, we need dimensionality $\Theta(M)$ to handle all OR queries.
- For a real-world KG such as FB15k, we find $M \geq 13{,}365$, where $|V| = 14{,}951$.
- Remember, this is for arbitrary OR queries.
Can’t we still handle them?
Key idea: take all unions out and only do union at the last step!
Any AND-OR query can be transformed into equivalent DNF, i.e., disjunction of conjunctive queries.
Given any AND-OR query q,
$q = q_1 \vee q_2 \vee \cdots \vee q_m$, where $q_i$ is a conjunctive query.
Now we can first embed each qi and then “aggregate” at the last step!
The distance between an entity embedding and a DNF $q = q_1 \vee q_2 \vee \cdots \vee q_m$ is defined as:
$$d_{\text{box}}(q, v) = \min(d_{\text{box}}(q_1, v), \dots, d_{\text{box}}(q_m, v))$$
Intuition:
- As long as $v$ is an answer to one conjunctive query $q_i$, then $v$ should be an answer to $q$
- As long as $v$ is close to one conjunctive query $q_i$, then $v$ should be close to $q$ in the embedding space
The process of embedding any AND-OR query q:
- Transform $q$ into its equivalent DNF $q_1 \vee \cdots \vee q_m$
- Embed $q_1$ to $q_m$
- Calculate the (box) distances $d_{\text{box}}(q_i, v)$
- Take the minimum of all the distances, as in the sketch below
- The final score is $f_q(v) = -d_{\text{box}}(q, v)$
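Putting the steps together, a sketch that reuses `box_score` from the entity-to-box distance sketch above; each branch box is a hypothetical (center, offset) pair produced by embedding one conjunctive query (the name `dnf_score` is mine):

```python
import torch

def dnf_score(branch_boxes, v, alpha=0.5):
    """f_q(v) for q = q_1 v ... v q_m: take the minimum branch distance.
    branch_boxes: list of (center, offset) pairs, one per conjunctive query."""
    dists = torch.stack([-box_score(c, o, v, alpha) for c, o in branch_boxes])
    return -dists.min()  # f_q(v) = -min_i d_box(q_i, v)
```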
How to Train Query2Box#
Overview and Intuition (similar to KG completion):
Given a query embedding $\mathbf{q}$, maximize the score $f_q(v)$ for answers $v \in [\![q]\!]$ and minimize the score $f_q(v')$ for negative answers $v' \notin [\![q]\!]$
Training:
- Sample a query $q$ from the training graph $G_{\text{train}}$, an answer $v \in [\![q]\!]_{G_{\text{train}}}$, and a non-answer $v' \notin [\![q]\!]_{G_{\text{train}}}$
- Embed the query $q$.
- Use the current operators to compute the query embedding.
- Calculate the scores $f_q(v)$ and $f_q(v')$.
- Optimize embeddings and operators to minimize the loss $\mathcal{L}$ (maximize $f_q(v)$ while minimizing $f_q(v')$):
$$\mathcal{L} = -\log \sigma(f_q(v)) - \log(1 - \sigma(f_q(v')))$$
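A sketch of this loss (the function name is mine), using the identity $\log(1 - \sigma(x)) = \log \sigma(-x)$ for numerical stability:

```python
import torch
import torch.nn.functional as F

def query2box_loss(score_pos: torch.Tensor, score_neg: torch.Tensor):
    """L = -log sigma(f_q(v)) - log(1 - sigma(f_q(v'))).
    Uses log(1 - sigma(x)) = log sigma(-x) to stay numerically stable."""
    return -F.logsigmoid(score_pos) - F.logsigmoid(-score_neg)

# One training step, schematically: scores come from box_score on the
# embedded query; backprop updates the entity/relation embeddings and
# the projection/intersection operator parameters.
# loss = query2box_loss(box_score(c, o, v_pos), box_score(c, o, v_neg))
# loss.backward(); optimizer.step()
```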