From Tokens to Knowledge: How BERT Encodes Semantic Structure
I’ve been investigating how language models extract structured knowledge from text. The mechanism is more sophisticated than pattern matching. BERT demonstrates that transformer architectures can learn geometric representations of semantic relationships without explicit knowledge graph supervision.
The Encoding Problem
Computers require numerical representations. Text must be converted to vectors. BERT uses WordPiece tokenization to segment text into subword units, then maps these to 768-dimensional embeddings. But the critical innovation isn’t tokenization. It’s what happens to these embeddings during training.
Through masked language modeling, BERT learns to position word vectors such that semantic relationships become geometric transformations. The model discovers that certain vector operations correspond to predicates in semantic triples.
Bidirectional Architecture and Relational Learning
BERT’s bidirectional attention allows each token to attend to all other tokens simultaneously. This creates a complete graph of token interactions within each layer. The attention weights learn to encode syntactic and semantic dependencies.
Example: When processing “Paris is the capital of France,” BERT’s attention mechanism assigns high weights between “Paris” and “capital” and between “capital” and “France.” The model learns this pattern represents a capital_of(Paris, France)
relation.
The multi-head attention specifically captures different relationship types. One head might specialize in syntactic dependencies while another captures entity relationships. This distributed representation allows the model to encode multiple semantic interpretations simultaneously.
Vector Geometry as Knowledge Representation
BERT’s embeddings organize into a geometry where relationships become vector transformations. Research has shown that:
vec(Paris) - vec(France) ≈ vec(Berlin) - vec(Germany)
This isn’t coincidental. The model has learned that the “capital-of” relationship corresponds to a consistent vector offset. Similar patterns emerge for other relations: “CEO-of,” “founded-by,” “located-in.”
These transformations function as implicit predicates. When BERT processes text containing these relationships, it positions entities in embedding space such that the appropriate transformations connect them. The model has learned a continuous approximation of discrete symbolic relationships.
From Implicit to Explicit Knowledge
BERT stores semantic triples (subject, predicate, object)
as learned parameters rather than explicit database entries. The subject and object are encoded as vectors. The predicate exists as a transformation between them.
This has concrete implications. When fine-tuned for question answering, BERT performs vector arithmetic to retrieve facts. The query “What is the capital of France?” triggers the model to find vectors where the “capital-of” transformation from France leads to a valid entity.
Recent work demonstrates that these implicit knowledge representations can be extracted. Techniques like knowledge probing show that specific neurons activate for particular relations. The model has developed an internal ontology without being explicitly programmed with one.
Computational Significance
BERT proves that neural networks can learn structured knowledge representations from unstructured text. The model discovers semantic triples, encodes them geometrically, and retrieves them through vector operations. This bridges symbolic and connectionist AI approaches through learned geometric structure.
The architecture suggests a path toward systems that don’t merely process language but extract and manipulate the knowledge that language encodes. We’re observing emergence of formal semantics from distributional statistics.
I’m grateful to Casey Keith, Sir Tim Berners-Lee, and Larry Page for their pioneering work that inspired this investigation into knowledge representation and its potential applications in computational biology.
0 Comments