
You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.
WeThe following is a modified excerpt from The Structure and Interpretation of Computer Programs Chapter 1: Building Abstractions with Procedures are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of
rules called a programparameters called a model. Peoplecreate programsrecover models to direct processes. In effect, we conjure the spirits of the computer with our spells.
A computational process is indeed much like a sorcerer’s idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer’s spells. They are carefully composed fromsymbolicnumerical expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform.
I. Elements of Networks
Although separated by over 2000 years, the programmers of Silicon Valley face a daunting task quite similar to the one encountered by the mathematicians of Ancient Greece. That is, to continue contributing towards the civilizational progress of augmenting and amplifying human intelligence, they must climb back down from their current summit and backtrack to the beginner’s mind they once had.
Not different from learning another mathematical or programming language, they must transition from their finitely discrete structures and deterministic procedures tooling they have grown acustomed to and make the transition to the infintely continuous structures and stochastic procedures. Back then, ancient greek mathematicians were only comfortable with the finiteness of natural numbers like , , and , and grappled with the infinite of the real numbers such as , , and . Similarly, the programmers of today are being asked to transition from programming algorithms of sets, lists, associations, trees, and graphs to the distributions of scalars, vectors, matrices, tensors, and neural networks.
More coloquially, programmers interested in the deep learning approach to artificial intelligence must make the transition from software 1.0software 1.0 to software 2.0software 2.0See (Karpathy 2017), a distinction used to differentiate the classical act of programming software line by line, and the newer approach of programming software by specifying a dataset, a neural net architecture with a goal, and searching the space of programs with compute. How to exactly program with this new approach will take the remainder of the book to explain.
While software 2.0 has increased the intelligence and autonomy of our devices throughout the past decade
— to name a few, language translation with Google Translate, speech recognition with Apple’s Siri, vision with Tesla Autopilot —
at the end of 2022 ChatGPT was released to the world which if a singular point had to be defined, was the true beginning
of software 3.0software 3.0See (Andrej Karpathy 2025),
programming at the highest abstraction level possible with none other than English.
What may be surprising to realize at first glance is that artificial intelligence like ChatGPT is “just” another computer program.
However, rather than being implemented in a language like C, Java, or Javascript, it’s implemented in one that goes by the name of PyTorch,
a software 2.0 programming language centered around the multidimensional array data structure known as torch.Tensor, humbly embedded within a Python package.
In this whirlwind tour dubbed The Structure and Interpretation of Tensor Programs (SITP), we will embark on a quest to build from scratch
our own deep neural network like ChatGPT by implementing nanochat
and our own deep learning framework like PyTorch by implementing teenygrad.
Whether you’re an eager high school student, an up coming college student, or a battle-tested industry programmer,
SITP has been meticulously designed so that the only prerequisite required is a basic familiarity with the art of programming, and high school algebra. Any additional experience is helpful, but not mandatory.
So with that all said, go on young hacker. Venture forth!
Table of Contents
- 0. From Symbolic Software 1.0 to Stochastic Software 2.0
- 0.1 From Psychology to Artificial Intelligence
- 0.2 Weizenbaum’s Turing Test Cheating with the Pattern Matching of
ELIZA - 0.3 Winograd’s Understanding with the Syntactic and Semantic Analysis of
SHRDLU - 0.4 Feigenbaum’s Narrow Expertise with the _ of
DENDRAL - 0.5 Lenat’s Scruffiness with the Ontology and Inference of
CYC - 0.6 Wittgenstein’s Bitter Lesson: From A Logical to Distributional Semantics
- Intermezzo One: The Language of Dependent Type Theory
- 1. Syntactic Sequence Learning with n-gram Models
- 1.2 From Certain to Uncertain Knowledge with Probabilities
- 1.3 Composing Probabilities with Sums, Products, and Conditionals
- 1.4 Random Variables and their Distributions, Expectations, and Variance
- 1.5 Analytical Probabilities from Data by Counting
- 1.6 From Evaluating to Recovering Probabilities with Statistics
- 1.7 Summary
- 1.8 Bibliographic Notes
- Intermezzo Two: The Language of Probability Theory
- 2. Semi-Automated Semantic Sequence Learning with Linear Models
- 2.1 From Representing Discrete Data to Dimensional Data
- 2.2 Predicting Scalars by Linearly Modeling Regression
- 2.3 Directly Fitting Linear Regression with Squared Error Loss
- 2.4 Predicting Categories by Linearly Modeling Classification
- 2.5 Iteratively Fitting Logistic Regression with Cross Entropy Loss
- 2.6 Empirical Risk Minimization is Maximum Likehood Estimation
- 2.7 Summary
- 2.8 Bibliographic Notes
- Intermezzo Three: The Language of Matrix Calculus
- 3. From the Array of IPL to the Multidimensional Array of APL
- 3.1 From Abstract to Numerical Linear Algebra: Real to Floating Arithmetic
- 3.2 Virtualizing Logical Shape to Physical Storage with Strides
- 3.3 Implementing Basic Linear Algebra
- 3.4 From Virtual to Physical Machines: Translation and Evaluation
- 3.5 Accelerating
AXPYandGEMVwith Rooflines - 3.6 Accelerating Communication with Hierarchies
- 3.7 Accelerating Computation with Pipelines
- 3.8 Summary
- 3.9 Bibliographic Notes
- Intermezzo Four: The Language of Rust and RISC-V
0. From Symbolic Software 1.0 to Stochastic Software 2.0
In which we retrace the development of the discretely symbolic approach to building artificially intelligent machines with common sense, and ultimately, it’s failure to describe a reality with too many parts to count, requiring us to transition from deterministically discrete software 1.0 to the stochastically continuous software 2.0.
0.1 From Psychology to Artificial Intelligence
The study of the mind is no different from that of mathematics and music — although forms can change throughout time, their substances remain eternal. In mathematics, representations for arithmetic have evolved from dashes on cave walls, to roman numerals we can still find in books, and finally to modern position-based hindu arabic numerals; In music, representations for pitch have evolved from neumes, relative staffs, and finally to the five-line staff; With the mind, representations for intelligence have evolved from stimulus-response to neural networks, which happened with the birth of the discpline known as artificial intelligence at the Dartmouth Workshop in 1956.
There, a group of researchers unsatisfied with the representations that the discipline of psychology were using to conduct the study of mind came together to discuss an alternate approach. Practical applications are usually a result of inquiry which is usually philosophical, esoteric, and with no immediately economic value. Namely, computers with automating mathematics, and language models with mechanizing the mind. Namely, to use the new way of thinking and conducting science endowed by computers which involves describing complex phenomena by simulating it with procedural epistemologyprocesses. In it’s proposalSee A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence (McCarthy, Minsky, Rochester, Shannon 1955):
We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.
It’s safe enough to say that 20 person months was overly optimistic on the effort required.
In chapter 0, we will walk through early attempts to build artificial intelligence using symbolic methods.
- Marvin Minsky (1969) and John McCarthy (1971) for defining the foundations of the field based on representation and reasoning;
- Allen Newell and Herbert Simon (1975) for symbolic models of problem solving and human cognition;
0.2 Cheating the Turing Test with Pattern Matching of ELIZA
Tip
Try interacting with both ELIZA and ChatGPT up above to viscerally feel the limitations encountered, overcome, and presented to the world in 2022. To provide a rough reference, the manufacturing of the Boeing 747 started and ended around the same time period.
The original paper that presents ELIZASee A Computer Program For the Study of Natural Language Communication Between Man and Machine (Weizenbaum 1966), begins and ends by explaining the inner workings of the chatbot at a high level:
The gross procedure of the program is quite simple. Input sentences are analyzed on the basis of decomposition rules which are triggered by key words appearing in the input text. Responses are generated by reassembly rules associated with selected decomposition rules.
So at a high level ELIZA interprets language by pattern matching over certain keyword strings and then producing a response based on predefined reassembly rule. Conveniently, Weizenbaum provides a more concrete example further in the paper:
Consider the sentence “I am very unhappy these days”. Suppose a foreigner with only a limited knowledge of English but with a very good ear heard that sentence spoken but understood only the first two words “I am”. Wishing to appear interested, perhaps even sympathetic, he may reply “How long have you been very unhappy these days?”
What he must have done is to apply a kind of template to the original sentence, one part of which matched the two words “I am” and the remainder isolated the words “very unhappy these days”. He must also have a reassembly kit specifically associated with that template, one that specifies that any sentence of the form “I am BLAH” can be transformed to “How long have you been BLAH”, independently of the meaning of BLAH.
A somewhat more complicated example is given by the sentence “It seems that you hate me”. Here the foreigner understands only the words “you” and “me”; i.e., he applies a template that decomposes the sentence into the four parts:
(1) It seems that (2) you (3) hate (4) me
of which only the second and fourth parts are understood. The reassembly rule might then be “What makes you think I hate you”; i.e., it might throw away the first component, translate the two known words (“you” to “I” and “me” to “you”) and tack on a stock phrase (“What makes you think”) to the front of the reconstituted sentence.
This concrete example shows the inspiration and guiding principles for both the data structures to and algorithms used to represent ELIZA’s knowledge and reasoning respectively.
First, not only does the foreigner representing language with symbolic strings, but the foreigner also has a limited, fractional understanding of such vocabulary. So, ELIZA will represent it’s knowledge with strings.
This is quite similar from the tradition in philosophy which represents word meaning by simply spelling the word with capital letters.
The illustrative example is provided by linguist Barbara Partee (1967) where the meaning of life is (represented with) the string "LIFE".
Second, the foreigner produces an answer by transforming a matched template to an incoming sentence with a pre-defined reassembly kit. So, ELIZA will reason by pattern matching over such strings with if-then transformation rules.
Let’s reproduce some of the example scripts presented in the paper’s appendix by re-implementing it’s pattern-match-over-string functionality with Python systematically following How to Design Programs’ Design Recipe. We will walk through the recipe step by step for this first example — all following code examples in the book will compress such process. First, words will be represented with Python strs. So second, the function definition will look like:
def eliza(input: str) -> str:
return NotImplementedError("todo")
Important
Run code examples by clicking the play button in the top right hand of the code snippet.
Third, some examples. Let’s start with the transformation rule where if the user prompts with "I am unhappy" then our mini ELIZA responds with "Why do you say you are unhappy?". Python’s standard library conveniently provides the capability to test interactive examples with docttest. Evaluating the code fails, as expected:
def eliza(input: str) -> str:
"""
>>> eliza("I am unhappy")
'Why do you say you are unhappy?'
>>> eliza("I am upset")
'Why do you say you are upset?'
"""
return NotImplementedError("todo")
if __name__ == "__main__":
import doctest
doctest.testmod()
Although ELIZA’s matching and transformation rules were implemented using decomposition and reassembly
rulesWeizenbaum was a programming language implementor as well, for he also was the creator of the SLIP programming language itself, presented in Symmetric List Processor (Weizenbaum 1963).,
for the fourth and fifth steps, rather than implement our own string matching algorithm,
we will reuse regular expressionsregular expressions
which are highly specialized domain specific languages, namely, in the domain of matching text,
and are are provided by the standard libraries of various languages such as Python, Perl, and Java.
They were originally conceived as notation by Stephen Kleene to formally define the class of regular languages,
regular expressions became mainstream amongst programmers when Ken Thompson built in Kleene’s notation into the QED editor (and later ed)
in order to implement search eventually leading to the commonly used grep, standing for (g)lobal (r)egular (e)xpression and (p)rint.
For Python specifically, regular expressions are accessible via the re module,
where you re.compile(pattern)
a regular expression into a re.Pattern,
which can subsequently be called with Pattern.search(string), Pattern.match(string) and Pattern.fullmatch(string) to
return a corresponding re.Match or None.
For example:
import re
prog = re.compile(r"I am unhappy")
result1 = prog.fullmatch(r"I am unhappy")
result2 = prog.fullmatch(r"You are unhappy")
print(f'{result1=}')
print(f'{result2=}')
However if the pattern is only going to be used once without any reuse, you can evaluate the re.Pattern and re.Match objects with one call:
import re
result = re.fullmatch(r"I am unhappy", "You are unhappy") # returns None
print(f'{result=}')
In the case of ELIZA, we need some way of referencing whatever the user is self reporting so that our system can respond asking why
so that our doctest examples pass. We can do this with a capture group.
todo, after fri/sat/sun
import re
Note
Although we are embarking on a quest to re-implement a Tensor compiler like PyTorch with teenygrad, because the language is embedded into Python, no text parsing is actually required like that of classical compilers (albeit PyTorch’s fusion compiler does have an elaborate way of hooking into the host language of Python).
Interestingly enough though, regular expressions are in fact used to build ChatGPT and subsequently nanochat in what is called tokenization. Things come full circle as Stephen Kleene was originally motivated to describe the artificial neural networks of McCullogh and Pitts.(Kleene 1951 p46) So besides illustrating ELIZA’s GOFAI rules-based approach with regular expressions, becoming somewhat familiar with them will also help you with ChatGPT and nanochat’s tokenization process in Part II. Neural Networks
import re
def eliza(input: str) -> str:
"""
>>> eliza("I am unhappy")
'Why do you say you are unhappy?'
>>> eliza("I am upset")
'Why do you say you are upset?'
"""
pattern, response = r"I am (.*)", "Why do you say you are {0}?"
if m := re.fullmatch(pattern, input, re.IGNORECASE): return response.format(*m.groups())
else: return "Please go on."
print(eliza("I am very unhappy these days")) # How long have you been very unhappy these days?
Finally, let’s create a list of transformation rules
import re, random
transformation_rules = [
(r"I am (.*)", ["How long have you been {0}?", "Why do you say you are {0}?"]),
(r"It seems that (.*)", ["What makes you think {0}?", "Why does it seem that {0}?"]),
(r"(.*) hate (.*)", ["Why do you think {0} hates {1}?", "What makes you feel that way?"]),
(r"(.*)", ["Please go on.", "Tell me more.", "Can you elaborate?"]),
]
and subsequently modify the function body to follow the data by looping over such rules:
import re, random
transformation_rules = [
(r"I am (.*)", ["How long have you been {0}?", "Why do you say you are {0}?"]),
(r"It seems that (.*)", ["What makes you think {0}?", "Why does it seem that {0}?"]),
(r"(.*) hate (.*)", ["Why do you think {0} hates {1}?", "What makes you feel that way?"]),
(r"(.*)", ["Please go on.", "Tell me more.", "Can you elaborate?"]),
]
def eliza(input: str) -> str:
"""
>>> eliza("I am unhappy")
'Why do you say you are unhappy?'
>>> eliza("I am upset")
'Why do you say you are upset?'
"""
for pattern, responses in transformation_rules:
if m := re.fullmatch(pattern, input, re.IGNORECASE): return random.choice(responses).format(*m.groups())
print(eliza("I am very unhappy these days")) # How long have you been very unhappy these days?
print(eliza("It seems that you hate me")) # What makes you think you hate me?
Although we have only implemented a small fraction of the functionalityThe curious reader whom wishes to explore more transformation rules should play around with the embedded ELIZA up above, or take a look at the Appendix in (Weizenbaum 1963). (which, to be fair, the original ELIZA program is only a few hundred lines of code itself),
it illustrates the simplest discrete data structures with strs and deterministic algorithms with if statements used to implement ELIZA
and is representative of the overall GOFAI approach.
Very specifically, with the last transformation rule, you can see how brittle ELIZA’s so-called “understanding” truly is, following quite closely to the foreigner which Weizenbaum literally uses as inspiration.
The primary reason a simple pattern matcher over strings can be endowed with human “understanding” (in other words, why the magic works)
is because of the psychiatric context — especially the Rogerian one with person-centered therapy — where users are effectively talking with oneselves.
Weizenbaum goes on to report that:
This mode of conversation was chosen because the psychiatric interview is one of the few examples of categorized dyadic natural language communication in which one of the participating pair is free to assume the pose of knowing almost nothing of the real world. If, for example, one were to tell a psychiatrist “I went for a long boat ride” and he responded “Tell me about boats”, one would not assume that he knew nothing about boats, but that he had some purpose in so directing the subsequent conversation. It is important to note that this assumption is one made by the speaker. Whether it is realistic or not is an altogether separate question. In any case, it has a crucial psychological utility in that it serves the speaker to maintain his sense of being heard and understood.
While ELIZA’s pattern matching over string methodology with regular expression capture groups and substitutions can perhaps be argued to be sufficient in the psychiatric context, it truly breaks down in others. Even though we can justify this post hoc by simply comparing ELIZA to that of ChatGPT, there is an a priori principle which helped the practitioners of the field to predict ex ante that ELIZA was not the end all be all of a natural language processing system. That principle is the distinction between syntactical formsyntax and semantic meaningsemantics, which practitioners (including Weizenbaum himself!)In a 1978 interview, “Well, I would deny that that there’s any important sense, non-negligible sense in which the program understands. It certainly creates the illusion of understanding. there’s no question about that. But we have to understand that that illusion is an attribution that the person conversing with the program contributes to the conversation. It’s not a function of the program itself.” understood that ELIZA is closer to the former. That is, the system can be described as one with “all style but no substance”:
The ELIZA program itself is merely a translating processor in the technical programming sense. Gorn [2] in a paper on language systems says: ‘Given a language which already possesses semantic content, then a translating processor, even if it operates only syntactically, generates corresponding expressions of another language to which we can attribute as “meanings” (possibly multiple — the translator may not be one to one) the “semantic intents” of the generating source expressions; whether we find the result consistent or useful or both is, of course, another problem.’
Following the last quote from the Weizenbaum paper, ELIZA was indeed not simply a one to one translating processor but a one to many.
Before jumping ship from software 1.0’s good old fashioned discrete data structures and deterministic algorithms,
perhaps to build a semantic natural language processing system like ChatGPT all we need to do the same is increase the expressivity
used for word representations and reasoning? Afterall, all we used were strs and if statements respectively.
Perhaps there’s more to life than "LIFE", and plus, you can hardly call pattern matching an algorithm.
0.3 Understanding with Syntactic and Semantic Analysis of LUNAR and SHRDLU
early q&a systems
- baseball (galaxy. what’s the meaning of life? 42)
- lindsay, buchanan survey: https://dl.acm.org/doi/pdf/10.1145/363707.363732
using first order logic.
- w.a woods
- (1967): https://web.stanford.edu/class/linguist289/woods.pdf
- LUNAR (w.a woods 1973): https://dl.acm.org/doi/pdf/10.1145/1499586.1499695
- https://www.youtube.com/watch?v=vlgVYRXe68Q
- then SHRDLU winograd (1971) https://dspace.mit.edu/entities/publication/3406b826-5a41-4b72-9430-7876c0d5405e
0.4 Feigenbaum’s Narrow Expertise with the _ of DENDRAL
0.5 Minsky’s Society of Mind with Representation and Reasoning of CYC
semantic networks
- 1974 minsky’s framework for representing knowledge
- 1977 bobrow and winograd’s KRL
- 1985 brachman and schmolze KL-ONE
- 1984 CYC paper
0.6 Wittgenstein’s Bitter Lesson: From A Logical to Distributional Semantics
lighthill report (1973)
then expert systems in ’80s (feigenbaum and raj reddy), expert systems being abandoned in ’90s, creating the second winter. Parallel Distributed Processing (Rumelhart and McClelland, 1986)
although obvious posthoc that neural networks, this was all predicated with foresight by wittgenstein.
- from feigenbaum/reddy to pdp
- from the organon (knowledge representation and reasoning with upper ontologies and deductive inference) to norvum organon (occam’s razor)
- from the tractatus to the investigations is effectively the transition from software 1.0 to sofware 2.0
- to understand the claude, we must return to claude
- data science begins where computer science begins
question answering systems eventually incorporated the web as it’s knowledge base, and the field of information retrieval emerged. https://start.csail.mit.edu/index.php
0.7 Summary
One quick way to summarize the software 1.0 approach to AI is to list the Turing Award winners: Marvin Minsky (1969) and John McCarthy (1971) for defining the foundations of the field based on representation and reasoning; Allen Newell and Herbert Simon (1975) for symbolic models of problem solving and human cognition; Ed Feigenbaum and Raj Reddy (1994) for developing expert systems that encode human knowledge to solve real-world problems;
The remainder of the book is dedicated to the software 2.0 approach to AI: Judea Pearl (2011) for developing probabilistic reasoning techniques that deal with uncertainty in a principled manner; Yoshua Bengio, Geoffrey Hinton, and Yann LeCun (2019) for making “deep learning” (multilayer neural networks) a critical part of modern computing; and finally, Richard Sutton, Andrew Barto (2024) for pioneering reinforcement learning in which agents learn by maximizing reward via trial and error
0.8 Bibiliographic Notes
For historical, (Gardner 1958) and (Nilsson 2010) Part II (Chapter 6, 7), Part III (Chapter 11, 13), Part IV (Chapter 18), Part VII (Chapter 26). For GOFAI techniques, (Eisenstein 2018) Chapter 9, 10, 11, 12, 13 (Jurafsky and Martin 2026) Appendix F, G, H, I, J, K, (Bird, Klein, and Loper 2009) Chapter 7, 8, 9, 10 and (Russel and Norvig 2022) Part III (Chapter 7, 8, 9, 10) and Part VI (Chapter 24). For logical foundations, (Brachman and Levesque 2004), (Sowa 2000) and (Genesereth and Nilsson 1987).
rough
In fact, we are in good company as the question was posed all the way back by Alan Turing in his paper Computing Machinery and Intelligence (Turing 1950) as a test to operationalize intelligence, seminating both the fields of artificial intelligence and computer science as we know it today. That same decade marks the official branding of the field at the Dartmouth Workshop (1956), and later that year, a possible solution to build machines that could pass such Turing’s test was proposed by John McCarthysee Oral History of John McCarthy in his paper Programs with Common Sense (McCarthy 1956) where he outlines a hypothetical advice taker program implemented with logically symbolic methods.
Although probabilistically connectionist methods were also explored early on with the artificial neuron — the modelling of a neuron as a weighted sum and threshold function — was invented by Warren McCulloch and Walter Pitts in A Logical Calculus of the Ideas Immanent in Nervous Activity (1943), studied by Marvin Minskysee Oral History of Marvin Minsky shortly after The Society of Mind and the Emotional Machine in his PhD thesis Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain Model Problem (1954) (both before Turing proposed his test), and followed up by Frank Rosenblatt’s composition of such neurons within a single layer neural network i The Perceptron — A Perceiving and Recognizing Automaton (1973), historically speaking the first dominant approach to the artificial intelligence program was the logically symbolic methods espoused by McCarthy.
While it may be clear from today’s vantage point that deep neural networks (more specifically large language models via generative pretrained transformers) are the first systems with common sense that have arguably passsed the aforementioned Turing test, the apriori argument from the logically symbolic school of thought is that modelling individual neurons was similar to a programming model with individual transistors. That is, the former is attempting to describe reality at the wrong abstraction level, and instead, the correct approach is to go straight to the source (so to speak) by replicating the human ability to aquire knowledge through reasoning. This was crisply posited into a single claim known as the physical system hypothesis by Allen Newell and Herbert Simon in Completer Science as Empirical Inquiry: Symbols and Search (1976), These divergent schools of thought to the program of artificial intelligence are respectively known as connectionismconnectionism and symbolismsymbolism. with the latter now coloquially referred to (in a somewhat disparaging manner) GOFAIgood old fashioned AI due to the fact that deep neural networks have been the dominant approach to the artificial intelligence program for the past decade.
As what tends to happen, the dichotomy is moreso a fuzzy spectrum, and their relationship is closer to that of fighting siblings borrowing ideas from one another. Two related tensions which we will explore and transition throughout the book (especially in Part I. Elements of Networks) is in both the mathematical objects and methods used to implement the two different approaches. Objects, meaning the transition from the discretediscrete mathematics of sets, lists, associations, trees, and graphs, to that of the continuouscontinuous mathematics of scalars, vectors, matrices, and tensors. And methods, meaning the continuous objects are in fact non-constructivenon-constructive, and in order to implement them on computers we must constructiveconstructively discretize them with numerical approximations.. By the end of this Part I. Elements of Networks you may wonder if real numbers, are in fact, real.
(start here.)
Returning to the central question of how can we build machines that can understand and produce natural language?, one of the earliest and simplest GOFAI symbolic attempts was that of ELIZA, a computer chatbot implemented in 1966 which simulates a Rogerian psychotherapist conducting person-centered therapy
Systems like ELIZA which followed GOFAI’s approach ran into limitations, ultimately predicted by the philosopher Ludwig Wittgenstein even before the Dartmouth project took place in his book Philosophical Investigations (Wittgenstein 1953) — namely, that of describing a reality with too many parts to count. This is precisely what the connectionist school of thought also known as softwere 2.0 solves with neural networks, and the brain child of such tradition is none other than ChatGPT.
Intermezzo One: The Language of Dependent Type Theory
1. Syntactic Sequence Learning via n-gram Models with teenygrad
1.2 From Certain to Uncertain Knowledge with Probabilities
1.3 Composing Probabilities with Sums, Products, and Conditionals
1.4 Random Variables and their Distributions, Expectations, and Variance
1.5 Analytical Probabilities from Data by Counting
# this is an SITP adapted bigram character-level language model taken from karpathy's zero to hero curriculum
# - syllabus: https://karpathy.ai/zero-to-hero.html
# - lecture: https://www.youtube.com/watch?v=PaCmpygFfXo
# - lecture notebook: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
# - makemore: https://github.com/karpathy/makemore/blob/master/makemore.py#L399
# We implement a bigram character-level language model,
# which we will further complexify in followup videos into a modern Transformer language model, like GPT.
# In the lecture above, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently
# evaluating neural networks and (2) the overall framework of language modeling that includes model training,
# sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
dataset = open('./examples/data/names.txt', 'r').read().splitlines()
N = len(dataset)
# Adding some more context to Karpathy's lecture code, the bigram language model is our first software 2.0 "program"
# that describes language probabilistically following the principle of distributional semantics (Harris 1954).
# That is, rather than represent language deterministically with classic GOFAI systems like ELIZA and WordNet,
# the bigram language model represents language stochastically as a random variable Y|X endowed with a conditional probability distribution p(y|x),
# and of course an underlying probability space with the sample space (set of all outcomes) and event space (the set of all possible outocmes).
# Of course, if the stochastic phenomena being described is admits equally likely sample spaces (i.e coin, dice, and cards)
# we can reason, argue, and justify the construction of the associated probability distribution a priori.
# However, it's not so clear what the distribution of *language* is.
# So, we must construct such conditional probabilities *from* data X^(1), X^(2),...,X^(n) by recovering the distribution via *parameter estimation*,
# which is an optimzation problem via maximizing likelihood, which you know from Chapter 2 is equivalent to minimizing empirical risk.
# The goal of the bigram language model is to *recover* such
# autoregressive sequence models (classification likelihood)
# x|y <- (peter abeel/unsupervised/self-supervised)
# ...Histogram (counting frequencies) is the most precise model for training set. it *is* the training set. but it generalizes poorly.
# ...Below, we are building...
conditional_counts_dict = {}
for di in dataset:
di_normalized = ['<S>'] + list(di) + ['<E>']
for x_char,y_char in zip(di_normalized, di_normalized[1:]): # here we
conditional_counts_dict[(x_char,y_char)] = conditional_counts_dict.get((x_char,y_char), 0) + 1
sorted_conditional_counts_dict = sorted(conditional_counts_dict.items(), key = lambda x: -x[1])
# print("2D (char,char) histogram using python's dict:\n", sorted_conditional_counts_dict)
print("--- TRAINING ---")
# We will now construct the same 2d histogram, but with numpy's ndarray instead of python's dict
import numpy as np # we use numpy rather than torch to emphasize that the bigram model does NOT required any deep learning functionality
vocab = sorted(list(set(''.join(dataset)))) # construct vocab
c2i = {c:i+1 for i,c in enumerate(vocab)} # construct map<char,ord>
c2i['.'] = 0 # with . as the start token and end token, to remove counting freq of (<E>*) and (*<S>) which are all 0
V = len(c2i) # evaluate the vocab len V
C_VV = np.zeros((V,V), dtype=np.int32) # and use V to construct C_VV
for di in dataset:
di_normalized = ['.'] + list(di) + ['.']
for x_char,y_char in zip(di_normalized, di_normalized[1:]):
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, ord> to lookup the coordinate index needed for C_VV
C_VV[x_index, y_index] += 1 # update C_VV
# normalize counts C_VV to probs P_VV
C_VVf32 = (C_VV+1).astype(np.float32) # inductive bias (locally smooth)
s_V1 = C_VVf32.sum(axis=1,keepdims=True) # reduce along axis=1 because we want p(y|x) not p(x|y)
P_VV = C_VVf32 / s_V1 # (V, V) / (V, 1) broadcasts
# because the data matrix D_VV is 2 dimensional, we can print it.
# the way to semantically interpret the table is that is to first understand which axis of the ndarray represents which semantic information
# for D_VV, the elements are the counts of bigrams (c1,c2) which is acessed by ord(c1) with axis=0 and ord(c2) with axis=2
# now, since numpy's ndarray's are row major order, axis=0 gets printed vertically from up to down while axis=1 gets printed horizontally from left to right
i2c = {i:c for c,i in c2i.items()} # invert map<char, ord> to map<ord, char> because looping with enumerate provides access to indices
header = ' ' + ' '.join(f'{i2c[y_index]:>4}' for y_index in range(V))
print("2D (ord, ord) histogram using numpy ndarray")
print(header)
for x_index, row in enumerate(C_VV+1):
x_char = f'{i2c[x_index]:>4}'
print(x_char, ' '.join(f'{count:>4}' for count in row))
print("\n\n--- INFERENCE ---")
rng = np.random.default_rng(1337)
sample_count = 10
for _ in range(sample_count):
output, sample_index = [], 0
while True:
# 1. get p(Y|X)
pYcondX_V = P_VV[sample_index].squeeze()
# 2. sample p(Y|X)
sample_index = rng.choice(len(pYcondX_V), size=1, replace=True, p=pYcondX_V)
sample_char = i2c[sample_index.item()]
output.append(sample_char)
# 3. if sampled end token, break
if sample_index == 0: break
print(''.join(output))
loglikelihooddataset,n = 0.0, 0
for di in dataset:
di_normalized = ['.'] + list(di) + ['.']
for x_char,y_char in zip(di_normalized, di_normalized[1:]):
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, ord> to lookup the coordinate index needed for D_VV
pycondx = P_VV[x_index, y_index] # maximize likelihood
logpycondx = np.log(pycondx) # maximize loglikelihood
loglikelihooddataset += logpycondx
n += 1
# print(f'{x_char}{y_char}: {pycondx:.4f} {logpycondx:.4f}')
nlldataset = -loglikelihooddataset # minimize -loglikelihood
avgnlldataset = nlldataset / n # minimize -1/n loglikelihood
print(f'{loglikelihooddataset=}')
print(f'{nlldataset=}')
print(f'{avgnlldataset=}')
1.6 Learning Probabilities from Data by Parameter Estimation
1.7 Summary
1.8 Bibliographic Notes
Intermezzo Two: The Language of Probability Theory
Further Learning
- Stanford CS109 Probability for Computer Scientists (Piech)
- Purdue ECE 302 Probability for Data Science (Chan)
- MIT 6.012 Introduction to Probability (Tsitsiklis)
- Carnegie Mellon 36-705 Intermediate Statistics (Wasserman)
2. Semi-Automated Semantic Sequence Learning via Linear Models with teenygrad
from
2.1 From Representing Discrete Data to Dimensional Data
Q: How do language models represent the meaning of words?
A: As high dimensional random vectors in coordinate spaces
Recall Chapter 1’s guiding question from above. After the previous three chapters, with ngram language models, you now have a better understanding of the first difference and required transition between software 1.0 and software 2.0 which is to move from deterministic localist representations of ELIZA to the stochastic distributional representations of ChatGPT, following the principle of distributional semantics by (Harris 1954) and (Firth 1957).
In the next three chapters, you will come to have a better understanding of the second difference and transition which is to move from the discrete symbolic representations of ELIZA (illustrated by WordNet) strings, dictionaries, and graphs to the high dimensional representations of ChatGPT, following the principle of vector semantics by (Osgood et al. 1957).
For this, we introduce the word meaning representations learned by word2vec, in which the Tensorflow Embedding Projector below projects the coordinates from a 200-dimensional space down to 3-dimensional space for our cortex to visualize. In the remaining part of Chapter 1, our focus is on reproducing the learning algorithm that the embedding projector is performing, which is the unsupervised learning of a subspace using the dimensionality reduction of principal component analysis — for this, we introduce the wonderful language of linear algebra.
Are you ready to begin?
vectorvector Grant Sanderson’s 3Blue1Brown Essence of Linear Algebra Chapter 1: Vectors
euclidean spaceeuclidean space
import numpy as np
king_D = np.array([-0.60581, -0.00148, 0.33867, 0.34192, 0.14056, 0.12235, 0.29698, 0.01794, -0.32272, 0.30459, 0.46147, 0.42833, 0.35232, 0.19600, 0.07191, -0.31273, -0.35806, 0.17491, 0.15752, -0.38974, 0.15785, -0.11237, 0.16996, 0.07034, -0.29624, -0.20062, -0.11649, 0.23958, 0.12123, -0.05922, -0.25320, -0.15877, -0.04878, -0.17475, 0.17761, -0.17073, -0.04392, 0.49176, 0.20456, 0.71964, -0.24501, 0.23090, -0.20057, 0.29144, 0.34997, -0.47062, 0.13777, 0.08265, 0.38853, 0.02435, -0.15429, -0.17934, -0.06365, -0.12415, -0.10394, -0.18429, -0.35483, -0.21525, 0.05942, 0.19793, 0.09521, -0.24977, -0.33427, 0.04912, -0.24623, 0.43333, -0.12799, 0.28146, 0.04764, 0.21654, -0.27160, 0.61729, -0.05107, -0.30553, 0.29249, -0.15464, -0.14522, 0.24393, -0.18919, -0.43948, 0.52454, 0.22005, -0.44827, -0.39906, 0.40726, -0.21335, -0.07971, -0.28354, 0.11108, 0.01481, -0.37292, 0.17809, 0.03488, -0.13215, 0.17253, -0.11791, -0.13976, -0.23828, 0.22592, -0.41355, 0.12601, 0.83641, 0.18459, -0.00347, 0.28769, 0.10095, 0.19476, 0.50792, 0.04113, 0.16390, -0.23309, 0.42694, -0.06147, 0.07271, 0.07544, -0.04256, -0.04545, -0.27834, 0.24494, -0.16883, 0.15000, -0.08949, -0.05867, -0.00903, 0.14529, 0.24702, -0.18937, -0.28886, -0.24008, -0.04227, -0.37717, -0.28747, 0.26174, -0.35527, 0.03815, -0.00708, 0.09371, 0.29560, 0.41579, 0.13900, 0.04153, -0.12024, -0.06554, -0.11634, -0.07988, 0.20049, -0.17647, -0.00582, 0.42376, -0.15636, 0.27413, -0.11051, 0.00478, -0.30303, 0.10362, -0.27428, 0.08752, -0.15209, 0.41466, -0.04700, -0.21136, 0.11834, -0.16985, 0.20846, 0.20791, -0.12102, 0.00275, 0.24892, 0.11430, -0.20873, -0.07238, -0.05436, -0.41899, -0.62990, 0.23796, -0.06399, -0.49256, 0.19141, -0.37416, -0.34462, 0.25972, -0.51707, -0.46258, 0.10827, 0.16233, 0.08140, 0.22514, -0.26821, -0.59703, -0.03896, 0.04064, -0.24537, -0.12389, 0.16034, -0.30174, -0.15281, -0.33583, -0.25088, 0.15511, 0.08964], dtype=np.float32)
print("king_D.shape", king_D.shape)
print("king_D.storage", king_D)
matrix vector multiplicationmatrix multiplication
data matrixdata matrix
import numpy as np
try:
from jaxtyping import Float, jaxtyped
from beartype import beartype
except ImportError:
import typing
class _Subscriptable:
def __class_getitem__(cls, _): return typing.Any
Float = _Subscriptable
jaxtyped = lambda typechecker=None: (lambda fn: fn)
beartype = lambda fn: fn
# indices (N=10000): https://raw.githubusercontent.com/tensorflow/embedding-projector-standalone/master/oss_data/word2vec_10000_200d_labels.tsv
# embeddings (D=200): https://raw.githubusercontent.com/tensorflow/embedding-projector-standalone/master/oss_data/word2vec_10000_200d_tensors.bytes
dict_string2dist: dict[str, np.ndarray] = {
"king": np.array([-0.60581, -0.00148, 0.33867, 0.34192, 0.14056, 0.12235, 0.29698, 0.01794, -0.32272, 0.30459, 0.46147, 0.42833, 0.35232, 0.19600, 0.07191, -0.31273, -0.35806, 0.17491, 0.15752, -0.38974, 0.15785, -0.11237, 0.16996, 0.07034, -0.29624, -0.20062, -0.11649, 0.23958, 0.12123, -0.05922, -0.25320, -0.15877, -0.04878, -0.17475, 0.17761, -0.17073, -0.04392, 0.49176, 0.20456, 0.71964, -0.24501, 0.23090, -0.20057, 0.29144, 0.34997, -0.47062, 0.13777, 0.08265, 0.38853, 0.02435, -0.15429, -0.17934, -0.06365, -0.12415, -0.10394, -0.18429, -0.35483, -0.21525, 0.05942, 0.19793, 0.09521, -0.24977, -0.33427, 0.04912, -0.24623, 0.43333, -0.12799, 0.28146, 0.04764, 0.21654, -0.27160, 0.61729, -0.05107, -0.30553, 0.29249, -0.15464, -0.14522, 0.24393, -0.18919, -0.43948, 0.52454, 0.22005, -0.44827, -0.39906, 0.40726, -0.21335, -0.07971, -0.28354, 0.11108, 0.01481, -0.37292, 0.17809, 0.03488, -0.13215, 0.17253, -0.11791, -0.13976, -0.23828, 0.22592, -0.41355, 0.12601, 0.83641, 0.18459, -0.00347, 0.28769, 0.10095, 0.19476, 0.50792, 0.04113, 0.16390, -0.23309, 0.42694, -0.06147, 0.07271, 0.07544, -0.04256, -0.04545, -0.27834, 0.24494, -0.16883, 0.15000, -0.08949, -0.05867, -0.00903, 0.14529, 0.24702, -0.18937, -0.28886, -0.24008, -0.04227, -0.37717, -0.28747, 0.26174, -0.35527, 0.03815, -0.00708, 0.09371, 0.29560, 0.41579, 0.13900, 0.04153, -0.12024, -0.06554, -0.11634, -0.07988, 0.20049, -0.17647, -0.00582, 0.42376, -0.15636, 0.27413, -0.11051, 0.00478, -0.30303, 0.10362, -0.27428, 0.08752, -0.15209, 0.41466, -0.04700, -0.21136, 0.11834, -0.16985, 0.20846, 0.20791, -0.12102, 0.00275, 0.24892, 0.11430, -0.20873, -0.07238, -0.05436, -0.41899, -0.62990, 0.23796, -0.06399, -0.49256, 0.19141, -0.37416, -0.34462, 0.25972, -0.51707, -0.46258, 0.10827, 0.16233, 0.08140, 0.22514, -0.26821, -0.59703, -0.03896, 0.04064, -0.24537, -0.12389, 0.16034, -0.30174, -0.15281, -0.33583, -0.25088, 0.15511, 0.08964], dtype=np.float32),
"queen": np.array([-0.30776, 0.03691, -0.18558, 0.31012, 0.11950, 0.27827, 0.55683, 0.18181, -0.58425, 0.17935, 0.64690, 0.77510, 0.28613, 0.41206, -0.37302, -0.00877, -0.04482, 0.09054, -0.34902, -0.08987, 0.21297, -0.54038, 0.03983, -0.09468, -0.39881, -0.14433, -0.32549, 0.23944, 0.03030, -0.25032, -0.08188, -0.02524, 0.01776, -0.07530, -0.03132, -0.37621, -0.01467, 0.04420, 0.18076, 0.49967, -0.46510, 0.38056, -0.03662, 0.31360, -0.12388, -0.69273, 0.45250, 0.39414, 0.10382, -0.18083, -0.29114, -0.19337, -0.04503, 0.25394, -0.22558, -0.23089, -0.25173, -0.13796, 0.10233, 0.20276, -0.04811, -0.43686, -0.44864, -0.26512, -0.06243, 0.07716, -0.12606, 0.41493, -0.30450, -0.14465, 0.21208, 0.39148, 0.03268, -0.13662, 0.19926, -0.33510, 0.21867, -0.35820, -0.18970, -0.09537, 0.12245, 0.38925, -0.38757, -0.30637, 0.54951, 0.01394, -0.08508, -0.24813, -0.10594, 0.12171, -0.50957, 0.44566, -0.08242, -0.02203, 0.21989, -0.14438, -0.05283, 0.14035, 0.26419, -0.24030, 0.22532, 1.33754, 0.37092, 0.05798, 0.30040, 0.07502, 0.34760, 0.33096, -0.04410, 0.12008, -0.04259, 0.40471, 0.02082, 0.25745, 0.17993, -0.24099, -0.06470, -0.05714, 0.27449, -0.05952, 0.18110, 0.20618, -0.27499, 0.17484, 0.28749, 0.00435, -0.16284, -0.12278, -0.19078, -0.58445, -0.14534, -0.21705, -0.09056, -0.42018, -0.43452, 0.23021, 0.12230, -0.09652, 0.26512, -0.12037, -0.07013, 0.12415, 0.07769, -0.25398, -0.01356, 0.31983, 0.00162, 0.13570, 0.63584, 0.36664, 0.58859, -0.20397, -0.03051, -0.16674, 0.16519, -0.06899, 0.28424, -0.17868, 0.24528, -0.11615, -0.05568, 0.06143, -0.24224, 0.40372, 0.27728, -0.11411, -0.19426, 0.55504, 0.20043, -0.56549, -0.28225, 0.40233, -0.52753, -0.51510, 0.23187, 0.02527, -0.11301, 0.37002, -0.23588, -0.39991, -0.04561, -0.57695, 0.13713, -0.09015, 0.20224, 0.47651, -0.11756, -0.06060, -0.42495, -0.03794, 0.19471, -0.48927, 0.05144, 0.07654, -0.36900, -0.04806, -0.13519, -0.14509, 0.39243, -0.11192], dtype=np.float32),
"man": np.array([-0.11405, 0.08982, 0.15057, 0.08777, -0.21986, -0.04806, 0.27677, -0.35154, -0.29647, 0.22280, 0.39584, 0.45867, 0.03449, 0.18667, 0.11873, -0.03747, -0.18381, 0.03401, 0.06344, 0.11479, 0.15629, -0.28236, 0.18581, 0.16097, 0.02858, -0.25373, -0.15130, 0.16892, -0.04158, -0.50097, 0.41814, 0.04884, 0.03952, 0.02794, 0.06206, -0.39863, 0.00382, 0.09441, 0.04298, 0.09295, -0.06042, -0.09560, -0.00657, -0.19870, -0.09326, -0.55155, -0.07719, 0.00306, 0.03371, 0.17530, 0.00693, 0.04368, 0.14678, -0.07584, -0.39679, -0.40624, 0.08573, -0.13649, 0.65182, -0.02761, 0.38170, 0.15837, 0.10637, -0.04350, -0.36671, 0.31567, 0.18769, 0.34040, -0.14083, 0.33893, -0.22524, 0.40489, -0.02898, -0.50082, 0.31294, -0.66268, 0.19765, 0.06656, -0.01488, -0.01797, 0.41898, -0.12429, -0.15202, -0.48976, 0.22793, 0.15720, -0.21968, 0.15553, 0.00309, 0.06632, -0.41443, 0.32485, -0.23725, -0.16549, -0.11508, 0.02983, 0.21747, 0.06538, 0.01130, -0.24656, -0.06608, 0.32183, 0.06266, -0.13059, 0.10060, -0.06931, 0.12669, 0.41218, 0.27707, 0.18196, 0.21397, 0.60289, -0.04913, 0.18614, -0.04532, 0.02118, 0.00050, -0.06390, -0.01631, 0.16685, 0.40851, -0.10562, -0.00391, -0.14963, 0.23022, 0.05503, 0.13441, -0.35875, -0.32551, -0.00056, 0.17382, -0.29965, 0.01298, -0.23947, -0.08002, -0.00086, 0.16424, 0.05363, 0.01317, 0.13681, 0.04181, -0.13414, -0.00148, -0.18891, 0.22703, 0.34665, -0.06275, 0.01458, -0.00501, 0.32213, 0.16952, 0.01258, -0.05818, 0.02961, -0.06529, -0.31424, -0.21646, -0.05419, 0.08787, -0.03145, 0.07413, 0.07875, -0.05121, -0.06535, -0.24616, 0.00340, 0.04033, 0.21205, 0.16022, -0.05507, 0.20916, 0.02536, -0.19208, -0.02454, 0.13160, 0.02777, -0.16148, -0.11947, -0.05418, -0.24062, 0.03667, 0.23928, -0.08993, -0.28687, -0.00904, 0.31650, 0.47483, -0.04892, -0.26943, -0.17197, 0.16030, -0.12785, -0.46808, 0.06121, -0.16568, -0.02462, -0.04194, -0.05533, -0.05638, -0.14601], dtype=np.float32),
"woman": np.array([-0.20223, 0.40869, -0.01241, -0.06522, 0.01981, 0.53251, 0.51102, -0.28303, -0.10660, 0.38697, 0.10596, 0.68116, 0.10824, 0.41259, -0.02591, -0.00141, 0.24525, 0.30996, 0.17783, 0.37466, 0.17955, -0.21467, 0.19712, 0.00547, 0.18816, -0.32502, -0.15192, 0.21070, -0.33975, -0.25808, 0.22532, 0.02078, -0.11179, -0.41784, 0.13709, -0.52777, -0.27931, -0.35331, -0.10013, 0.42053, -0.25775, 0.17448, 0.03469, 0.30621, -0.11002, -0.46523, -0.07408, -0.04986, 0.28117, 0.17014, -0.09855, 0.29852, 0.16785, -0.08486, -0.55263, -0.82880, 0.02791, -0.15857, 0.42570, 0.07836, 0.09784, -0.03557, 0.22153, -0.22919, -0.29406, 0.14334, 0.23074, 0.31887, -0.48970, 0.04008, 0.37743, 0.22198, -0.22764, -0.31246, 0.42123, -0.68858, 0.35948, 0.05508, -0.18357, -0.39036, 0.44232, 0.04077, 0.05094, -0.50568, 0.49748, 0.24088, -0.37117, 0.10632, -0.21735, 0.07567, -0.44182, 0.70356, -0.13529, -0.15779, -0.11706, -0.30445, 0.23941, 0.06640, -0.04049, 0.04920, 0.10792, 0.52791, 0.31310, -0.37483, 0.15181, 0.10099, 0.06929, 0.23287, 0.57277, -0.17379, 0.61427, 0.43647, 0.12352, 0.08537, 0.24576, -0.13813, -0.11501, -0.19876, -0.23390, 0.38711, 0.45066, -0.04466, -0.02871, 0.03768, 0.14445, -0.13899, 0.44499, -0.32080, -0.43134, -0.11004, -0.13632, -0.13255, 0.06400, -0.21071, -0.33964, 0.35397, 0.39092, -0.04286, -0.12524, 0.02801, -0.23665, -0.24559, 0.09938, -0.01927, -0.15342, 0.30425, -0.10204, -0.20258, -0.09698, 0.46590, 0.23120, -0.10868, -0.08905, -0.05100, 0.29452, -0.26041, -0.17525, 0.10899, 0.21647, -0.24245, 0.26683, 0.08385, 0.04004, 0.13849, -0.40952, 0.18126, -0.25566, 0.37063, 0.04051, -0.28076, -0.05131, 0.46176, -0.26021, -0.38150, 0.12434, -0.18730, -0.06230, 0.04172, 0.02004, -0.18781, 0.01769, 0.27388, -0.12347, -0.17349, -0.23045, 0.34026, -0.12734, -0.09767, -0.22604, -0.12463, 0.02584, -0.28284, -0.19144, -0.03827, -0.26933, 0.18254, -0.15987, -0.52381, 0.10156, -0.43945], dtype=np.float32),
"paris": np.array([-0.18519, -0.01360, -0.32597, 0.45632, 0.30088, 0.20468, -0.27163, 0.25304, -0.39749, 0.07425, 0.63972, 0.69581, -0.19967, 0.19906, -0.05836, 0.01783, -0.19658, 0.49463, -0.43111, -0.59413, 0.34516, 0.14377, -0.13093, 0.10133, 0.00088, -0.35633, 0.20741, 0.06952, -0.16522, 0.18396, 0.03026, 0.04270, -0.38351, -0.28466, 0.31205, -0.18801, -0.06391, -0.38938, -0.20132, 0.33786, -0.26304, 0.00999, 0.29849, 0.04288, -0.32341, -0.31335, -0.00479, -0.14309, 0.11750, -0.11457, 0.36283, -0.18785, 0.12942, 0.13227, -0.31007, -0.05142, 0.05190, -0.05691, 0.26242, 0.39579, 0.51209, 0.16048, -0.13767, -0.36448, -0.34478, 0.41142, -0.21112, 0.59158, -0.17962, 0.25776, 0.38184, 0.70665, 0.33705, -0.03129, 0.56412, -0.38605, -0.17437, 0.07264, -0.32980, -0.30586, 0.46573, 0.41541, -0.04550, -0.79480, 0.36952, -0.05179, -0.02684, -0.26586, -0.21004, -0.24198, -0.66758, 0.51680, 0.00700, -0.47347, -0.63521, -0.46611, 0.24545, 0.24413, -0.33283, -0.29844, 0.10242, 0.71519, 0.20355, -0.09181, 0.43851, 0.31301, -0.21880, 0.11714, 0.13327, -0.08256, 0.13364, -0.25063, -0.47738, -0.08893, 0.09649, -0.57574, 0.19138, -0.14103, 0.51915, 0.18851, 0.03010, 0.09958, -0.48569, 0.13306, 0.02912, 0.01384, 0.25923, 0.10225, -0.09040, -0.32389, -0.24441, -0.09959, 0.15180, -0.23231, 0.05933, 0.13196, 0.57919, -0.37415, 0.15076, -0.00473, -0.00915, -0.21773, 0.00185, -0.31442, -0.05701, -0.16205, -0.20571, -0.30739, 0.34166, 0.32831, 0.24891, -0.10780, 0.35190, 0.49997, -0.07121, -0.06739, 0.23996, -0.51162, 0.35696, -0.19288, -0.06150, -0.00211, -0.10440, -0.20337, 0.21916, -0.24917, -0.10438, 0.46726, 0.09295, -0.61664, -0.01790, -0.40113, -0.32674, -0.45269, -0.05442, -0.18958, -0.17513, -0.12574, 0.05589, -0.12903, 0.15018, -0.24348, 0.12981, -0.05724, 0.02806, 0.42176, -0.12478, -0.03598, -0.29495, 0.05280, 0.30015, -0.06785, -0.03343, -0.06424, 0.15256, -0.49415, -0.15664, -0.56440, -0.16290, -0.42638], dtype=np.float32),
"france": np.array([-0.18219, 0.06323, 0.13155, 0.19178, 0.45262, 0.07333, 0.00745, 0.13176, -0.04027, 0.18680, 0.58815, 0.72726, -0.20907, 0.29821, 0.09369, -0.04996, -0.12140, 0.24599, -0.28084, -0.37325, 0.61308, 0.02491, 0.15377, 0.35447, -0.27714, -0.10280, 0.42167, 0.28323, 0.08624, 0.04554, -0.11203, -0.09141, -0.15559, -0.12945, 0.17985, 0.05397, -0.20674, -0.00157, 0.02078, 0.64305, 0.10637, 0.12813, -0.03183, 0.26590, -0.00545, -0.20655, 0.19704, -0.04977, -0.16935, 0.06854, -0.15001, -0.07362, 0.09009, 0.10963, -0.45723, -0.00408, 0.08126, -0.33670, 0.11713, 0.31598, 0.20151, 0.11635, -0.11566, -0.40446, -0.30366, 0.06843, 0.03220, 0.33058, 0.09078, 0.13582, 0.28371, 0.38705, 0.27897, 0.09391, 0.26882, -0.26058, 0.00064, -0.21219, -0.12738, -0.40399, 0.40183, 0.39339, 0.08117, -0.32321, 0.22676, 0.13104, -0.10246, -0.52784, -0.25873, -0.09247, -0.37017, 0.41755, -0.14285, 0.15307, -0.35519, -0.03849, 0.13401, 0.13112, -0.35291, -0.36664, 0.03136, 0.80553, 0.41274, 0.01832, 0.22198, 0.05979, -0.00397, 0.11959, -0.02017, 0.50051, 0.07584, -0.12657, -0.01628, -0.00025, 0.09829, -0.26941, 0.02597, -0.12949, 0.27935, -0.23689, -0.01785, 0.19605, -0.41837, 0.07977, -0.05644, 0.01962, 0.20559, -0.07866, -0.28765, -0.26882, -0.10859, 0.15512, -0.00888, -0.09430, -0.03187, 0.10582, 0.19566, 0.09776, 0.06356, 0.08710, -0.15624, 0.04377, 0.07377, -0.17253, 0.03722, -0.02852, -0.15565, -0.19175, 0.32213, 0.44915, 0.32993, -0.63005, 0.23579, 0.14376, 0.12855, 0.14385, 0.16940, -0.47164, 0.24567, -0.18858, -0.03186, -0.09289, -0.61278, -0.16884, 0.22779, -0.02604, 0.23900, 0.04567, 0.08428, -0.32738, -0.19242, -0.10240, -0.27779, -0.24567, 0.24114, -0.39621, -0.44449, 0.20193, -0.14436, 0.03190, 0.38711, -0.41903, -0.11760, 0.08490, 0.23199, 0.61604, -0.13850, -0.01056, -0.50419, 0.36371, 0.59154, -0.23931, -0.13358, 0.17867, -0.08431, -0.26031, 0.18438, -0.27460, -0.15105, -0.13703], dtype=np.float32),
"rome": np.array([-0.11204, 0.22966, -0.13668, 0.42749, 0.39183, 0.47309, 0.25666, 0.07822, -0.24798, -0.13575, 0.46827, 0.06815, -0.17023, 0.39090, -0.08079, -0.29234, -0.32003, 0.23166, -0.12039, -0.17116, 0.21441, -0.35751, -0.18170, -0.29657, -0.07359, -0.20686, 0.12989, 0.12463, -0.10202, -0.23102, -0.04400, -0.12517, -0.03459, -0.12223, 0.31041, -0.14974, -0.39004, -0.13628, 0.06586, 0.34412, 0.04909, 0.02732, 0.04259, -0.23218, 0.16646, -0.36989, -0.24263, -0.07604, 0.09067, 0.02604, -0.23639, -0.04283, 0.42932, 0.21634, -0.40271, 0.07764, 0.13056, -0.06083, 0.19618, 0.54013, 0.14888, -0.24765, -0.25302, -0.30739, 0.01029, 0.29831, 0.07945, 0.45129, 0.22313, 0.07088, 0.33073, 0.53051, -0.02538, -0.34967, 0.07828, 0.00247, -0.39456, -0.02917, -0.15832, -0.11131, 0.80995, 0.23170, 0.13627, -0.43837, 0.42694, -0.17925, -0.19091, -0.15370, -0.09075, 0.00463, -0.25978, 0.44907, 0.12163, -0.10187, -0.29649, -0.16680, -0.16898, -0.18889, -0.05334, -0.68051, -0.21924, 1.08901, -0.08938, -0.01084, 0.64356, 0.07706, -0.38139, 0.38721, 0.38537, 0.06449, 0.02829, -0.11856, -0.18428, 0.23894, 0.32629, -0.79727, -0.02919, -0.08177, 0.06304, -0.30370, 0.06447, -0.09200, -0.16051, 0.16706, 0.28596, 0.24376, 0.01903, -0.24568, -0.15003, 0.03472, -0.18781, -0.06927, 0.09267, 0.14936, -0.16940, -0.02855, 0.08061, -0.12080, 0.02909, 0.04295, 0.25802, -0.37922, 0.32467, -0.24179, 0.10241, -0.42759, -0.26248, -0.39167, -0.29641, 0.28513, 0.25339, -0.46392, 0.38545, 0.01698, -0.02567, -0.01231, -0.10522, -0.01214, 0.57213, -0.07360, -0.17799, 0.11318, -0.23048, 0.37151, -0.08626, 0.04074, 0.04089, 0.59131, 0.11168, -0.08926, 0.06659, -0.22881, -0.31912, -0.39018, -0.18094, -0.27141, -0.33712, -0.07491, -0.19420, -0.13975, 0.37040, 0.00748, -0.23212, -0.27160, 0.00709, 0.19954, -0.19594, -0.06819, -0.33766, 0.28512, 0.11063, -0.17811, -0.39012, -0.00868, 0.18033, 0.10791, -0.16924, -0.43922, -0.42524, -0.02648], dtype=np.float32),
"italy": np.array([-0.15340, 0.01927, 0.16926, 0.41839, 0.53047, 0.17460, 0.39970, -0.03224, -0.20870, 0.22605, 0.59484, 0.17090, -0.38235, 0.31133, 0.20538, -0.04493, -0.27729, 0.32064, -0.05734, -0.34299, 0.50433, 0.03479, 0.20769, 0.13809, 0.01233, -0.25943, 0.10879, 0.46709, -0.00929, -0.26202, -0.30844, -0.18392, -0.03692, 0.00350, -0.09822, -0.10232, -0.51924, -0.31376, -0.10353, 0.38236, 0.19212, -0.14145, 0.08053, 0.05962, -0.30530, -0.51329, -0.32229, -0.07294, -0.26528, 0.22517, -0.12624, 0.25401, 0.08901, 0.04630, -0.25246, -0.51830, 0.02983, -0.28021, 0.20391, 0.49642, 0.14568, 0.01941, -0.07164, -0.17974, -0.27320, 0.35721, 0.07217, 0.40795, 0.07221, 0.00986, 0.13876, 0.50403, 0.01540, 0.03218, 0.25567, -0.27005, -0.25822, -0.18744, -0.27059, -0.21146, 0.45264, 0.23391, 0.03439, -0.15279, 0.33551, 0.13907, -0.30575, -0.41350, -0.33525, -0.19169, -0.60787, 0.42002, 0.04735, 0.07948, -0.24537, 0.16873, -0.01630, -0.21978, -0.19681, -0.31645, 0.14211, 0.90986, 0.27263, 0.08888, 0.58419, 0.08633, 0.18484, -0.02439, 0.22421, 0.22149, -0.06583, -0.32640, -0.54394, 0.03884, 0.01773, -0.72140, 0.05312, 0.15851, 0.34632, -0.33199, 0.07043, 0.31847, -0.33012, 0.02117, -0.03411, 0.11081, 0.16380, 0.03609, -0.28953, -0.04240, -0.17554, 0.11706, -0.06782, 0.06808, -0.40821, -0.01016, -0.00167, -0.04228, 0.16646, 0.24710, -0.04438, 0.01156, 0.25375, -0.14640, 0.04143, -0.34428, -0.22431, -0.47259, 0.16204, 0.67493, 0.23886, -0.45043, 0.26658, 0.04719, -0.21568, -0.09716, 0.27766, -0.14342, 0.48150, -0.17313, -0.15145, 0.20802, -0.69512, -0.02889, -0.02191, 0.02746, 0.05462, 0.20904, -0.04408, -0.51693, -0.09057, -0.19239, -0.23878, -0.43717, 0.11679, -0.19186, -0.32573, 0.14074, -0.16839, 0.16045, 0.46155, -0.07977, -0.41751, -0.02728, 0.00373, 0.53923, -0.16084, 0.02926, -0.49980, 0.07807, 0.39652, -0.47138, -0.22931, -0.16463, -0.12133, -0.06451, 0.08685, -0.50419, -0.06190, -0.04058], dtype=np.float32),
"cat": np.array([0.26959, 0.28885, -0.25441, 0.29609, -0.17498, 0.61828, 0.42658, -0.33028, -0.55416, 0.24162, 0.11672, 0.56325, 0.49905, 0.27891, 0.03885, -0.33034, -0.24380, 0.55970, -0.33500, 0.15236, 0.19751, -0.33009, -0.08106, 0.41318, 0.11463, -0.15276, -0.18670, 0.41262, -0.41940, -0.57349, 0.51915, -0.35298, 0.08269, -0.37676, -0.10860, -0.47629, -0.02687, 0.15058, 0.06812, -0.08056, -0.11488, -0.29507, 0.26964, 0.18912, 0.12296, -0.73252, 0.54527, 0.03410, -0.13766, -0.18625, -0.02463, -0.15198, -0.15954, 0.22369, -0.21203, -0.32467, 0.21283, -0.24604, 0.23843, 0.30972, 0.29833, -0.47957, 0.36598, 0.08849, -0.81814, 0.47101, 0.33824, 0.43682, -0.43284, 0.14764, 0.50940, 0.14663, 0.08716, -0.53612, 0.69195, -0.67798, 0.12627, 0.19669, -0.23032, 0.12126, -0.05029, -0.23783, -0.24113, 0.08685, 0.41477, 0.21425, -0.01235, -0.14322, -0.22114, -0.06646, -0.58032, 0.73164, -0.04452, -0.47726, -0.40563, 0.44273, 0.02877, 0.12494, 0.27107, -0.00345, 0.08363, 0.24003, 0.48798, -0.19554, 0.28059, -0.34323, -0.09547, 0.27878, -0.00391, 0.07340, -0.01744, 0.31574, -0.04113, 0.39985, -0.29742, -0.32761, 0.09009, -0.37477, 0.29915, 0.34550, 0.61891, 0.10591, 0.12979, 0.21423, 0.44679, -0.33396, 0.38823, 0.15324, -0.20648, 0.12805, 0.27564, -0.31365, 0.45786, -0.13229, 0.02166, -0.27080, 0.04489, -0.39346, 0.21707, -0.00268, 0.13500, 0.21508, 0.29841, -0.78957, -0.13894, 0.00594, -0.44211, 0.09468, 0.24073, 0.26808, 0.68818, 0.25129, 0.02253, 0.39138, 0.16902, -0.31680, 0.86300, -0.45429, 0.03778, -0.43180, -0.41733, 0.18255, -0.57203, 0.74918, -0.34643, 0.17222, -0.12939, 0.03335, -0.09112, -0.47202, 0.05456, -0.32359, -0.36980, 0.03217, 0.65667, 0.29583, -0.12857, -0.66626, -0.32521, -0.63238, 0.39975, 0.12622, -0.45418, 0.06869, -0.01442, 0.31165, -0.39287, 0.24561, -0.44056, -0.07728, 0.10282, -0.40999, -0.49304, -0.30354, 0.27296, 0.48230, -0.25617, -0.51907, 0.45467, -0.10298], dtype=np.float32),
"dog": np.array([0.22407, 0.19778, -0.12492, 0.02943, -0.01023, 0.30840, 0.47067, -0.09805, -0.24328, 0.48196, -0.05229, 0.67571, 0.28807, 0.28325, -0.07169, 0.07745, -0.50867, 0.33654, -0.31787, 0.21052, -0.05775, -0.40865, 0.29858, 0.16141, -0.22317, -0.16911, -0.06106, 0.17131, -0.45845, -0.23939, 0.02698, 0.24134, 0.10973, -0.17744, -0.03557, -0.17743, 0.09569, 0.08071, 0.49728, -0.24340, -0.02540, 0.07004, 0.41042, -0.03981, 0.17039, -0.64987, 0.03488, 0.04470, 0.07692, 0.08220, 0.02484, -0.28589, 0.37340, -0.10491, -0.32158, -0.33000, -0.21721, -0.07934, 0.43070, 0.18908, 0.37931, -0.13734, 0.39928, -0.08460, -0.66089, 0.37540, 0.46753, 0.48962, -0.21238, 0.11802, 0.26598, 0.01917, 0.59764, -0.42530, 0.65706, -0.36850, 0.22453, 0.14608, -0.08320, -0.04758, 0.15790, -0.61434, 0.05107, -0.15172, 0.09765, -0.14117, -0.34464, 0.08492, -0.23829, -0.07973, -1.01947, 0.40416, -0.21399, -0.47815, -0.16949, -0.08467, 0.14683, -0.06763, 0.24079, 0.40548, -0.28201, 0.33629, -0.06622, -0.39812, 0.27318, 0.21343, 0.22216, 0.69422, 0.02870, 0.14304, -0.15980, 0.75023, -0.00934, -0.02005, -0.18365, 0.21365, 0.05600, -0.39873, 0.36346, 0.08854, 0.39175, -0.05878, -0.10227, 0.05216, 0.30803, -0.17737, 0.23768, 0.18228, -0.17957, -0.11890, -0.23524, -0.37678, 0.17571, -0.24524, -0.25780, -0.02954, 0.08782, -0.07055, 0.11858, 0.17058, 0.09321, 0.04601, 0.01212, -0.47209, -0.34194, 0.14436, -0.80984, 0.35469, 0.50964, -0.37567, 0.79713, 0.03317, -0.02606, -0.17922, -0.14554, -0.22435, 0.17604, -0.42443, -0.06444, 0.05082, -0.42109, -0.40434, -0.31260, 0.26361, -0.00811, 0.14489, 0.23344, 0.20476, 0.09684, -0.40022, -0.07501, -0.28960, -0.49280, -0.43434, 0.45920, 0.08183, 0.01575, -0.15572, -0.47815, -0.16507, 0.31122, 0.47231, -0.34021, -0.26883, 0.02138, 0.38776, 0.20281, -0.10099, -0.55102, -0.15751, 0.29602, -0.26543, -0.40633, -0.05946, -0.14480, -0.35555, 0.14056, -0.35140, 0.20284, 0.17339], dtype=np.float32),
"table": np.array([-0.24251, -0.02881, -0.04184, 0.11320, 0.07676, 0.31765, 0.04621, 0.27889, 0.08690, -0.43657, 0.32255, 0.05912, -0.05635, -0.20437, -0.13043, -0.19339, -0.05859, 0.63758, -0.51283, 0.17039, 0.11268, -0.09266, 0.09368, -0.25822, -0.08813, 0.00769, -0.36577, 0.60987, 0.13547, -0.30731, 0.10317, -0.39184, -0.13124, 0.07061, 0.08820, -0.38111, -0.17262, -0.10532, -0.14675, -0.21426, -0.32254, 0.16817, 0.04332, -0.00583, -0.05811, -0.72103, -0.35971, -0.07069, -0.20500, -0.53578, 0.06537, 0.01355, -0.03441, 0.16508, -0.41717, -0.48964, -0.02400, 0.08731, 0.23961, 0.56417, -0.06052, -0.19098, 0.24184, -0.14215, -0.08555, 0.35112, -0.05380, -0.16210, 0.18068, 0.14535, 0.34078, 0.81779, 0.13351, -0.44080, 0.17847, -0.33536, -0.02692, 0.23795, -0.48981, 0.27085, 0.21318, -0.26794, -0.47646, -0.47865, 0.43042, 0.12462, -0.23047, -0.00634, -0.44237, -0.54118, -0.42910, 0.51173, 0.15166, 0.06616, -0.00050, 0.04257, 0.38788, 0.20401, 0.14725, -0.34527, 0.26782, 0.41257, 0.26215, -0.13500, -0.25081, 0.19053, 0.11008, 0.48502, 0.03358, -0.07010, 0.21036, 0.02110, -0.75852, 0.17353, -0.04961, 0.08930, -0.22679, -0.26382, 0.08666, 0.02339, -0.04276, 0.14007, 0.20233, -0.22444, -0.13015, -0.26452, 0.22854, -0.19656, 0.08982, -0.11065, -0.01083, -0.14230, 0.55542, -0.40077, 0.43193, -0.01451, 0.42432, 0.15159, 0.09869, -0.31474, -0.09218, 0.34171, 0.00812, 0.04699, -0.58361, 0.44378, -1.12451, 0.19731, 0.13642, 0.32610, -0.03793, -0.12543, 0.33448, 0.08818, -0.39563, 0.49534, -0.14100, -0.56391, 0.18019, -0.09507, -0.28584, -0.23887, -0.15002, 0.24602, -0.17264, -0.82555, -0.39407, 0.01208, 0.13483, -0.50127, 0.43927, -0.47681, -0.05441, -0.22195, 0.20945, -0.32043, -0.16494, -0.12170, -0.05999, 0.00696, -0.03224, 0.15639, -0.50188, -0.17326, 0.19615, 0.53008, -0.08557, 0.25442, -0.23749, -0.42338, 0.22123, 0.12498, -0.11213, -0.41178, -0.18820, -0.14456, -0.05304, -0.37926, 0.32974, -0.46066], dtype=np.float32),
"actor": np.array([-0.34081, 0.29496, 0.06248, 0.02581, 0.28570, 0.21760, 0.21162, -0.15345, -0.47740, 0.71132, 0.16290, 0.47373, 0.12448, 0.37240, 0.18513, 0.40150, -0.50863, 0.52983, -0.31956, 0.19327, 0.35366, -0.28345, 0.34000, -0.19229, 0.34402, -0.67792, 0.04072, 0.53878, -0.02601, -0.50897, -0.09529, -0.11532, -0.27105, -0.36717, -0.15617, -0.63148, -0.05268, 0.25237, -0.25513, 0.79959, -0.33079, -0.21113, -0.04093, 0.24306, -0.24102, -0.49024, -0.31635, 0.51906, 0.08259, -0.07093, 0.05762, -0.31068, 0.19975, -0.29971, 0.24759, -0.54989, -0.47743, -0.41719, 0.47868, 0.89240, 0.28170, -0.01673, 0.02956, -0.14534, -0.33202, -0.18015, -0.02318, 0.19072, -0.12974, 0.05497, 0.18587, 0.23634, 0.20169, -0.04462, 0.04632, -0.17732, 0.27773, 0.26587, -0.70917, -0.15072, 0.20083, -0.30606, 0.12365, -0.91372, 0.38745, 0.44369, -0.45890, -0.62393, -0.87489, -0.12528, -0.84243, 0.71368, -0.13824, -0.30721, -0.45125, -0.02410, -0.08495, 0.14208, 0.21343, 0.11456, 0.57746, 0.51518, 0.74713, -0.44282, 0.46277, 0.26860, 0.26077, 0.23483, -0.19523, -0.11328, -0.28465, 0.57823, -0.35354, 0.14874, -0.03730, 0.06788, -0.03338, -0.14657, 0.87178, 0.28849, 0.43849, 0.27989, -0.17953, 0.43501, -0.30121, -0.32942, -0.35602, -0.24987, -0.28101, -0.39073, 0.21141, -0.08250, 0.44348, -0.14257, 0.11083, 0.42711, 0.48078, -0.40088, 0.46692, 0.35036, -0.07102, -0.33256, 0.15478, -0.21769, -0.43711, 0.06046, -0.09511, -0.13849, -0.45530, 0.62191, 0.30547, -0.24135, -0.22935, 0.14534, 0.24329, -0.02397, 0.53990, -0.10288, 0.24224, -0.39644, -0.42022, 0.20305, 0.10345, -0.29594, -0.10168, 0.30617, 0.27342, 0.73330, 0.33558, -0.46253, 0.52831, -0.53941, -0.71974, -0.17000, 0.33471, 0.31247, -0.39216, 0.04075, -0.16177, -0.56716, 0.34696, -0.04393, -0.14483, -0.43536, 0.20816, 0.60463, 0.20165, -0.21832, -0.35554, -0.41931, -0.18691, -0.31201, -0.52478, 0.25814, -0.37204, -0.06317, -0.00777, -0.26136, 0.12635, -0.34427], dtype=np.float32),
"actress": np.array([-0.03609, 0.48259, 0.10655, 0.15476, 0.74611, 0.35226, 0.26543, -0.09824, -0.74194, 0.80526, 0.48950, 0.66349, -0.04611, 0.19693, 0.01656, 0.38376, -0.47511, 0.58739, -0.42210, 0.36695, 0.23638, -0.13260, 0.11278, -0.07940, 0.65258, -0.78240, -0.19338, 0.22543, 0.03659, -0.34090, -0.35696, -0.05129, -0.15844, -0.55043, -0.11435, -0.78682, -0.05935, 0.09748, -0.36884, 0.66672, -0.33169, -0.22609, -0.27201, 0.42125, -0.33582, -0.50228, -0.41939, 0.29667, 0.25257, 0.07340, 0.04644, -0.31010, 0.50941, -0.25233, 0.04330, -0.33538, -0.34326, -0.35768, 0.31392, 0.94899, -0.00139, -0.22785, -0.27498, 0.01661, -0.42216, -0.27224, 0.08699, 0.09866, -0.15192, 0.05982, 0.09475, 0.45494, 0.09575, -0.27159, -0.07097, -0.53471, 0.38938, -0.05162, -0.69329, -0.03841, 0.18119, 0.01944, 0.06560, -1.02032, 0.21652, 0.43684, -0.66284, -0.68760, -0.74737, -0.10185, -0.65727, 0.47983, -0.04231, -0.61117, -0.52429, -0.08394, -0.11283, 0.16290, 0.03889, 0.29620, 0.63489, 0.61709, 0.62876, -0.45758, 0.09166, 0.04892, 0.30382, 0.02665, -0.12999, -0.27397, -0.28191, 0.82811, -0.17342, 0.15485, 0.21338, -0.16891, -0.08952, -0.11762, 0.96667, 0.28835, 0.29522, 0.17668, -0.52001, 0.31824, -0.65100, -0.52572, -0.23879, 0.13706, -0.26424, -0.37772, 0.33564, -0.39051, 0.37396, -0.41123, -0.19594, 0.23036, 0.26708, -0.66899, 0.37358, 0.33660, -0.41522, -0.12397, 0.26031, -0.21327, -0.54083, -0.04208, -0.34020, 0.04574, -0.43365, 1.02148, 0.61688, -0.26216, -0.53543, 0.11807, 0.18984, -0.05739, 0.32445, -0.13142, 0.56787, -0.50690, -0.35184, 0.04989, 0.07077, -0.01400, 0.07540, 0.44800, 0.13940, 0.76294, 0.13431, -0.64686, 0.21227, -0.18815, -0.90020, -0.28664, 0.35314, 0.40869, -0.25572, 0.25900, 0.17797, -0.81918, 0.12856, -0.18146, 0.12532, -0.24020, 0.36360, 0.87130, -0.06650, -0.24994, -0.14889, -0.34492, -0.04197, -0.29965, -0.32774, 0.17399, -0.36773, 0.09036, -0.06386, -0.28132, 0.17017, -0.33808], dtype=np.float32),
}
vocab = list(dict_string2dist.keys())
N = len(vocab)
# f: produces distributional embedding representations in R^D given localist one-hot representation in R^N
@jaxtyped(typechecker=beartype)
def f(x_N: Float[np.ndarray, "N"]) -> Float[np.ndarray, "D"]:
# E_DV = np.stack([dict_string2dist[i] for i in vocab], axis=1)
E_ND = np.stack([dict_string2dist[i] for i in vocab])
y_D = E_ND.T @ x_N
return y_D
xonehot_N = np.eye(N)[vocab.index("man")] # row-wise plucking basis vector from identity matrix I
yembedding_D = f(xonehot_N)
print("xonehot_N.shape", xonehot_N.shape)
print("xonehot_N.storage", xonehot_N)
print("f(xonehot_N) => yembedding_D.shape", yembedding_D.shape)
print("f(xonehot_N) => yembedding_D.storage", yembedding_D)
assert np.allclose(yembedding_D, dict_string2dist["man"])
# f_batched: produces batched distributional embedding representations in R^{BxD} given batched localist one-hot representation in R^{BxN}
@jaxtyped(typechecker=beartype)
def f_batched(X_BN: Float[np.ndarray, "B N"]) -> Float[np.ndarray, "B D"]:
E_ND = np.stack([dict_string2dist[i] for i in vocab])
y_BD = X_BN @ E_ND
return y_BD
Xonehot_BN = np.eye(N)[[vocab.index("king"), vocab.index("man"), vocab.index("woman")]]
Yembedding_BD = f_batched(Xonehot_BN)
print("Xonehot_BN.shape", Xonehot_BN.shape)
print("Xonehot_BN.storage", Xonehot_BN)
print("f_batched(Xonehot_BN) = Yembedding_3D.shape", Yembedding_BD.shape)
print("f_batched(Xonehot_BN) = Yembedding_3D.storage", Yembedding_BD)
assert np.allclose(Yembedding_BD[0], dict_string2dist["king"])
assert np.allclose(Yembedding_BD[1], dict_string2dist["man"])
assert np.allclose(Yembedding_BD[2], dict_string2dist["woman"])
Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King - Man + Woman” results in a vector very close to “Queen.”
following the mikolov paper..we can ask questions.. if x is a linear combination..
vector space vector space Grant Sanderson’s 3Blue1Brown Essence of Linear Algebra Chapter 16: Abstract Vector Spaces
abbrev VectorSpace (K V : Type) [Field K] [AddCommGroup V] := Module K V
linear combinationlinear combinationGrant Sanderson’s 3Blue1Brown Essence of Linear Algebra Chapter 2: Linear Combinations, Span, and Basis Vectors
def is_linear_combination (S : Set V) (x : V) : Prop :=
∃ (s : Finset V) (f : V → K), (↑s ⊆ S) ∧ (x = Finset.sum s (fun v => f v • v))
spanspan
linear independancelinear independance
basisbasis
standard basisstandard basis
rankrank
dimensiondimension
(show there is no exact linear combination)
(so we need to induce a geometry (length, distance, and angle)) with a norm norms in R^n can be induced with inner products (but not all) inner productinner product inner product spaceinner product space
class InnerProductSpace_v (V : Type) [AddCommGroup V] [VectorSpace ℂ V] where
inner : V → V → ℂ
inner_self_im_zero : ∀ (v : V), (inner v v).im = 0
inner_self_nonneg : ∀ (v : V), 0 ≤ (inner v v).re
inner_self_eq_zero : ∀ (v : V), inner v v = 0 ↔ v = 0
inner_add_left : ∀ (u v w : V), inner (u + v) w = inner u w + inner v w
inner_smul_left : ∀ (a : ℂ) (v w : V), inner (a • v) w = a * inner v w
inner_conj_symm : ∀ (v w : V), inner v w = conj (inner w v)
normnorm
lengthlength
distancedistance
matrix vector multiplicationmatrix multiplication
function matrixfunction matrix
2.2 Predicting Scalars by Linearly Modeling Regression
Similarly to learning distributions from data, learning functions from data roughly follows the three step process of selecting a model for the task , a performance measure , and optimizing such measure on experience . Let’s take a look at the linearly modeling regression with squared error loss, a problem which straddles tools from all three languages of probability theory, linear algebra, and calculus.
For the task of house price prediction, the input space is the size in square feet, and the output space is the price, meaning that the function which needs to be recovered has the type of . An assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is to assume that there exists some linear relationship between the data and parameters so that ends up being modeled as an inner product plus bias, parameterized as :
One small adjustment we can make to is to fold the final bias term into the vector as , increasing the total dimensionality so that , and adjusting accordingly so that and . Then, the equation of our line with bias is simply parameterized as
Semantically speaking, each property of the input house is weighted by some weight , adjusted accordingly during the learning process so that it accurately reflects the property’s influence on the output price when evaluating a prediction . We can evaluate on all with a randomly intialized to ensure we’ve wired everything up correctly.
However, while specifying the equation of the linear regression model on paper screen
and subsequently evaluating the computation manually might have been sufficient
for pre-computational mathematicians such as Gauss and Legendre,
the discipline of statistical learning is entirely predicated on computational methods.
Let’s continue to use PyTorch’s core n-dimensional array datastructure with torch.Tensor,
this time modeling a function rather than some distribution :
(todo: maybe change example to closure following torch.nn)
import torch
def f(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
return torch.dot(x, w)
in which we can update the names of our torch.Tensor variables with the ranks of their vector
spacesReferred to as Shape Suffixes, described by Noam Shazeer, an engineer from Google
to clarify what dimensions these operations are evaluated on:
import torch
def f_shapesuffixed(x_D: torch.Tensor, w_D: torch.Tensor) -> torch.Tensor:
return torch.dot(x_D, w_D)
and finally, actually enforce them via runtime typechecking with the jaxtyping package
(try swapping the return statements below to trigger a type error with the return type):
import torch
from jaxtyping import Float, jaxtyped
from beartype import beartype
@jaxtyped(typechecker=beartype)
def f_typechecked(
x_D: Float[torch.Tensor, "D"],
w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
# return X_ND
return torch.dot(x_D, w_D)
Let’s now sanity-check that f_typechecked is wired up correctly
by evaluating it on all with random parameter w_D.
(todo: use actual housing data)
import torch
if __name__ == "__main__":
X_ND = torch.tensor([
[1500, 10, 3, 0.8], # house 1
[2100, 2, 4, 0.9], # house 2
[800, 50, 2, 0.3] # house 3
], dtype=torch.float32)
Y_N = torch.tensor([500000, 800000, 250000], dtype=torch.float32)
w_D = torch.randn(4); print(f"random weight vector: {w_D}")
for (xi_D,yi_1) in zip(X_ND,Y_N):
yihat_1 = f_typechecked(xi_D,w_D)
print(f"expected: ${yi_1}, actual: ${yihat_1:.2f}")
random weight vector: tensor([-0.3343, 0.0768, 2.7828, -0.1331])
expected: $500000.0, actual: $-492.46
expected: $800000.0, actual: $-690.89
expected: $250000.0, actual: $-258.08
Clearly, these outputs are unintelligible, and will only become intelligible with a “good” choice of parameter .
Before we find such a parameter, let’s make one more modification to our function
to increase it’s performance.
Rather then evaluating f_typechecked on every input xi_D in the dataset zip(X_ND,Y_N) sequentially with a loop,
we can evaluate all outputs with a single matrix-vector multiplication by updating ’s function definition to the following
and the corresponding PyTorch updated to
import torch
@jaxtyped(typechecker=beartype)
def f_batched(
X_ND: Float[torch.Tensor, "N D"],
w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
return X_ND@w_D
Now that we’ve selected our model for the task , we can proceed with selecting our performance measure which the learner will improve on with experience .
2.3 Fitting Linear Regression by Directly Optimizing Squared Error
After the inductive bias on the family of functions has been made, the learning algorithm must find the function with a good fit. Since artificial learning algorithms don’t have visual cortex like biological humans[], the notion of “good fit” needs to defined in a systematic fashion. This is done by selecting the parameter which maximizes the likelihood of the data . Returning to the linear regression inductive bias we’ve selected to model the house price data, we assume there exists noise in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that where
prices are normally distributed conditioned on seeing the features with the mean being the equation of the line where , then we have that
TODO: generate prose/exposition by repaging lecture/text in ram
Returning to the linear regression model, we can solve this optimization with a direct method using normal equations. QR factorization, or SVD.
def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b
if __name__ == "__main__":
X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) # data
X_b = np.column_stack((np.ones_like(X), X)) # [1, x]
bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y) # w = [b, m]
yhats = fhatbatched(X, mhat, bhat) # yhat
for y, yhat in zip(Y, yhats):
print(f"expected: ${y:.0f}, actual: ${yhat:.2f}")
To summarize, we have selected and computed
- an inductive bias with the family of linear functions
- an inductive principle with the least squared loss
- the parameters which minimze the empirical risk, denoted as
Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.
2.4 Predicting Categories by Linearly Modeling Classification
Table of Contents logistic regression, cross entropy loss, gradient descent
# this is an SITP adapted bigram character-level language model taken from karpathy's zero to hero curriculum
# - syllabus: https://karpathy.ai/zero-to-hero.html
# - lecture: https://www.youtube.com/watch?v=PaCmpygFfXo
# - lecture notebook: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
# - makemore: https://github.com/karpathy/makemore/blob/master/makemore.py#L399
# We implement a bigram character-level language model,
# which we will further complexify in followup videos into a modern Transformer language model, like GPT.
# In the lecture above, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently
# evaluating neural networks and (2) the overall framework of language modeling that includes model training,
# sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
print("\n\n--- DATA ---")
import torch
import torch.nn.functional as F
dataset = open('./examples/data/names.txt', 'r').read().splitlines()
D = len(dataset) # D is the length of the dataset. N is reserved for the number of examples, in which each di can comprise many of
vocab = sorted(list(set(''.join(dataset)))) # construct vocab
c2i = {c:i+1 for i,c in enumerate(vocab)} # construct map<char,usize>
c2i['.'] = 0 # with . as the start token and end token, to remove counting freq of (<E>*) and (*<S>) which are all 0
V = len(c2i) # evaluate the vocab len V
xindicesraw, yindicesraw = [], []
for di in dataset: # change to dataset[:1] when debugging to limit the number of N examples
di_normalized = ['.'] + list(di) + ['.'] # normalize each word di
for x_char,y_char in zip(di_normalized, di_normalized[1:]): # loop through the (x,y) "self-supervised" bigrams of each word di with zip
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, usize> to map representation from discrete characters to discrete integers (ords)
xindicesraw.append(x_index), yindicesraw.append(y_index) # append
N = len(xindicesraw)
xindices_N, yindices_N = torch.tensor(xindicesraw), torch.tensor(yindicesraw) # then, convert python lists into torch tensors
print(f'inputs (usize): {xindices_N}'); print(f'outputs (usize): {yindices_N}')
# finally, map representation once again from discrete integers to continuous (but local) one hot vectors
x1hots_NV, y1hots_NV = F.one_hot(xindices_N,num_classes=V).float(), F.one_hot(yindices_N,num_classes=V).float()
print(f'inputs (1hot): {x1hots_NV.shape} {x1hots_NV}'); print(f'outputs (1hot): {x1hots_NV.shape} {x1hots_NV}')
print("\n\n--- ARCH ---")
g = torch.Generator().manual_seed(1337+1) # avgnll: 3.5 -> 4.9
W_VV = torch.randn((V, V), generator=g, requires_grad=True) # V neurons
print("\n\n--- TRAINING ---")
K = 100
lr = 50
print(f'{D=}, {N=}, {K=}')
for k in range(K): # gradient descent
# TRAINING FORWARD {k} --- (including softmax) with N examples from dataset D')
logits_NV = x1hots_NV @ W_VV # batch matmul print(f'logits_NV: {logits_NV.shape}, {logits_NV}');
counts_NV = logits_NV.exp() # equivalent to C_VV print(f'counts_NV: {counts_NV.shape}, {counts_NV}')
probsycondx_NV = counts_NV / counts_NV.sum(dim=1, keepdims=True) # normalize, completing the evaluation of softmax
# print(f'(forward) probsycondx_NV: {probsycondx_NV.shape}\n {probsycondx_NV}')
# vectorized evaluation of loss in TEST section below
indices_N = torch.arange(N) # in order to evaluate loss of first N examples, use torch.arange(N) NOT xindices_N
loss = -probsycondx_NV[indices_N, yindices_N].log().mean() # however we do use yindices_N to find the corresponding likelihood predictions of the *targets*
print(f'{k=}, {loss=}') # pluck the likelihoods, then eval the log, average it, and take the inverse
# ...to get negative log likelihood
# TRAINING BACKWARD {k} --- (including softmax) with N examples from dataset D')
W_VV.grad = None # same as zeroing gradients
loss.backward() # eval f'(x)
# print(f'(backward): {W_VV.grad.shape}, {W_VV.grad}')
# TRAINING STEP {k} ---\n')
W_VV.data += -lr*W_VV.grad
print("\n\n--- INFERENCE ---")
print("\n\n--- TEST ---")
i2c = {i:c for c,i in c2i.items()} # invert map<char, ord> to map<ord, char> for decoding
# logpD,n = 0.0, 0
# nlls = torch.zeros(N) # replace sum and count tracking for average with .mean() (we won't be pushing logpycondx though only -logpycondx)
# for i in range(N): # loop through the 5 input-outputs (x^(i), y^(i)) of the first word di in dataset
# xord, yord = xindices_N[i].item(), yindices_N[i].item() # map (x^(i), y^(i))'s index to ordinal
# xchar, ychar = i2c[xord], i2c[yord] # map (x^(i), y^(i))'s ordinal to char
# pYcondx_V = probsycondx_NV[i]
# pycondx_ = pYcondx_V[yord]
# logpycondx_ = torch.log(pycondx_)
# nll = -logpycondx_
# print(f'(x(^{i}),y(^{i})): ({xchar},{ychar}) ---> py(^{i})condx(^{i})hat: {pycondx_:.4f}, logpy(^{i})condx(^{i}): {logpycondx_:.4f}, nll: {nll:.4f}')
# nlls[i] = nll
# logpD += logpycondx
# n += 1
# nllD = nlls.sum()
# avgnllD = nlls.mean()
# print(f'{nllD=}, {avgnllD=}')
2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss
2.6 Generalized Linear Models
linearity depends on that the scalars are in a field (addition is associative) “scalars” in a field –> we are dealing with floating point representations “vectors” we saw coordinate systems in 1.4 “matrices” and we saw linear transformations in 2.1 mathematically, a vector is a coordinate in some basis of a vector space a matrix is a linear function, the linear, in numerical linear algebra, is aspirational.
2.7 Summary
2.8 Bibliographic Notes
Intermezzo Three: The Language of Matrix Calculus
Further Learning
- MIT 18.06 Linear Algebra (Strang)
- MIT 18.065 Matrix Methods in Machine Learning (Strang)
- UT Austin Linear Algebra: Foundations to Frontiers (van de Geijn)
3. From the Array of IPL to the Multidimensional Array of APL
3.1 From Abstract to Numerical Linear Algebra: Real to Floating Arithmetic
After SITP’s first two chapters, you now understand how to write software 2.0 programs which learn with numpy, and it’s core multidimensional np.ndarray data structure.
For data, instead of using locally discrete representations, software 2.0 programs use dimensionally stochastic representations.
— this is why in Chapter 1 you learned the technique of 1. Representing Data with High Dimensional Stochasticity in numpy
And for functions, instead of constructing maps with explicit instructions, software 2.0 programs recover such maps with differential optimization
— this is why in Chapter 2 you learned the technique of 2. Learning Functions from Data with Parameter Estimation in numpy.
Throughout the first two chapters, a question you may have been asking yourself is exactly how numpy as a library is implemented.
While there are many aspects to numpy (sidenote?), the primary components we care about is the core np.ndarray data structure
and it’s acceleration of basic linear algebra subroutines through the usage of low-level software known as the BLAS.
In this chapter, we will finally understand how numpy works under the hood by implementing our very own np.ndarray with teenygrad.Tensor, and it’s subset of the BLAS with teenygrad.eagkers.
In Part II. Neural Networks’s Chapter 5. Accelerating Sequence Models on GPU in teenygrad,
you will evolve teenygrad’s implementation by adding neural network primitives, gradient-based optimization, and gpu acceleration with cuda cores
in order to support the various inductive biases of deep neural networks explored in the “era of research”, which roughly spans from 2012-2020.
However, before we begin with teenygrad’s implementation,
there is another transition we must make moving from programming machine learning algorithms to programming machine learning systems themselves.
Namely, the understanding that comes with the shift from abstract linear algebra to numerical linear algebra,
which is that the “linear” in computational linear algebra is moreso aspirational.
The first place to understand that is in the transition from real number arithmetic to floating point arithmetic.
Although the discipline of machine learning heavily relies on the language of linear algebra, as you now know from the previous two chapters, the primary vector spaces that are used are d-dimensional coordinate spaces in which we can construct random vectors and their high dimensional joint distributions to represent data. Recall the set-theoretic definitions of , , and so on, up until . Because these definition are constructive, they translate quite easily to code:
from typing import Self
class Rd():
def __init__(self: Self, data: list[T]) -> None:
self.data = data
def __add__(self: Self, other: Self) -> Self:
return Rd([xi + yi for xi, yi in zip(self.data, other.data)])
But then that begs the question, what is the definition of the set ?
class Real():
def __init__(self: Self) -> None: raise NotImplementedError("todo")
def __add__(self: Self, other: Self) -> Self: raise NotImplementedError("todo")
3.2 Virtualizing Logical Tensor.shape to Physical Tensor.storage with Tensor.stride
from typing import Self
import array, math
import teenygrad
class InterpretedTensor:
@classmethod
def arange(cls, end: int, requires_grad: bool=False) -> Self: return InterpretedTensor((end,), list(range(end)), requires_grad=requires_grad)
@classmethod
def zeros(cls, shape: tuple[int, ...]) -> Self:
numel = math.prod(shape)
tensor = InterpretedTensor((numel,), [0.0]*numel).reshape(shape)
return tensor
@classmethod
def ones(cls, shape: tuple[int, ...]) -> Self:
numel = math.prod(shape)
tensor = InterpretedTensor((numel,), [1.0]*numel).reshape(shape)
return tensor
def __init__(self, shape: tuple[int, ...], storage: list[float], inputs: tuple[Self, ...]=(), requires_grad: bool=False) -> None:
self.shape: tuple[int, ...] = shape
self.stride: tuple[int, ...] = [math.prod(shape[i+1:]) for i in range(len(shape))] # row major, and math.prod([]) produces 1
self.storage: list[float] = storage
self.inputs: tuple[Self, ...] = inputs
self._backward = lambda: None # callers override after init with self captured in closure
self.grad: InterpretedTensor = InterpretedTensor.zeros(shape) if requires_grad else None # python can recursively type (no need for Box<_>) bc everything is a heap-allocated reference
@property
def numel(self): return math.prod(self.shape) # np (and thus jax) call this .size
@property
def ndim(self): return len(self.shape)
@property
def T(self) -> Self:
assert self.ndim == 2
m, n = self.shape
t = InterpretedTensor((n, m), self.storage)
t.stride = [self.stride[1], self.stride[0]]
return t
def reshape(self, shape: tuple[int, ...]) -> Self:
self.shape = shape
self.stride = [math.prod(shape[i+1:]) for i in range(len(shape))] # math.prod([]) produces 1
return self
def __repr__(self) -> str:
return f"InterpretedTensor({self.chunk(self.storage, self.shape)})"
@staticmethod
def chunk(flat, shape):
if len(shape) == 1: return flat[:shape[0]]
size = len(flat) // shape[0]
return [InterpretedTensor.chunk(flat[i*size:(i+1)*size], shape[1:]) for i in range(shape[0])]
# backward f'(x)
@staticmethod
def topo(node: InterpretedTensor, seen: set[InterpretedTensor], output: list[InterpretedTensor]) -> None:
if node in seen: return
seen.add(node)
for input in node.inputs: InterpretedTensor.topo(input, seen, output)
output.append(node)
def backward(self) -> None:
seen, topologically_sorted_expression_graph = set(), []
InterpretedTensor.topo(self, seen, topologically_sorted_expression_graph)
self.grad = InterpretedTensor.ones(self.shape) # base case
for tensor in reversed(topologically_sorted_expression_graph): tensor._backward()
# forwards f(x)
def __radd__(self, other: Self) -> Self: return self.__add__(other)
def __add__(self, other: Self) -> Self:
n, alpha = self.numel, 1
x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*(n))
teenygrad.eagkers.cpu.saxpy(n, alpha, x, y) # y=axpy
requires_grad = self.grad is not None or other.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad
other.grad += output_tensor.grad
output_tensor._backward = _backward
return output_tensor
def __rmul__(self, other: Self) -> Self: return self.__mul__(other)
def __mul__(self, other: Self) -> Self:
n = self.numel
x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*n)
teenygrad.eagkers.cpu.smul(n, x, y, z)
requires_grad = self.grad is not None or other.grad is not None
output_tensor = InterpretedTensor(self.shape, list(z), (self, other), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad * other
other.grad += output_tensor.grad * self
output_tensor._backward = _backward
return output_tensor
def __neg__(self) -> Self:
n = self.numel
x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
teenygrad.eagkers.cpu.saxpy(n, -1, x, y)
requires_grad = self.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
def _backward():
self.grad += -output_tensor.grad
output_tensor._backward = _backward
return output_tensor
def __sub__(self, other: Self) -> Self:
n = self.numel
x, y = array.array('f', other.storage), array.array('f', self.storage)
teenygrad.eagkers.cpu.saxpy(n, -1, x, y)
requires_grad = self.grad is not None or other.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad
other.grad += -output_tensor.grad
output_tensor._backward = _backward
return output_tensor
def tanh(self) -> Self:
n = self.numel
x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
teenygrad.eagkers.cpu.stanh(n, x, y)
requires_grad = self.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad * (InterpretedTensor.ones(self.shape) - output_tensor * output_tensor) # f(x) = tanh(x) ==> f'(x) = 1 - tanh(x)^2
output_tensor._backward = _backward
return output_tensor
def __rmatmul__(self, other: Self) -> Self: return other.__matmul__(self) # GEMM does not commute: AB != BA
def __matmul__(self, other: Self) -> Self:
3.3 Implementing Basic Linear Algebra with Tensor.__add__, and Tensor.__matmul__
3.4 From Virtual to Physical Machines: Rust Translation and RISC-V Evaluation
While we now have a correct implementation of numpy’s multidimensional np.ndarray with teenygrad.Tensor,
there is still exists a large difference between the two, primarily boiling down to performance.
Let’s take the example of solving a linear system where
, , and ,
with the the number of equations (the size of the dataset) and
the number of unknowns (the number of dimensions) being large.
(todo, or linear least squares?):
import numpy as np
import teenygrad as tg
The running time with np.ndarray is X, whereas with tg.Tensor it’s Y.
This is because although you are programming numpy with Python,
underneath the hood it’s core basic linear algebra subroutines are accelerated using native software,
from a library called (for lack of a better term), the basic linear algebra subroutines*
(abbreviated here on in as the BLAS).
The BLAS has a longstanding history dating back
That is, programs whose instructions are executed by the physical machine, rather than a virtual machine.
In order to accelerate tg.Tensor, we must dive deeper in gaining a mechanical empathy for the soul of the machine.
This is where programmers differ from mathematical cousins — not only do we concern ourselves about correctness, but also that of performance —
making us
You have to be kind to the
carcomputer. You feel the poor thing groaning underneath you. If you’re going to push a piece of machinery to the limit, and expect it to hold together, you have to have some sense of where that limit is. Out there, is theperfect lapspeed of light. No mistakes: everygear changeload, everycornerstore. Perfect. Do you see it? Most people don’t.
To build SITP’s capstone framework project of teenygrad, you will need to accelerate the
training and inference of the capstone model project nanochat with many core parallel processors known as GPUs
in Part II’s Chapter 5. Accelerating Sequence Models on GPU in teenygrad with CUDA Rust and PTX.
To prepare you for such massively parallel programming, in the remainder of Chaoter 3 you will
accelerate the basic linear algebra subroutines of teenygrad’s tg.Tensor in order deeply understand the following question:
Q: How do high performance software libraries accelerate their speeds?
A: By accelerating the computation of pipelines and the communication of hierarchies.
Historically speaking, the way in which biological humans first started communicating with digital computers was by writing programs directly in a machine language.
machine programming language: assemblers high level programming languages: fortran, cobol, algol systems programming language: C (only survivor) virtual machine and jits: euluer/pascal, java, javascript, python
The core idea behind the borrow checker is that variables have three kinds of permissions on their data:
- Read (R): data can be copied to another location.
- Write (W): data can be mutated in-place.
- Own (O): data can be moved or dropped.
3.5 Accelerating AXPY and GEMV on CPU via Rooflines
/// SGEMM with the classic BLAS signature (row-major, no transposes):
/// C = alpha * A * B + beta * C
fn sgemm(
m: usize, n: usize, k: usize,
alpha: f32, a: &[f32], lda: usize,
b: &[f32], ldb: usize,
beta: f32, c: &mut [f32], ldc: usize) {
assert!(m > 0 && n > 0 && k > 0, "mat dims must be non-zero");
assert!(lda >= k && a.len() >= m * lda);
assert!(ldb >= n && b.len() >= k * ldb);
assert!(ldc >= n && c.len() >= m * ldc);
for i in 0..m {
for j in 0..n {
let mut acc = 0.0f32;
for p in 0..k { acc += a[i * lda + p] * b[p * ldb + j]; }
let idx = i * ldc + j;
c[idx] = alpha * acc + beta * c[idx];
}
}
}
fn main() {
use std::time::Instant;
for &n in &[16usize, 32, 64, 128, 256] {
let (m, k) = (n, n);
let (a, b, mut c) = (vec![1.0f32; m * k], vec![1.0f32; k * n], vec![0.0f32; m * n]);
let t0 = Instant::now();
sgemm(m, n, k, 1.0, &a, k, &b, n, 0.0, &mut c, n);
let secs = t0.elapsed().as_secs_f64().max(std::f64::MIN_POSITIVE);
let gflop = 2.0 * (m as f64) * (n as f64) * (k as f64) / 1e9;
let gflops = gflop / secs;
println!("m=n=k={n:4} | {:7.3} ms | {:6.2} GFLOP/s", secs * 1e3, gflops);
}
}
Thus far in our journey we’ve successfully made the transition
from programming software 1.0 which involves specifying of discrete data structures — like sets, associations, lists, trees, graphs, — with the determinism of types
to programming software 2.0 which involves reovering continous data structures — like vectors, matrices, and tensors — with the stochasticity of probabilities.
If you showed the mathematical equations for your models, loss functions, and optimizers to our mathematical cousinsIn particular, the physicists whom realized describing reality as minimizing free-energy was fruitful, such as Helmholtz with energy, Gibbs with enthalpy, and Boltzmann with entropy. who also made the same transition,
they would understand the mathematical equations and even algorithmsas until recently, algorithms were understood to be a sequence of instructions to be carried out by a human. i.e Egyptian Multiplication, Euclid’s Greatest Common Divisor.
But they wouldn’t understand how the code we’ve programmed thus far with Python and Numpy is being automatically calculated by the mechanical machine we call computers.
For that we need to transition from users of the numpy framework to implementors our own framework, which we’ll call teenygrad.
This requires us moving from the beautiful abstraction of mathematics to the assembly heart and soul of our machines.
This book uses RustTo backtrack to familiarize yourself with Rust, take a look at the experimental version of The Rust Programming Language, by from Will Crichton and Shriram Krishnamurthi, originally written by Steve Klabnik, Carol Nichols, and Chris Krycho., but feel free to follow along with C/C++In that case, take a look at “The C Programming Language by Brian Kernighan and Dennis Ritchie. — what’s fundamental to the purpose of accelerating said basic linear algebra subroutines on the CPU is
is choosing a language that 1. forgoes the dynamic memory management overhead of garbage collection and
2. has a compiled implementation which allows us to analyze and tune the actual instructions which an underlying processor executes.
Using Rust will allow us to start uncovering what is really happening under the hood of accelerated Tensors
To sidetrack to familiarize yourself with systems programming in the context of software 1.0 , read the classic of Computer Systems A Programmer’s Perspectivewith instruction set architectures, microarchitecture, memory hierarchies
- instruction set architecture
- computation: control path, data path
- communication: memory, input/output
In the last chapter we developed teenygrad.Tensor by virtualizing physical 1D storage buffers into
arbitrarily-shaped logical ND arrays by specifying an iteration space with strides,
and started our journey in accelerating the basic linear algebra subroutines defined on the Tensor with Rust,
which speeds up the throughput performance of SGEMM from 10MFLOP/S to 1GFLOP/S.
Simply executing a processor’s native code — referred to as assembly code — rather than Python’s bytecode
results in an increase of two orders of magnitude.
Now that the code that is being executed is native assembly,
we can dive deeper into the architecture of the machine in order to reason and improve
the performance of teenygrad’s BLAS.
3.6 Accelerating Communication with Hierarchies via Loop Reordering, Register and Cache Blocking
3.7 Accelerating Computation of GEMM with Instruction Level Parallelism via Loop Unrolling

3.8 Summary
One quick way to summarize the milestones in high performance computing, compilers and architecture is to list the Turing Award winners: Alan Perlis (1966) for his influence on advanced programming techniques and compiler construction, including his role in designing ALGOL and establishing the discipline of programming languages as a field; John Backus (1977) for designing FORTRAN — the first high-level language to achieve widespread practical adoption — and formalizing language syntax through Backus-Naur Form; Tony Hoare (1980) for axiomatic semantics, giving programmers a formal logical framework for reasoning about program correctness; Niklaus Wirth (1984) for designing a sequence of clean, teachable languages — EULER, ALGOL-W, Pascal, and Modula — that shaped how programming languages are structured and implemented; John Cocke (1987) for pioneering optimizing compilers and the Reduced Instruction Set Computer (RISC) architecture, showing that simpler instruction sets allow faster hardware; William Kahan (1989) for fundamental contributions to numerical analysis, most consequentially the IEEE 754 floating-point standard that made reliable numerical computation reproducible across hardware; Frederick Brooks (1999) for landmark contributions to computer architecture — most notably the IBM System/360 — and for articulating the enduring lessons of large-scale software engineering; Frances Allen (2006) for foundational contributions to the theory and practice of optimizing compilers, including dataflow analysis and the program dependence graph; John Hennessy and David Patterson (2017) for a systematic, quantitative approach to designing and evaluating computer architectures, whose RISC principles underpin billions of processors and the open RISC-V standard; and Alfred Aho and Jeffrey Ullman (2020) for foundational contributions to programming language theory and compiler construction, most durably codified in the Dragon Book; and finally,Jack Dongarra (2021) for pioneering the numerical libraries — BLAS, LAPACK, and MPI — that became the substrate of high-performance scientific computing and modern deep learning accelerators;