
Structure and Interpretation
of Tensor Programs
The Hacker's Accelerated Introduction to Deep Learning and Deep Learning Systems
Runnanochatby buildingteenygradfrom scratch: the bridge from microgradto tinygrad
Made with 🖤 by Jeffrey Zhang, University of Waterloo (BMath)
First Edition (Draft).
You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.
Dedication

In loving memory of my father, my teacher, and my best friend Dr. Thomas Zhang ND., R.TCMP, R.Ac.
The love I put into this book is but a fraction of the love he gave me.
May you rest in pure land. We’ll meet again dad.
We’ll Meet Again by Vera Lynn 1939. Cover by Johnny Cash 2002.
Presenting an early outline of SITP at Toronto School of Foundation Modeling Season 1 (November 2025)
Preface
The Structure and Interpretation of Tensor Programs
This book is aspirationally titled The Structure and Interpretation of Tensor Programs, (from here on in abbreviated as SITP) as it’s goal is to serve a similar role for software 2.0 as The Structure and Interpretation of Computer Programs (from here on in abbreviated as SICP) did for software 1.0. Written by Harold Abelson and Gerald Sussman with Julie Sussman, SICP took learners on a whimsical whirlwind tour throughout the essence of computation starting with the elements of programs with functional programming, higher order functions, data abstraction, streams, and ending with programming their own programming languages with interpreters, compilers, and register machines.
My alma matter was amongst those which took the SICP approach, and as intended, for someone coming into first year college with high school computer science, it blew my mind. After graduating college in 2022, I followed my curiosity for diving deeper into the souls of our machine by going on to developing industrial languages and runtimes.“There is only one project, architecture, operating system and languages, compiler, it’s only one project. It’s all together.” – Boris Babayan. Particularly, I hacked on domain specific cloud compilers cloud provisioners, and cloud garbage collectors. At the end of 2022 though, when ChatGPT was released by OpenAI my mind was blown twice more. As someone programming since high school, I could not believe this at all. After two more years of hacking on cloud languages and runtimes, I started my transition from domain specific cloud compilers to domain specific tensor compilers.
1.5k lines of rust and 100 commits later, we can now inference the FFN neural language model from (Bengio et al. 2003) straight from Karpathy's Zero to Hero. all you have to do is replace the single "import torch" line with "import picograd" 😎 https://t.co/8paCERz3ry pic.twitter.com/iVKOCsg0zC
— Jeffrey Zhang (@j4orz) April 2, 2025
The transition started with a tweet showcasing the beginnings of a tensor library evaluating the forward pass of a feed forward network
from Andrej Karpathy’s Neural Networks: Zero to Hero course.
While it was illuminating to start implementing each individual torch call that the nets from makemore were making,
my knowledge felt quite fragmented as I forgot a lot of the foundational mathematics I saw in a single semester,
and I wasn’t sure how to bridge myself to industrial deep learning systems like tinygrad, torch, jax, vllm, and sglang.
Coloquially speaking, I was a neural network script kiddie.
Shortly after, I decided to take the plunge and started drinking from the firehose all the mathematical foundation I’ve since forgotten. While revisting preliminary foundation like Strang (1988), Nocedal, Wright (1999), Boyd, Lieven, Vandenberghe (2004) and reading deep learning cannon like Russel, Norvig 1995, Sutton, Barto (1992), Hastie Tibshirani (2001), Goodfellow, Bengio, Courtville (2016), Murphy (2022), the one thought I could not get out of my head was where is the SICP for software 2.0? While I found two excellent resources on building your own torch-like autograd by Tianqi Chen at Carnegie Mellon and Sasha Rush at Cornell, I personally would have really enjoyed a unified resource that took me from math, to deep learning, to deep learning systems in a single unbroken sequence of thought, and perhaps others would feel similarly. That is the genesis story for this book, whose central research question is the following: What should the SICP for Deep Learning look like?
We really could use a SICP for DL. We have the Little Lisper for DL (https://t.co/su31hFJeUe) but that's a different type of book entirely.
— Shriram Krishnamurthi (primary: Bluesky) (@ShriramKMurthi) May 3, 2026
The Structure and Interpretation of the AI Curriculum
goal: teach software 1.0 programmers software 2.0 with software 3.0
An interesting data point is that Codex 5.5 cannot be trusted to design good data structures purely from behavioral prompting. (I'm sure it can come up with good ideas if you prompt it, but not if it's incidental.)
— difficultyang (@difficultyang) May 15, 2026
This post was prompted by Codex coming up with a terrible internal data representation for an autograd tape with some special checkpointing behavior
— difficultyang (@difficultyang) May 15, 2026
Developing GPT is also highly non-trivial, but being able to develop PyTorch requires knowledge of a lot of math and science: calculus, linear algebra, statistics, optimization theory, neural network architecture, electrical engineering, software design, hardware programming,…
— Sebastian Raschka (@rasbt) November 15, 2025
constraints
- sicp style
- runs nanochat
- consolidates gpu mode lectures
- compiles a subset of tinygrad IR
methods
- curriculum
- pedagogy
- language
This work turned out in retrospect to be the seeds of SITP’s core with Part II. Neural Networks which covers the 2012-2020 “era of research” consisting of two chapters:
- Chapter 4. Learning Sequences from Data with Deep Neural Networks
- Chapter 5. Accelerating Sequence Models on
GPU–>
So in Part I. Elements of Networks, readers learn the prelimaniries for “pre-historic” machine learning:
- Chapter 1. Representing Data with High Dimensional Stochasticity
- Chapter 2. Learning Functions from Data with Parameter Estimation
- Chapter 3. Accelerating Functions and Data on
CPU
And in Part III. Scaling Networks, readers learn about the 2020-2025 era of scaling:
- Chapter 6. Large Language Models
- Chapter 7. Reasoning Models
- Chapter 8. Fusion Compilers
- Chapter 9. Inference Engines
Acknowledgements
Thank you to Lambda Labs for the Lambda Labs Research Grant. Thank you to a Cloud-V 10X Engineers and (RISC-V Labs).
Tower of Babel, Genesis 11:1–9
Prologue
In some sense, the 21st century truly began only after the first 20 years past the second millenium, for it was not until the creation of ChatGPT where humanity traded in their so-called bicycles of mind for motorcycle upgrades. From 2020 to 2025, programmers discovered The Scaling LawsFor some popular accounts, see Gwern and Patel, Leech (2025), where pouring internet-scale data into the weights of transformer neural networks with massively parallel and distributed compute produces large language models. In turn these marvelous machines have finally enabled communication between natural organisms and artificial machines at a higher level of abstraction than before, through the means of natural language. It’s not an overstatement to claim that the hottest new programming language is Englisha tweet by Karpathy, Jan 24 2023.
Talking to machines with natural language as opposed to more formal programming languages has been a long standing dream in the science of the mind we call artificial intelligence, which is one of the oldest and arguably most important projects of mankind. Although the methods to build an artificial intelligence will take the remainder of this book to exposit, the goal of the discipline is simply put as building machines that can think.
As usual with most intellectual ideas, many characteristics of artificial intelligence can be traced back to Aristotle, however the philosophical sketches of the field proper arguably started with Descartes, extended by La Mettrie’s Man A Machine, initiated by Leibniz’s Universal Calculus, and applied computationally with Wittgenstein’s Tractatus Logico-Philosophicus and Philosophical Investigations.
The story of artificial intelligence is tightly interconnected with computation, given that the field as we know it today started in earnest during the 20th century at the 1956 Dartmouth Summer Research Project on Artificial Intelligence. There, a group of resesearchers interested in the science of the mind went on to establish the field of “artificial intelligence” a rebranding of the disciplinetainted by the hermaneuticism of psychoanalysis with computational methodsoperationalizing the notion of computationalism rather than the correlative methods of neurosciencesee Could a Neuroscientist Understand a Microprocessor? (Jonas, Kording 2017) and the observational methods of psychology’s behaviorism. Their reasoning, (roughly and reductively) consists of the following:
Since Gödel’s Incompleteness Theorems state that there exist propositions unprovable, and the Church-Turing Thesiswhich SITP’s spiritual predecessor SICP smuggles in through a footnote in chapter 4.1 The Metacircular Evaluator states that all representable languages implementable with computation have the same expressivity, if we ever wanted to physically realize non-biological artificial intelligence, the constructive stateful mathematics on Turing Machines implemented with von Neumann Architectures via electricity and semiconductors are the correct substrate to conduct the science of the mind, as opposed to classical stateless mathematics.
Although united by the idea of using computation to mechanize the mind and thus serve as the basis of artificial intelligence, they were divided between which exact computational techniques and to employ. The two prominent campsothers include embodimentalists, dynamicalists, and self-organizers, which arguably is the next frontier of the discpline. i.e robotics, self-reproducing robotics at the time being the symbolists and the connectionists which in some sense are the primordial “software 1.0” and “software 2.0” we now know today, and in fact, this distinction already explored by our mathematical cousins, when physicists transitioned from using the deterministic logic of Aristotle to to stochastic one of Laplace. More technically, Baye’s Theorem generalizes Aristotelian Logic as a corner case with the probability of some belief given evidence is 1.
With the way in which the field played out, the logical approach with symbolic techniques to artificial intelligence started out as the favorite school of thoughtthus known as “classical” AI or “good-old-fashioned” AI as opposed to the probabilistic approach largely summarized and arguably due to the 1969 book Perceptrons by Marvin Minsky and Seymour Papert. The logical approach used logical tools from logicianslike Frege, Tarski, Brouwer, Gentzen, Curry-Howard, Martin Löf, Girard, and so on to create expert systems such as those of ELIZA. Overtime however, largely due to the continually increasing capability of hardware, the probabilistic approach with machine learning techniques started to see some more success, and empirically validate the idea that GOFAI techniques do not describe large combinatorial phenomena well. (todo: wittgenstein’s tractatus vs PI predicted GOFAI not being able to represent phenomena with too many parts to count)
Ironically enough, although these people whom gather under the umbrella of “connectionists” were united in the idea of using probabilistic learning methodsfollowing the same principle of modeling phenomena with mathematical models where the system minimizes free energy which experienced large success in the 19th century when physicists such as Helmholtz, Gibbs, and Boltzmann modeled energy, enthalpy, and entropy, they themselves were divided between which exact models to employ. The three prominent ones being gaussian processes, kernel machines, and neural networks. Theoretically, (kernel machines todo), but practically neural networks when made deep are able to learn representations. The true watershed moment was during 2012 when a neural network named AlexNet trained on a parallel graphics processor was released, scoring a loss on the ImageNet dataset 10.8% better than the next runner-up.
The 2012-2019 period in deep learning is now becoming known as the era of research, as diverse and various inductive biases were explored through the means of network architectures, resulting in different neural networks such as feedforward neural networks, convolutional neural networks, recurrent neural networks, long-short-term-memory neural networks, and so on, up until the scaling the attention mechanism and feedforward nets inside the transformer archicture started gaining dominance in the 2020-2025 period, now known as the the era of scaling, which brings us back to the present day.

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.
WeThe following is a modified excerpt from The Structure and Interpretation of Computer Programs Chapter 1: Building Abstractions with Procedures are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of
rules called a programparameters called a model. Peoplecreate programsrecover models to direct processes. In effect, we conjure the spirits of the computer with our spells.
A computational process is indeed much like a sorcerer’s idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer’s spells. They are carefully composed fromsymbolicnumerical expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform.
I. Elements of Networks
Although separated by over 2000 years, the programmers of Silicon Valley face a daunting task quite similar to the one encountered by the mathematicians of Ancient Greece. That is, to continue contributing towards the civilizational progress of augmenting and amplifying human intelligence, they must climb back down from their current summit and backtrack to the beginner’s mind they once had.
Not different from learning another mathematical or programming language, they must transition from their finitely discrete structures and deterministic procedures tooling they have grown acustomed to and make the transition to the infintely continuous structures and stochastic procedures. Back then, ancient greek mathematicians were only comfortable with the finiteness of natural numbers like , , and , and grappled with the infinite of the real numbers such as , , and . Similarly, the programmers of today are being asked to transition from programming algorithms of sets, lists, associations, trees, and graphs to the distributions of scalars, vectors, matrices, tensors, and neural networks.
More coloquially, programmers interested in the deep learning approach to artificial intelligence must make the transition from software 1.0software 1.0 to software 2.0software 2.0See (Karpathy 2017), a distinction used to differentiate the classical act of programming software line by line, and the newer approach of programming software by specifying a dataset, a neural net architecture with a goal, and searching the space of programs with compute. How to exactly program with this new approach will take the remainder of the book to explain.
While software 2.0 has increased the intelligence and autonomy of our devices throughout the past decade
— to name a few, language translation with Google Translate, speech recognition with Apple’s Siri, vision with Tesla Autopilot —
at the end of 2022 ChatGPT was released to the world which if a singular point had to be defined, was the true beginning
of software 3.0software 3.0See (Andrej Karpathy 2025),
programming at the highest abstraction level possible with none other than English.
What may be surprising to realize at first glance is that artificial intelligence like ChatGPT is “just” another computer program.
However, rather than being implemented in a language like C, Java, or Javascript, it’s implemented in one that goes by the name of PyTorch,
a software 2.0 programming language centered around the multidimensional array data structure known as torch.Tensor, humbly embedded within a Python package.
In this whirlwind tour dubbed The Structure and Interpretation of Tensor Programs (SITP), we will embark on a quest to build from scratch
our own deep neural network like ChatGPT by implementing nanochat
and our own deep learning framework like PyTorch by implementing teenygrad.
Whether you’re an eager high school student, an up coming college student, or a battle-tested industry programmer,
SITP has been meticulously designed so that the only prerequisite required is a basic familiarity with the art of programming, and high school algebra. Any additional experience is helpful, but not mandatory.
So with that all said, go on young hacker. Venture forth!
Table of Contents
- 0. From Symbolic Software 1.0 to Stochastic Software 2.0
- 0.1 From Psychology to Artificial Intelligence
- 0.2 Weizenbaum’s Turing Test Cheating with the Pattern Matching of
ELIZA - 0.3 Winograd’s Understanding with the Syntactic and Semantic Analysis of
SHRDLU - 0.4 Feigenbaum’s Narrow Expertise with the _ of
DENDRAL - 0.5 Lenat’s Scruffiness with the Ontology and Inference of
CYC - 0.6 Wittgenstein’s Bitter Lesson: From A Logical to Distributional Semantics
- Intermezzo One: The Language of Dependent Type Theory
- 1. Syntactic Sequence Learning with n-gram Models
- 1.2 From Certain to Uncertain Knowledge with Probabilities
- 1.3 Composing Probabilities with Sums, Products, and Conditionals
- 1.4 Random Variables and their Distributions, Expectations, and Variance
- 1.5 Analytical Probabilities from Data by Counting
- 1.6 From Evaluating to Recovering Probabilities with Statistics
- 1.7 Summary
- 1.8 Bibliographic Notes
- Intermezzo Two: The Language of Probability Theory
- 2. Semi-Automated Semantic Sequence Learning with Linear Models
- 2.1 From Representing Discrete Data to Dimensional Data
- 2.2 Predicting Scalars by Linearly Modeling Regression
- 2.3 Directly Fitting Linear Regression with Squared Error Loss
- 2.4 Predicting Categories by Linearly Modeling Classification
- 2.5 Iteratively Fitting Logistic Regression with Cross Entropy Loss
- 2.6 Empirical Risk Minimization is Maximum Likehood Estimation
- 2.7 Summary
- 2.8 Bibliographic Notes
- Intermezzo Three: The Language of Matrix Calculus
- 3. From the Array of IPL to the Multidimensional Array of APL
- 3.1 From Abstract to Numerical Linear Algebra: Real to Floating Arithmetic
- 3.2 Virtualizing Logical Shape to Physical Storage with Strides
- 3.3 Implementing Basic Linear Algebra
- 3.4 From Virtual to Physical Machines: Translation and Evaluation
- 3.5 Accelerating
AXPYandGEMVwith Rooflines - 3.6 Accelerating Communication with Hierarchies
- 3.7 Accelerating Computation with Pipelines
- 3.8 Summary
- 3.9 Bibliographic Notes
- Intermezzo Four: The Language of Rust and RISC-V
0. From Symbolic Software 1.0 to Stochastic Software 2.0
In which we retrace the development of the discretely symbolic approach to building artificially intelligent machines with common sense, and ultimately, it’s failure to describe a reality with too many parts to count, requiring us to transition from deterministically discrete software 1.0 to the stochastically continuous software 2.0.
0.1 From Psychology to Artificial Intelligence
The study of the mind is no different from that of mathematics and music — although forms can change throughout time, their substances remain eternal. In mathematics, representations for arithmetic have evolved from dashes on cave walls, to roman numerals we can still find in books, and finally to modern position-based hindu arabic numerals; In music, representations for pitch have evolved from neumes, relative staffs, and finally to the five-line staff; With the mind, representations for intelligence have evolved from stimulus-response to neural networks, which happened with the birth of the discpline known as artificial intelligence at the Dartmouth Workshop in 1956.
There, a group of researchers unsatisfied with the representations that the discipline of psychology were using to conduct the study of mind came together to discuss an alternate approach. Practical applications are usually a result of inquiry which is usually philosophical, esoteric, and with no immediately economic value. Namely, computers with automating mathematics, and language models with mechanizing the mind. Namely, to use the new way of thinking and conducting science endowed by computers which involves describing complex phenomena by simulating it with procedural epistemologyprocesses. In it’s proposalSee A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence (McCarthy, Minsky, Rochester, Shannon 1955):
We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.
It’s safe enough to say that 20 person months was overly optimistic on the effort required.
In chapter 0, we will walk through early attempts to build artificial intelligence using symbolic methods.
- Marvin Minsky (1969) and John McCarthy (1971) for defining the foundations of the field based on representation and reasoning;
- Allen Newell and Herbert Simon (1975) for symbolic models of problem solving and human cognition;
0.2 Cheating the Turing Test with Pattern Matching of ELIZA
Tip
Try interacting with both ELIZA and ChatGPT up above to viscerally feel the limitations encountered, overcome, and presented to the world in 2022. To provide a rough reference, the manufacturing of the Boeing 747 started and ended around the same time period.
The original paper that presents ELIZASee A Computer Program For the Study of Natural Language Communication Between Man and Machine (Weizenbaum 1966), begins and ends by explaining the inner workings of the chatbot at a high level:
The gross procedure of the program is quite simple. Input sentences are analyzed on the basis of decomposition rules which are triggered by key words appearing in the input text. Responses are generated by reassembly rules associated with selected decomposition rules.
So at a high level ELIZA interprets language by pattern matching over certain keyword strings and then producing a response based on predefined reassembly rule. Conveniently, Weizenbaum provides a more concrete example further in the paper:
Consider the sentence “I am very unhappy these days”. Suppose a foreigner with only a limited knowledge of English but with a very good ear heard that sentence spoken but understood only the first two words “I am”. Wishing to appear interested, perhaps even sympathetic, he may reply “How long have you been very unhappy these days?”
What he must have done is to apply a kind of template to the original sentence, one part of which matched the two words “I am” and the remainder isolated the words “very unhappy these days”. He must also have a reassembly kit specifically associated with that template, one that specifies that any sentence of the form “I am BLAH” can be transformed to “How long have you been BLAH”, independently of the meaning of BLAH.
A somewhat more complicated example is given by the sentence “It seems that you hate me”. Here the foreigner understands only the words “you” and “me”; i.e., he applies a template that decomposes the sentence into the four parts:
(1) It seems that (2) you (3) hate (4) me
of which only the second and fourth parts are understood. The reassembly rule might then be “What makes you think I hate you”; i.e., it might throw away the first component, translate the two known words (“you” to “I” and “me” to “you”) and tack on a stock phrase (“What makes you think”) to the front of the reconstituted sentence.
This concrete example shows the inspiration and guiding principles for both the data structures to and algorithms used to represent ELIZA’s knowledge and reasoning respectively.
First, not only does the foreigner representing language with symbolic strings, but the foreigner also has a limited, fractional understanding of such vocabulary. So, ELIZA will represent it’s knowledge with strings.
This is quite similar from the tradition in philosophy which represents word meaning by simply spelling the word with capital letters.
The illustrative example is provided by linguist Barbara Partee (1967) where the meaning of life is (represented with) the string "LIFE".
Second, the foreigner produces an answer by transforming a matched template to an incoming sentence with a pre-defined reassembly kit. So, ELIZA will reason by pattern matching over such strings with if-then transformation rules.
Let’s reproduce some of the example scripts presented in the paper’s appendix by re-implementing it’s pattern-match-over-string functionality with Python systematically following How to Design Programs’ Design Recipe. We will walk through the recipe step by step for this first example — all following code examples in the book will compress such process. First, words will be represented with Python strs. So second, the function definition will look like:
def eliza(input: str) -> str:
return NotImplementedError("todo")
Important
Run code examples by clicking the play button in the top right hand of the code snippet.
Third, some examples. Let’s start with the transformation rule where if the user prompts with "I am unhappy" then our mini ELIZA responds with "Why do you say you are unhappy?". Python’s standard library conveniently provides the capability to test interactive examples with docttest. Evaluating the code fails, as expected:
def eliza(input: str) -> str:
"""
>>> eliza("I am unhappy")
'Why do you say you are unhappy?'
>>> eliza("I am upset")
'Why do you say you are upset?'
"""
return NotImplementedError("todo")
if __name__ == "__main__":
import doctest
doctest.testmod()
Although ELIZA’s matching and transformation rules were implemented using decomposition and reassembly
rulesWeizenbaum was a programming language implementor as well, for he also was the creator of the SLIP programming language itself, presented in Symmetric List Processor (Weizenbaum 1963).,
for the fourth and fifth steps, rather than implement our own string matching algorithm,
we will reuse regular expressionsregular expressions
which are highly specialized domain specific languages, namely, in the domain of matching text,
and are are provided by the standard libraries of various languages such as Python, Perl, and Java.
They were originally conceived as notation by Stephen Kleene to formally define the class of regular languages,
regular expressions became mainstream amongst programmers when Ken Thompson built in Kleene’s notation into the QED editor (and later ed)
in order to implement search eventually leading to the commonly used grep, standing for (g)lobal (r)egular (e)xpression and (p)rint.
For Python specifically, regular expressions are accessible via the re module,
where you re.compile(pattern)
a regular expression into a re.Pattern,
which can subsequently be called with Pattern.search(string), Pattern.match(string) and Pattern.fullmatch(string) to
return a corresponding re.Match or None.
For example:
import re
prog = re.compile(r"I am unhappy")
result1 = prog.fullmatch(r"I am unhappy")
result2 = prog.fullmatch(r"You are unhappy")
print(f'{result1=}')
print(f'{result2=}')
However if the pattern is only going to be used once without any reuse, you can evaluate the re.Pattern and re.Match objects with one call:
import re
result = re.fullmatch(r"I am unhappy", "You are unhappy") # returns None
print(f'{result=}')
In the case of ELIZA, we need some way of referencing whatever the user is self reporting so that our system can respond asking why
so that our doctest examples pass. We can do this with a capture group.
todo, after fri/sat/sun
import re
Note
Although we are embarking on a quest to re-implement a Tensor compiler like PyTorch with teenygrad, because the language is embedded into Python, no text parsing is actually required like that of classical compilers (albeit PyTorch’s fusion compiler does have an elaborate way of hooking into the host language of Python).
Interestingly enough though, regular expressions are in fact used to build ChatGPT and subsequently nanochat in what is called tokenization. Things come full circle as Stephen Kleene was originally motivated to describe the artificial neural networks of McCullogh and Pitts.(Kleene 1951 p46) So besides illustrating ELIZA’s GOFAI rules-based approach with regular expressions, becoming somewhat familiar with them will also help you with ChatGPT and nanochat’s tokenization process in Part II. Neural Networks
import re
def eliza(input: str) -> str:
"""
>>> eliza("I am unhappy")
'Why do you say you are unhappy?'
>>> eliza("I am upset")
'Why do you say you are upset?'
"""
pattern, response = r"I am (.*)", "Why do you say you are {0}?"
if m := re.fullmatch(pattern, input, re.IGNORECASE): return response.format(*m.groups())
else: return "Please go on."
print(eliza("I am very unhappy these days")) # How long have you been very unhappy these days?
Finally, let’s create a list of transformation rules
import re, random
transformation_rules = [
(r"I am (.*)", ["How long have you been {0}?", "Why do you say you are {0}?"]),
(r"It seems that (.*)", ["What makes you think {0}?", "Why does it seem that {0}?"]),
(r"(.*) hate (.*)", ["Why do you think {0} hates {1}?", "What makes you feel that way?"]),
(r"(.*)", ["Please go on.", "Tell me more.", "Can you elaborate?"]),
]
and subsequently modify the function body to follow the data by looping over such rules:
import re, random
transformation_rules = [
(r"I am (.*)", ["How long have you been {0}?", "Why do you say you are {0}?"]),
(r"It seems that (.*)", ["What makes you think {0}?", "Why does it seem that {0}?"]),
(r"(.*) hate (.*)", ["Why do you think {0} hates {1}?", "What makes you feel that way?"]),
(r"(.*)", ["Please go on.", "Tell me more.", "Can you elaborate?"]),
]
def eliza(input: str) -> str:
"""
>>> eliza("I am unhappy")
'Why do you say you are unhappy?'
>>> eliza("I am upset")
'Why do you say you are upset?'
"""
for pattern, responses in transformation_rules:
if m := re.fullmatch(pattern, input, re.IGNORECASE): return random.choice(responses).format(*m.groups())
print(eliza("I am very unhappy these days")) # How long have you been very unhappy these days?
print(eliza("It seems that you hate me")) # What makes you think you hate me?
Although we have only implemented a small fraction of the functionalityThe curious reader whom wishes to explore more transformation rules should play around with the embedded ELIZA up above, or take a look at the Appendix in (Weizenbaum 1963). (which, to be fair, the original ELIZA program is only a few hundred lines of code itself),
it illustrates the simplest discrete data structures with strs and deterministic algorithms with if statements used to implement ELIZA
and is representative of the overall GOFAI approach.
Very specifically, with the last transformation rule, you can see how brittle ELIZA’s so-called “understanding” truly is, following quite closely to the foreigner which Weizenbaum literally uses as inspiration.
The primary reason a simple pattern matcher over strings can be endowed with human “understanding” (in other words, why the magic works)
is because of the psychiatric context — especially the Rogerian one with person-centered therapy — where users are effectively talking with oneselves.
Weizenbaum goes on to report that:
This mode of conversation was chosen because the psychiatric interview is one of the few examples of categorized dyadic natural language communication in which one of the participating pair is free to assume the pose of knowing almost nothing of the real world. If, for example, one were to tell a psychiatrist “I went for a long boat ride” and he responded “Tell me about boats”, one would not assume that he knew nothing about boats, but that he had some purpose in so directing the subsequent conversation. It is important to note that this assumption is one made by the speaker. Whether it is realistic or not is an altogether separate question. In any case, it has a crucial psychological utility in that it serves the speaker to maintain his sense of being heard and understood.
While ELIZA’s pattern matching over string methodology with regular expression capture groups and substitutions can perhaps be argued to be sufficient in the psychiatric context, it truly breaks down in others. Even though we can justify this post hoc by simply comparing ELIZA to that of ChatGPT, there is an a priori principle which helped the practitioners of the field to predict ex ante that ELIZA was not the end all be all of a natural language processing system. That principle is the distinction between syntactical formsyntax and semantic meaningsemantics, which practitioners (including Weizenbaum himself!)In a 1978 interview, “Well, I would deny that that there’s any important sense, non-negligible sense in which the program understands. It certainly creates the illusion of understanding. there’s no question about that. But we have to understand that that illusion is an attribution that the person conversing with the program contributes to the conversation. It’s not a function of the program itself.” understood that ELIZA is closer to the former. That is, the system can be described as one with “all style but no substance”:
The ELIZA program itself is merely a translating processor in the technical programming sense. Gorn [2] in a paper on language systems says: ‘Given a language which already possesses semantic content, then a translating processor, even if it operates only syntactically, generates corresponding expressions of another language to which we can attribute as “meanings” (possibly multiple — the translator may not be one to one) the “semantic intents” of the generating source expressions; whether we find the result consistent or useful or both is, of course, another problem.’
Following the last quote from the Weizenbaum paper, ELIZA was indeed not simply a one to one translating processor but a one to many.
Before jumping ship from software 1.0’s good old fashioned discrete data structures and deterministic algorithms,
perhaps to build a semantic natural language processing system like ChatGPT all we need to do the same is increase the expressivity
used for word representations and reasoning? Afterall, all we used were strs and if statements respectively.
Perhaps there’s more to life than "LIFE", and plus, you can hardly call pattern matching an algorithm.
0.3 Understanding with Syntactic and Semantic Analysis of LUNAR and SHRDLU
early q&a systems
- baseball (galaxy. what’s the meaning of life? 42)
- lindsay, buchanan survey: https://dl.acm.org/doi/pdf/10.1145/363707.363732
using first order logic.
- w.a woods
- (1967): https://web.stanford.edu/class/linguist289/woods.pdf
- LUNAR (w.a woods 1973): https://dl.acm.org/doi/pdf/10.1145/1499586.1499695
- https://www.youtube.com/watch?v=vlgVYRXe68Q
- then SHRDLU winograd (1971) https://dspace.mit.edu/entities/publication/3406b826-5a41-4b72-9430-7876c0d5405e
0.4 Feigenbaum’s Narrow Expertise with the _ of DENDRAL
0.5 Minsky’s Society of Mind with Representation and Reasoning of CYC
semantic networks
- 1974 minsky’s framework for representing knowledge
- 1977 bobrow and winograd’s KRL
- 1985 brachman and schmolze KL-ONE
- 1984 CYC paper
0.6 Wittgenstein’s Bitter Lesson: From A Logical to Distributional Semantics
lighthill report (1973)
then expert systems in ’80s (feigenbaum and raj reddy), expert systems being abandoned in ’90s, creating the second winter. Parallel Distributed Processing (Rumelhart and McClelland, 1986)
although obvious posthoc that neural networks, this was all predicated with foresight by wittgenstein.
- from feigenbaum/reddy to pdp
- from the organon (knowledge representation and reasoning with upper ontologies and deductive inference) to norvum organon (occam’s razor)
- from the tractatus to the investigations is effectively the transition from software 1.0 to sofware 2.0
- to understand the claude, we must return to claude
- data science begins where computer science begins
question answering systems eventually incorporated the web as it’s knowledge base, and the field of information retrieval emerged. https://start.csail.mit.edu/index.php
0.7 Summary
One quick way to summarize the software 1.0 approach to AI is to list the Turing Award winners: Marvin Minsky (1969) and John McCarthy (1971) for defining the foundations of the field based on representation and reasoning; Allen Newell and Herbert Simon (1975) for symbolic models of problem solving and human cognition; Ed Feigenbaum and Raj Reddy (1994) for developing expert systems that encode human knowledge to solve real-world problems;
The remainder of the book is dedicated to the software 2.0 approach to AI: Judea Pearl (2011) for developing probabilistic reasoning techniques that deal with uncertainty in a principled manner; Yoshua Bengio, Geoffrey Hinton, and Yann LeCun (2019) for making “deep learning” (multilayer neural networks) a critical part of modern computing; and finally, Richard Sutton, Andrew Barto (2024) for pioneering reinforcement learning in which agents learn by maximizing reward via trial and error
0.8 Bibiliographic Notes
For historical, (Gardner 1958) and (Nilsson 2010) Part II (Chapter 6, 7), Part III (Chapter 11, 13), Part IV (Chapter 18), Part VII (Chapter 26). For GOFAI techniques, (Eisenstein 2018) Chapter 9, 10, 11, 12, 13 (Jurafsky and Martin 2026) Appendix F, G, H, I, J, K, (Bird, Klein, and Loper 2009) Chapter 7, 8, 9, 10 and (Russel and Norvig 2022) Part III (Chapter 7, 8, 9, 10) and Part VI (Chapter 24). For logical foundations, (Brachman and Levesque 2004), (Sowa 2000) and (Genesereth and Nilsson 1987).
rough
In fact, we are in good company as the question was posed all the way back by Alan Turing in his paper Computing Machinery and Intelligence (Turing 1950) as a test to operationalize intelligence, seminating both the fields of artificial intelligence and computer science as we know it today. That same decade marks the official branding of the field at the Dartmouth Workshop (1956), and later that year, a possible solution to build machines that could pass such Turing’s test was proposed by John McCarthysee Oral History of John McCarthy in his paper Programs with Common Sense (McCarthy 1956) where he outlines a hypothetical advice taker program implemented with logically symbolic methods.
Although probabilistically connectionist methods were also explored early on with the artificial neuron — the modelling of a neuron as a weighted sum and threshold function — was invented by Warren McCulloch and Walter Pitts in A Logical Calculus of the Ideas Immanent in Nervous Activity (1943), studied by Marvin Minskysee Oral History of Marvin Minsky shortly after The Society of Mind and the Emotional Machine in his PhD thesis Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain Model Problem (1954) (both before Turing proposed his test), and followed up by Frank Rosenblatt’s composition of such neurons within a single layer neural network i The Perceptron — A Perceiving and Recognizing Automaton (1973), historically speaking the first dominant approach to the artificial intelligence program was the logically symbolic methods espoused by McCarthy.
While it may be clear from today’s vantage point that deep neural networks (more specifically large language models via generative pretrained transformers) are the first systems with common sense that have arguably passsed the aforementioned Turing test, the apriori argument from the logically symbolic school of thought is that modelling individual neurons was similar to a programming model with individual transistors. That is, the former is attempting to describe reality at the wrong abstraction level, and instead, the correct approach is to go straight to the source (so to speak) by replicating the human ability to aquire knowledge through reasoning. This was crisply posited into a single claim known as the physical system hypothesis by Allen Newell and Herbert Simon in Completer Science as Empirical Inquiry: Symbols and Search (1976), These divergent schools of thought to the program of artificial intelligence are respectively known as connectionismconnectionism and symbolismsymbolism. with the latter now coloquially referred to (in a somewhat disparaging manner) GOFAIgood old fashioned AI due to the fact that deep neural networks have been the dominant approach to the artificial intelligence program for the past decade.
As what tends to happen, the dichotomy is moreso a fuzzy spectrum, and their relationship is closer to that of fighting siblings borrowing ideas from one another. Two related tensions which we will explore and transition throughout the book (especially in Part I. Elements of Networks) is in both the mathematical objects and methods used to implement the two different approaches. Objects, meaning the transition from the discretediscrete mathematics of sets, lists, associations, trees, and graphs, to that of the continuouscontinuous mathematics of scalars, vectors, matrices, and tensors. And methods, meaning the continuous objects are in fact non-constructivenon-constructive, and in order to implement them on computers we must constructiveconstructively discretize them with numerical approximations.. By the end of this Part I. Elements of Networks you may wonder if real numbers, are in fact, real.
(start here.)
Returning to the central question of how can we build machines that can understand and produce natural language?, one of the earliest and simplest GOFAI symbolic attempts was that of ELIZA, a computer chatbot implemented in 1966 which simulates a Rogerian psychotherapist conducting person-centered therapy
Systems like ELIZA which followed GOFAI’s approach ran into limitations, ultimately predicted by the philosopher Ludwig Wittgenstein even before the Dartmouth project took place in his book Philosophical Investigations (Wittgenstein 1953) — namely, that of describing a reality with too many parts to count. This is precisely what the connectionist school of thought also known as softwere 2.0 solves with neural networks, and the brain child of such tradition is none other than ChatGPT.
Intermezzo One: The Language of Dependent Type Theory
1. Syntactic Sequence Learning via n-gram Models with teenygrad
1.2 From Certain to Uncertain Knowledge with Probabilities
1.3 Composing Probabilities with Sums, Products, and Conditionals
1.4 Random Variables and their Distributions, Expectations, and Variance
1.5 Analytical Probabilities from Data by Counting
# this is an SITP adapted bigram character-level language model taken from karpathy's zero to hero curriculum
# - syllabus: https://karpathy.ai/zero-to-hero.html
# - lecture: https://www.youtube.com/watch?v=PaCmpygFfXo
# - lecture notebook: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
# - makemore: https://github.com/karpathy/makemore/blob/master/makemore.py#L399
# We implement a bigram character-level language model,
# which we will further complexify in followup videos into a modern Transformer language model, like GPT.
# In the lecture above, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently
# evaluating neural networks and (2) the overall framework of language modeling that includes model training,
# sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
dataset = open('./examples/data/names.txt', 'r').read().splitlines()
N = len(dataset)
# Adding some more context to Karpathy's lecture code, the bigram language model is our first software 2.0 "program"
# that describes language probabilistically following the principle of distributional semantics (Harris 1954).
# That is, rather than represent language deterministically with classic GOFAI systems like ELIZA and WordNet,
# the bigram language model represents language stochastically as a random variable Y|X endowed with a conditional probability distribution p(y|x),
# and of course an underlying probability space with the sample space (set of all outcomes) and event space (the set of all possible outocmes).
# Of course, if the stochastic phenomena being described is admits equally likely sample spaces (i.e coin, dice, and cards)
# we can reason, argue, and justify the construction of the associated probability distribution a priori.
# However, it's not so clear what the distribution of *language* is.
# So, we must construct such conditional probabilities *from* data X^(1), X^(2),...,X^(n) by recovering the distribution via *parameter estimation*,
# which is an optimzation problem via maximizing likelihood, which you know from Chapter 2 is equivalent to minimizing empirical risk.
# The goal of the bigram language model is to *recover* such
# autoregressive sequence models (classification likelihood)
# x|y <- (peter abeel/unsupervised/self-supervised)
# ...Histogram (counting frequencies) is the most precise model for training set. it *is* the training set. but it generalizes poorly.
# ...Below, we are building...
conditional_counts_dict = {}
for di in dataset:
di_normalized = ['<S>'] + list(di) + ['<E>']
for x_char,y_char in zip(di_normalized, di_normalized[1:]): # here we
conditional_counts_dict[(x_char,y_char)] = conditional_counts_dict.get((x_char,y_char), 0) + 1
sorted_conditional_counts_dict = sorted(conditional_counts_dict.items(), key = lambda x: -x[1])
# print("2D (char,char) histogram using python's dict:\n", sorted_conditional_counts_dict)
print("--- TRAINING ---")
# We will now construct the same 2d histogram, but with numpy's ndarray instead of python's dict
import numpy as np # we use numpy rather than torch to emphasize that the bigram model does NOT required any deep learning functionality
vocab = sorted(list(set(''.join(dataset)))) # construct vocab
c2i = {c:i+1 for i,c in enumerate(vocab)} # construct map<char,ord>
c2i['.'] = 0 # with . as the start token and end token, to remove counting freq of (<E>*) and (*<S>) which are all 0
V = len(c2i) # evaluate the vocab len V
C_VV = np.zeros((V,V), dtype=np.int32) # and use V to construct C_VV
for di in dataset:
di_normalized = ['.'] + list(di) + ['.']
for x_char,y_char in zip(di_normalized, di_normalized[1:]):
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, ord> to lookup the coordinate index needed for C_VV
C_VV[x_index, y_index] += 1 # update C_VV
# normalize counts C_VV to probs P_VV
C_VVf32 = (C_VV+1).astype(np.float32) # inductive bias (locally smooth)
s_V1 = C_VVf32.sum(axis=1,keepdims=True) # reduce along axis=1 because we want p(y|x) not p(x|y)
P_VV = C_VVf32 / s_V1 # (V, V) / (V, 1) broadcasts
# because the data matrix D_VV is 2 dimensional, we can print it.
# the way to semantically interpret the table is that is to first understand which axis of the ndarray represents which semantic information
# for D_VV, the elements are the counts of bigrams (c1,c2) which is acessed by ord(c1) with axis=0 and ord(c2) with axis=2
# now, since numpy's ndarray's are row major order, axis=0 gets printed vertically from up to down while axis=1 gets printed horizontally from left to right
i2c = {i:c for c,i in c2i.items()} # invert map<char, ord> to map<ord, char> because looping with enumerate provides access to indices
header = ' ' + ' '.join(f'{i2c[y_index]:>4}' for y_index in range(V))
print("2D (ord, ord) histogram using numpy ndarray")
print(header)
for x_index, row in enumerate(C_VV+1):
x_char = f'{i2c[x_index]:>4}'
print(x_char, ' '.join(f'{count:>4}' for count in row))
print("\n\n--- INFERENCE ---")
rng = np.random.default_rng(1337)
sample_count = 10
for _ in range(sample_count):
output, sample_index = [], 0
while True:
# 1. get p(Y|X)
pYcondX_V = P_VV[sample_index].squeeze()
# 2. sample p(Y|X)
sample_index = rng.choice(len(pYcondX_V), size=1, replace=True, p=pYcondX_V)
sample_char = i2c[sample_index.item()]
output.append(sample_char)
# 3. if sampled end token, break
if sample_index == 0: break
print(''.join(output))
loglikelihooddataset,n = 0.0, 0
for di in dataset:
di_normalized = ['.'] + list(di) + ['.']
for x_char,y_char in zip(di_normalized, di_normalized[1:]):
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, ord> to lookup the coordinate index needed for D_VV
pycondx = P_VV[x_index, y_index] # maximize likelihood
logpycondx = np.log(pycondx) # maximize loglikelihood
loglikelihooddataset += logpycondx
n += 1
# print(f'{x_char}{y_char}: {pycondx:.4f} {logpycondx:.4f}')
nlldataset = -loglikelihooddataset # minimize -loglikelihood
avgnlldataset = nlldataset / n # minimize -1/n loglikelihood
print(f'{loglikelihooddataset=}')
print(f'{nlldataset=}')
print(f'{avgnlldataset=}')
1.6 Learning Probabilities from Data by Parameter Estimation
1.7 Summary
1.8 Bibliographic Notes
Intermezzo Two: The Language of Probability Theory
Further Learning
- Stanford CS109 Probability for Computer Scientists (Piech)
- Purdue ECE 302 Probability for Data Science (Chan)
- MIT 6.012 Introduction to Probability (Tsitsiklis)
- Carnegie Mellon 36-705 Intermediate Statistics (Wasserman)
2. Semi-Automated Semantic Sequence Learning via Linear Models with teenygrad
from
2.1 From Representing Discrete Data to Dimensional Data
Q: How do language models represent the meaning of words?
A: As high dimensional random vectors in coordinate spaces
Recall Chapter 1’s guiding question from above. After the previous three chapters, with ngram language models, you now have a better understanding of the first difference and required transition between software 1.0 and software 2.0 which is to move from deterministic localist representations of ELIZA to the stochastic distributional representations of ChatGPT, following the principle of distributional semantics by (Harris 1954) and (Firth 1957).
In the next three chapters, you will come to have a better understanding of the second difference and transition which is to move from the discrete symbolic representations of ELIZA (illustrated by WordNet) strings, dictionaries, and graphs to the high dimensional representations of ChatGPT, following the principle of vector semantics by (Osgood et al. 1957).
For this, we introduce the word meaning representations learned by word2vec, in which the Tensorflow Embedding Projector below projects the coordinates from a 200-dimensional space down to 3-dimensional space for our cortex to visualize. In the remaining part of Chapter 1, our focus is on reproducing the learning algorithm that the embedding projector is performing, which is the unsupervised learning of a subspace using the dimensionality reduction of principal component analysis — for this, we introduce the wonderful language of linear algebra.
Are you ready to begin?
vectorvector Grant Sanderson’s 3Blue1Brown Essence of Linear Algebra Chapter 1: Vectors
euclidean spaceeuclidean space
import numpy as np
king_D = np.array([-0.60581, -0.00148, 0.33867, 0.34192, 0.14056, 0.12235, 0.29698, 0.01794, -0.32272, 0.30459, 0.46147, 0.42833, 0.35232, 0.19600, 0.07191, -0.31273, -0.35806, 0.17491, 0.15752, -0.38974, 0.15785, -0.11237, 0.16996, 0.07034, -0.29624, -0.20062, -0.11649, 0.23958, 0.12123, -0.05922, -0.25320, -0.15877, -0.04878, -0.17475, 0.17761, -0.17073, -0.04392, 0.49176, 0.20456, 0.71964, -0.24501, 0.23090, -0.20057, 0.29144, 0.34997, -0.47062, 0.13777, 0.08265, 0.38853, 0.02435, -0.15429, -0.17934, -0.06365, -0.12415, -0.10394, -0.18429, -0.35483, -0.21525, 0.05942, 0.19793, 0.09521, -0.24977, -0.33427, 0.04912, -0.24623, 0.43333, -0.12799, 0.28146, 0.04764, 0.21654, -0.27160, 0.61729, -0.05107, -0.30553, 0.29249, -0.15464, -0.14522, 0.24393, -0.18919, -0.43948, 0.52454, 0.22005, -0.44827, -0.39906, 0.40726, -0.21335, -0.07971, -0.28354, 0.11108, 0.01481, -0.37292, 0.17809, 0.03488, -0.13215, 0.17253, -0.11791, -0.13976, -0.23828, 0.22592, -0.41355, 0.12601, 0.83641, 0.18459, -0.00347, 0.28769, 0.10095, 0.19476, 0.50792, 0.04113, 0.16390, -0.23309, 0.42694, -0.06147, 0.07271, 0.07544, -0.04256, -0.04545, -0.27834, 0.24494, -0.16883, 0.15000, -0.08949, -0.05867, -0.00903, 0.14529, 0.24702, -0.18937, -0.28886, -0.24008, -0.04227, -0.37717, -0.28747, 0.26174, -0.35527, 0.03815, -0.00708, 0.09371, 0.29560, 0.41579, 0.13900, 0.04153, -0.12024, -0.06554, -0.11634, -0.07988, 0.20049, -0.17647, -0.00582, 0.42376, -0.15636, 0.27413, -0.11051, 0.00478, -0.30303, 0.10362, -0.27428, 0.08752, -0.15209, 0.41466, -0.04700, -0.21136, 0.11834, -0.16985, 0.20846, 0.20791, -0.12102, 0.00275, 0.24892, 0.11430, -0.20873, -0.07238, -0.05436, -0.41899, -0.62990, 0.23796, -0.06399, -0.49256, 0.19141, -0.37416, -0.34462, 0.25972, -0.51707, -0.46258, 0.10827, 0.16233, 0.08140, 0.22514, -0.26821, -0.59703, -0.03896, 0.04064, -0.24537, -0.12389, 0.16034, -0.30174, -0.15281, -0.33583, -0.25088, 0.15511, 0.08964], dtype=np.float32)
print("king_D.shape", king_D.shape)
print("king_D.storage", king_D)
matrix vector multiplicationmatrix multiplication
data matrixdata matrix
import numpy as np
try:
from jaxtyping import Float, jaxtyped
from beartype import beartype
except ImportError:
import typing
class _Subscriptable:
def __class_getitem__(cls, _): return typing.Any
Float = _Subscriptable
jaxtyped = lambda typechecker=None: (lambda fn: fn)
beartype = lambda fn: fn
# indices (N=10000): https://raw.githubusercontent.com/tensorflow/embedding-projector-standalone/master/oss_data/word2vec_10000_200d_labels.tsv
# embeddings (D=200): https://raw.githubusercontent.com/tensorflow/embedding-projector-standalone/master/oss_data/word2vec_10000_200d_tensors.bytes
dict_string2dist: dict[str, np.ndarray] = {
"king": np.array([-0.60581, -0.00148, 0.33867, 0.34192, 0.14056, 0.12235, 0.29698, 0.01794, -0.32272, 0.30459, 0.46147, 0.42833, 0.35232, 0.19600, 0.07191, -0.31273, -0.35806, 0.17491, 0.15752, -0.38974, 0.15785, -0.11237, 0.16996, 0.07034, -0.29624, -0.20062, -0.11649, 0.23958, 0.12123, -0.05922, -0.25320, -0.15877, -0.04878, -0.17475, 0.17761, -0.17073, -0.04392, 0.49176, 0.20456, 0.71964, -0.24501, 0.23090, -0.20057, 0.29144, 0.34997, -0.47062, 0.13777, 0.08265, 0.38853, 0.02435, -0.15429, -0.17934, -0.06365, -0.12415, -0.10394, -0.18429, -0.35483, -0.21525, 0.05942, 0.19793, 0.09521, -0.24977, -0.33427, 0.04912, -0.24623, 0.43333, -0.12799, 0.28146, 0.04764, 0.21654, -0.27160, 0.61729, -0.05107, -0.30553, 0.29249, -0.15464, -0.14522, 0.24393, -0.18919, -0.43948, 0.52454, 0.22005, -0.44827, -0.39906, 0.40726, -0.21335, -0.07971, -0.28354, 0.11108, 0.01481, -0.37292, 0.17809, 0.03488, -0.13215, 0.17253, -0.11791, -0.13976, -0.23828, 0.22592, -0.41355, 0.12601, 0.83641, 0.18459, -0.00347, 0.28769, 0.10095, 0.19476, 0.50792, 0.04113, 0.16390, -0.23309, 0.42694, -0.06147, 0.07271, 0.07544, -0.04256, -0.04545, -0.27834, 0.24494, -0.16883, 0.15000, -0.08949, -0.05867, -0.00903, 0.14529, 0.24702, -0.18937, -0.28886, -0.24008, -0.04227, -0.37717, -0.28747, 0.26174, -0.35527, 0.03815, -0.00708, 0.09371, 0.29560, 0.41579, 0.13900, 0.04153, -0.12024, -0.06554, -0.11634, -0.07988, 0.20049, -0.17647, -0.00582, 0.42376, -0.15636, 0.27413, -0.11051, 0.00478, -0.30303, 0.10362, -0.27428, 0.08752, -0.15209, 0.41466, -0.04700, -0.21136, 0.11834, -0.16985, 0.20846, 0.20791, -0.12102, 0.00275, 0.24892, 0.11430, -0.20873, -0.07238, -0.05436, -0.41899, -0.62990, 0.23796, -0.06399, -0.49256, 0.19141, -0.37416, -0.34462, 0.25972, -0.51707, -0.46258, 0.10827, 0.16233, 0.08140, 0.22514, -0.26821, -0.59703, -0.03896, 0.04064, -0.24537, -0.12389, 0.16034, -0.30174, -0.15281, -0.33583, -0.25088, 0.15511, 0.08964], dtype=np.float32),
"queen": np.array([-0.30776, 0.03691, -0.18558, 0.31012, 0.11950, 0.27827, 0.55683, 0.18181, -0.58425, 0.17935, 0.64690, 0.77510, 0.28613, 0.41206, -0.37302, -0.00877, -0.04482, 0.09054, -0.34902, -0.08987, 0.21297, -0.54038, 0.03983, -0.09468, -0.39881, -0.14433, -0.32549, 0.23944, 0.03030, -0.25032, -0.08188, -0.02524, 0.01776, -0.07530, -0.03132, -0.37621, -0.01467, 0.04420, 0.18076, 0.49967, -0.46510, 0.38056, -0.03662, 0.31360, -0.12388, -0.69273, 0.45250, 0.39414, 0.10382, -0.18083, -0.29114, -0.19337, -0.04503, 0.25394, -0.22558, -0.23089, -0.25173, -0.13796, 0.10233, 0.20276, -0.04811, -0.43686, -0.44864, -0.26512, -0.06243, 0.07716, -0.12606, 0.41493, -0.30450, -0.14465, 0.21208, 0.39148, 0.03268, -0.13662, 0.19926, -0.33510, 0.21867, -0.35820, -0.18970, -0.09537, 0.12245, 0.38925, -0.38757, -0.30637, 0.54951, 0.01394, -0.08508, -0.24813, -0.10594, 0.12171, -0.50957, 0.44566, -0.08242, -0.02203, 0.21989, -0.14438, -0.05283, 0.14035, 0.26419, -0.24030, 0.22532, 1.33754, 0.37092, 0.05798, 0.30040, 0.07502, 0.34760, 0.33096, -0.04410, 0.12008, -0.04259, 0.40471, 0.02082, 0.25745, 0.17993, -0.24099, -0.06470, -0.05714, 0.27449, -0.05952, 0.18110, 0.20618, -0.27499, 0.17484, 0.28749, 0.00435, -0.16284, -0.12278, -0.19078, -0.58445, -0.14534, -0.21705, -0.09056, -0.42018, -0.43452, 0.23021, 0.12230, -0.09652, 0.26512, -0.12037, -0.07013, 0.12415, 0.07769, -0.25398, -0.01356, 0.31983, 0.00162, 0.13570, 0.63584, 0.36664, 0.58859, -0.20397, -0.03051, -0.16674, 0.16519, -0.06899, 0.28424, -0.17868, 0.24528, -0.11615, -0.05568, 0.06143, -0.24224, 0.40372, 0.27728, -0.11411, -0.19426, 0.55504, 0.20043, -0.56549, -0.28225, 0.40233, -0.52753, -0.51510, 0.23187, 0.02527, -0.11301, 0.37002, -0.23588, -0.39991, -0.04561, -0.57695, 0.13713, -0.09015, 0.20224, 0.47651, -0.11756, -0.06060, -0.42495, -0.03794, 0.19471, -0.48927, 0.05144, 0.07654, -0.36900, -0.04806, -0.13519, -0.14509, 0.39243, -0.11192], dtype=np.float32),
"man": np.array([-0.11405, 0.08982, 0.15057, 0.08777, -0.21986, -0.04806, 0.27677, -0.35154, -0.29647, 0.22280, 0.39584, 0.45867, 0.03449, 0.18667, 0.11873, -0.03747, -0.18381, 0.03401, 0.06344, 0.11479, 0.15629, -0.28236, 0.18581, 0.16097, 0.02858, -0.25373, -0.15130, 0.16892, -0.04158, -0.50097, 0.41814, 0.04884, 0.03952, 0.02794, 0.06206, -0.39863, 0.00382, 0.09441, 0.04298, 0.09295, -0.06042, -0.09560, -0.00657, -0.19870, -0.09326, -0.55155, -0.07719, 0.00306, 0.03371, 0.17530, 0.00693, 0.04368, 0.14678, -0.07584, -0.39679, -0.40624, 0.08573, -0.13649, 0.65182, -0.02761, 0.38170, 0.15837, 0.10637, -0.04350, -0.36671, 0.31567, 0.18769, 0.34040, -0.14083, 0.33893, -0.22524, 0.40489, -0.02898, -0.50082, 0.31294, -0.66268, 0.19765, 0.06656, -0.01488, -0.01797, 0.41898, -0.12429, -0.15202, -0.48976, 0.22793, 0.15720, -0.21968, 0.15553, 0.00309, 0.06632, -0.41443, 0.32485, -0.23725, -0.16549, -0.11508, 0.02983, 0.21747, 0.06538, 0.01130, -0.24656, -0.06608, 0.32183, 0.06266, -0.13059, 0.10060, -0.06931, 0.12669, 0.41218, 0.27707, 0.18196, 0.21397, 0.60289, -0.04913, 0.18614, -0.04532, 0.02118, 0.00050, -0.06390, -0.01631, 0.16685, 0.40851, -0.10562, -0.00391, -0.14963, 0.23022, 0.05503, 0.13441, -0.35875, -0.32551, -0.00056, 0.17382, -0.29965, 0.01298, -0.23947, -0.08002, -0.00086, 0.16424, 0.05363, 0.01317, 0.13681, 0.04181, -0.13414, -0.00148, -0.18891, 0.22703, 0.34665, -0.06275, 0.01458, -0.00501, 0.32213, 0.16952, 0.01258, -0.05818, 0.02961, -0.06529, -0.31424, -0.21646, -0.05419, 0.08787, -0.03145, 0.07413, 0.07875, -0.05121, -0.06535, -0.24616, 0.00340, 0.04033, 0.21205, 0.16022, -0.05507, 0.20916, 0.02536, -0.19208, -0.02454, 0.13160, 0.02777, -0.16148, -0.11947, -0.05418, -0.24062, 0.03667, 0.23928, -0.08993, -0.28687, -0.00904, 0.31650, 0.47483, -0.04892, -0.26943, -0.17197, 0.16030, -0.12785, -0.46808, 0.06121, -0.16568, -0.02462, -0.04194, -0.05533, -0.05638, -0.14601], dtype=np.float32),
"woman": np.array([-0.20223, 0.40869, -0.01241, -0.06522, 0.01981, 0.53251, 0.51102, -0.28303, -0.10660, 0.38697, 0.10596, 0.68116, 0.10824, 0.41259, -0.02591, -0.00141, 0.24525, 0.30996, 0.17783, 0.37466, 0.17955, -0.21467, 0.19712, 0.00547, 0.18816, -0.32502, -0.15192, 0.21070, -0.33975, -0.25808, 0.22532, 0.02078, -0.11179, -0.41784, 0.13709, -0.52777, -0.27931, -0.35331, -0.10013, 0.42053, -0.25775, 0.17448, 0.03469, 0.30621, -0.11002, -0.46523, -0.07408, -0.04986, 0.28117, 0.17014, -0.09855, 0.29852, 0.16785, -0.08486, -0.55263, -0.82880, 0.02791, -0.15857, 0.42570, 0.07836, 0.09784, -0.03557, 0.22153, -0.22919, -0.29406, 0.14334, 0.23074, 0.31887, -0.48970, 0.04008, 0.37743, 0.22198, -0.22764, -0.31246, 0.42123, -0.68858, 0.35948, 0.05508, -0.18357, -0.39036, 0.44232, 0.04077, 0.05094, -0.50568, 0.49748, 0.24088, -0.37117, 0.10632, -0.21735, 0.07567, -0.44182, 0.70356, -0.13529, -0.15779, -0.11706, -0.30445, 0.23941, 0.06640, -0.04049, 0.04920, 0.10792, 0.52791, 0.31310, -0.37483, 0.15181, 0.10099, 0.06929, 0.23287, 0.57277, -0.17379, 0.61427, 0.43647, 0.12352, 0.08537, 0.24576, -0.13813, -0.11501, -0.19876, -0.23390, 0.38711, 0.45066, -0.04466, -0.02871, 0.03768, 0.14445, -0.13899, 0.44499, -0.32080, -0.43134, -0.11004, -0.13632, -0.13255, 0.06400, -0.21071, -0.33964, 0.35397, 0.39092, -0.04286, -0.12524, 0.02801, -0.23665, -0.24559, 0.09938, -0.01927, -0.15342, 0.30425, -0.10204, -0.20258, -0.09698, 0.46590, 0.23120, -0.10868, -0.08905, -0.05100, 0.29452, -0.26041, -0.17525, 0.10899, 0.21647, -0.24245, 0.26683, 0.08385, 0.04004, 0.13849, -0.40952, 0.18126, -0.25566, 0.37063, 0.04051, -0.28076, -0.05131, 0.46176, -0.26021, -0.38150, 0.12434, -0.18730, -0.06230, 0.04172, 0.02004, -0.18781, 0.01769, 0.27388, -0.12347, -0.17349, -0.23045, 0.34026, -0.12734, -0.09767, -0.22604, -0.12463, 0.02584, -0.28284, -0.19144, -0.03827, -0.26933, 0.18254, -0.15987, -0.52381, 0.10156, -0.43945], dtype=np.float32),
"paris": np.array([-0.18519, -0.01360, -0.32597, 0.45632, 0.30088, 0.20468, -0.27163, 0.25304, -0.39749, 0.07425, 0.63972, 0.69581, -0.19967, 0.19906, -0.05836, 0.01783, -0.19658, 0.49463, -0.43111, -0.59413, 0.34516, 0.14377, -0.13093, 0.10133, 0.00088, -0.35633, 0.20741, 0.06952, -0.16522, 0.18396, 0.03026, 0.04270, -0.38351, -0.28466, 0.31205, -0.18801, -0.06391, -0.38938, -0.20132, 0.33786, -0.26304, 0.00999, 0.29849, 0.04288, -0.32341, -0.31335, -0.00479, -0.14309, 0.11750, -0.11457, 0.36283, -0.18785, 0.12942, 0.13227, -0.31007, -0.05142, 0.05190, -0.05691, 0.26242, 0.39579, 0.51209, 0.16048, -0.13767, -0.36448, -0.34478, 0.41142, -0.21112, 0.59158, -0.17962, 0.25776, 0.38184, 0.70665, 0.33705, -0.03129, 0.56412, -0.38605, -0.17437, 0.07264, -0.32980, -0.30586, 0.46573, 0.41541, -0.04550, -0.79480, 0.36952, -0.05179, -0.02684, -0.26586, -0.21004, -0.24198, -0.66758, 0.51680, 0.00700, -0.47347, -0.63521, -0.46611, 0.24545, 0.24413, -0.33283, -0.29844, 0.10242, 0.71519, 0.20355, -0.09181, 0.43851, 0.31301, -0.21880, 0.11714, 0.13327, -0.08256, 0.13364, -0.25063, -0.47738, -0.08893, 0.09649, -0.57574, 0.19138, -0.14103, 0.51915, 0.18851, 0.03010, 0.09958, -0.48569, 0.13306, 0.02912, 0.01384, 0.25923, 0.10225, -0.09040, -0.32389, -0.24441, -0.09959, 0.15180, -0.23231, 0.05933, 0.13196, 0.57919, -0.37415, 0.15076, -0.00473, -0.00915, -0.21773, 0.00185, -0.31442, -0.05701, -0.16205, -0.20571, -0.30739, 0.34166, 0.32831, 0.24891, -0.10780, 0.35190, 0.49997, -0.07121, -0.06739, 0.23996, -0.51162, 0.35696, -0.19288, -0.06150, -0.00211, -0.10440, -0.20337, 0.21916, -0.24917, -0.10438, 0.46726, 0.09295, -0.61664, -0.01790, -0.40113, -0.32674, -0.45269, -0.05442, -0.18958, -0.17513, -0.12574, 0.05589, -0.12903, 0.15018, -0.24348, 0.12981, -0.05724, 0.02806, 0.42176, -0.12478, -0.03598, -0.29495, 0.05280, 0.30015, -0.06785, -0.03343, -0.06424, 0.15256, -0.49415, -0.15664, -0.56440, -0.16290, -0.42638], dtype=np.float32),
"france": np.array([-0.18219, 0.06323, 0.13155, 0.19178, 0.45262, 0.07333, 0.00745, 0.13176, -0.04027, 0.18680, 0.58815, 0.72726, -0.20907, 0.29821, 0.09369, -0.04996, -0.12140, 0.24599, -0.28084, -0.37325, 0.61308, 0.02491, 0.15377, 0.35447, -0.27714, -0.10280, 0.42167, 0.28323, 0.08624, 0.04554, -0.11203, -0.09141, -0.15559, -0.12945, 0.17985, 0.05397, -0.20674, -0.00157, 0.02078, 0.64305, 0.10637, 0.12813, -0.03183, 0.26590, -0.00545, -0.20655, 0.19704, -0.04977, -0.16935, 0.06854, -0.15001, -0.07362, 0.09009, 0.10963, -0.45723, -0.00408, 0.08126, -0.33670, 0.11713, 0.31598, 0.20151, 0.11635, -0.11566, -0.40446, -0.30366, 0.06843, 0.03220, 0.33058, 0.09078, 0.13582, 0.28371, 0.38705, 0.27897, 0.09391, 0.26882, -0.26058, 0.00064, -0.21219, -0.12738, -0.40399, 0.40183, 0.39339, 0.08117, -0.32321, 0.22676, 0.13104, -0.10246, -0.52784, -0.25873, -0.09247, -0.37017, 0.41755, -0.14285, 0.15307, -0.35519, -0.03849, 0.13401, 0.13112, -0.35291, -0.36664, 0.03136, 0.80553, 0.41274, 0.01832, 0.22198, 0.05979, -0.00397, 0.11959, -0.02017, 0.50051, 0.07584, -0.12657, -0.01628, -0.00025, 0.09829, -0.26941, 0.02597, -0.12949, 0.27935, -0.23689, -0.01785, 0.19605, -0.41837, 0.07977, -0.05644, 0.01962, 0.20559, -0.07866, -0.28765, -0.26882, -0.10859, 0.15512, -0.00888, -0.09430, -0.03187, 0.10582, 0.19566, 0.09776, 0.06356, 0.08710, -0.15624, 0.04377, 0.07377, -0.17253, 0.03722, -0.02852, -0.15565, -0.19175, 0.32213, 0.44915, 0.32993, -0.63005, 0.23579, 0.14376, 0.12855, 0.14385, 0.16940, -0.47164, 0.24567, -0.18858, -0.03186, -0.09289, -0.61278, -0.16884, 0.22779, -0.02604, 0.23900, 0.04567, 0.08428, -0.32738, -0.19242, -0.10240, -0.27779, -0.24567, 0.24114, -0.39621, -0.44449, 0.20193, -0.14436, 0.03190, 0.38711, -0.41903, -0.11760, 0.08490, 0.23199, 0.61604, -0.13850, -0.01056, -0.50419, 0.36371, 0.59154, -0.23931, -0.13358, 0.17867, -0.08431, -0.26031, 0.18438, -0.27460, -0.15105, -0.13703], dtype=np.float32),
"rome": np.array([-0.11204, 0.22966, -0.13668, 0.42749, 0.39183, 0.47309, 0.25666, 0.07822, -0.24798, -0.13575, 0.46827, 0.06815, -0.17023, 0.39090, -0.08079, -0.29234, -0.32003, 0.23166, -0.12039, -0.17116, 0.21441, -0.35751, -0.18170, -0.29657, -0.07359, -0.20686, 0.12989, 0.12463, -0.10202, -0.23102, -0.04400, -0.12517, -0.03459, -0.12223, 0.31041, -0.14974, -0.39004, -0.13628, 0.06586, 0.34412, 0.04909, 0.02732, 0.04259, -0.23218, 0.16646, -0.36989, -0.24263, -0.07604, 0.09067, 0.02604, -0.23639, -0.04283, 0.42932, 0.21634, -0.40271, 0.07764, 0.13056, -0.06083, 0.19618, 0.54013, 0.14888, -0.24765, -0.25302, -0.30739, 0.01029, 0.29831, 0.07945, 0.45129, 0.22313, 0.07088, 0.33073, 0.53051, -0.02538, -0.34967, 0.07828, 0.00247, -0.39456, -0.02917, -0.15832, -0.11131, 0.80995, 0.23170, 0.13627, -0.43837, 0.42694, -0.17925, -0.19091, -0.15370, -0.09075, 0.00463, -0.25978, 0.44907, 0.12163, -0.10187, -0.29649, -0.16680, -0.16898, -0.18889, -0.05334, -0.68051, -0.21924, 1.08901, -0.08938, -0.01084, 0.64356, 0.07706, -0.38139, 0.38721, 0.38537, 0.06449, 0.02829, -0.11856, -0.18428, 0.23894, 0.32629, -0.79727, -0.02919, -0.08177, 0.06304, -0.30370, 0.06447, -0.09200, -0.16051, 0.16706, 0.28596, 0.24376, 0.01903, -0.24568, -0.15003, 0.03472, -0.18781, -0.06927, 0.09267, 0.14936, -0.16940, -0.02855, 0.08061, -0.12080, 0.02909, 0.04295, 0.25802, -0.37922, 0.32467, -0.24179, 0.10241, -0.42759, -0.26248, -0.39167, -0.29641, 0.28513, 0.25339, -0.46392, 0.38545, 0.01698, -0.02567, -0.01231, -0.10522, -0.01214, 0.57213, -0.07360, -0.17799, 0.11318, -0.23048, 0.37151, -0.08626, 0.04074, 0.04089, 0.59131, 0.11168, -0.08926, 0.06659, -0.22881, -0.31912, -0.39018, -0.18094, -0.27141, -0.33712, -0.07491, -0.19420, -0.13975, 0.37040, 0.00748, -0.23212, -0.27160, 0.00709, 0.19954, -0.19594, -0.06819, -0.33766, 0.28512, 0.11063, -0.17811, -0.39012, -0.00868, 0.18033, 0.10791, -0.16924, -0.43922, -0.42524, -0.02648], dtype=np.float32),
"italy": np.array([-0.15340, 0.01927, 0.16926, 0.41839, 0.53047, 0.17460, 0.39970, -0.03224, -0.20870, 0.22605, 0.59484, 0.17090, -0.38235, 0.31133, 0.20538, -0.04493, -0.27729, 0.32064, -0.05734, -0.34299, 0.50433, 0.03479, 0.20769, 0.13809, 0.01233, -0.25943, 0.10879, 0.46709, -0.00929, -0.26202, -0.30844, -0.18392, -0.03692, 0.00350, -0.09822, -0.10232, -0.51924, -0.31376, -0.10353, 0.38236, 0.19212, -0.14145, 0.08053, 0.05962, -0.30530, -0.51329, -0.32229, -0.07294, -0.26528, 0.22517, -0.12624, 0.25401, 0.08901, 0.04630, -0.25246, -0.51830, 0.02983, -0.28021, 0.20391, 0.49642, 0.14568, 0.01941, -0.07164, -0.17974, -0.27320, 0.35721, 0.07217, 0.40795, 0.07221, 0.00986, 0.13876, 0.50403, 0.01540, 0.03218, 0.25567, -0.27005, -0.25822, -0.18744, -0.27059, -0.21146, 0.45264, 0.23391, 0.03439, -0.15279, 0.33551, 0.13907, -0.30575, -0.41350, -0.33525, -0.19169, -0.60787, 0.42002, 0.04735, 0.07948, -0.24537, 0.16873, -0.01630, -0.21978, -0.19681, -0.31645, 0.14211, 0.90986, 0.27263, 0.08888, 0.58419, 0.08633, 0.18484, -0.02439, 0.22421, 0.22149, -0.06583, -0.32640, -0.54394, 0.03884, 0.01773, -0.72140, 0.05312, 0.15851, 0.34632, -0.33199, 0.07043, 0.31847, -0.33012, 0.02117, -0.03411, 0.11081, 0.16380, 0.03609, -0.28953, -0.04240, -0.17554, 0.11706, -0.06782, 0.06808, -0.40821, -0.01016, -0.00167, -0.04228, 0.16646, 0.24710, -0.04438, 0.01156, 0.25375, -0.14640, 0.04143, -0.34428, -0.22431, -0.47259, 0.16204, 0.67493, 0.23886, -0.45043, 0.26658, 0.04719, -0.21568, -0.09716, 0.27766, -0.14342, 0.48150, -0.17313, -0.15145, 0.20802, -0.69512, -0.02889, -0.02191, 0.02746, 0.05462, 0.20904, -0.04408, -0.51693, -0.09057, -0.19239, -0.23878, -0.43717, 0.11679, -0.19186, -0.32573, 0.14074, -0.16839, 0.16045, 0.46155, -0.07977, -0.41751, -0.02728, 0.00373, 0.53923, -0.16084, 0.02926, -0.49980, 0.07807, 0.39652, -0.47138, -0.22931, -0.16463, -0.12133, -0.06451, 0.08685, -0.50419, -0.06190, -0.04058], dtype=np.float32),
"cat": np.array([0.26959, 0.28885, -0.25441, 0.29609, -0.17498, 0.61828, 0.42658, -0.33028, -0.55416, 0.24162, 0.11672, 0.56325, 0.49905, 0.27891, 0.03885, -0.33034, -0.24380, 0.55970, -0.33500, 0.15236, 0.19751, -0.33009, -0.08106, 0.41318, 0.11463, -0.15276, -0.18670, 0.41262, -0.41940, -0.57349, 0.51915, -0.35298, 0.08269, -0.37676, -0.10860, -0.47629, -0.02687, 0.15058, 0.06812, -0.08056, -0.11488, -0.29507, 0.26964, 0.18912, 0.12296, -0.73252, 0.54527, 0.03410, -0.13766, -0.18625, -0.02463, -0.15198, -0.15954, 0.22369, -0.21203, -0.32467, 0.21283, -0.24604, 0.23843, 0.30972, 0.29833, -0.47957, 0.36598, 0.08849, -0.81814, 0.47101, 0.33824, 0.43682, -0.43284, 0.14764, 0.50940, 0.14663, 0.08716, -0.53612, 0.69195, -0.67798, 0.12627, 0.19669, -0.23032, 0.12126, -0.05029, -0.23783, -0.24113, 0.08685, 0.41477, 0.21425, -0.01235, -0.14322, -0.22114, -0.06646, -0.58032, 0.73164, -0.04452, -0.47726, -0.40563, 0.44273, 0.02877, 0.12494, 0.27107, -0.00345, 0.08363, 0.24003, 0.48798, -0.19554, 0.28059, -0.34323, -0.09547, 0.27878, -0.00391, 0.07340, -0.01744, 0.31574, -0.04113, 0.39985, -0.29742, -0.32761, 0.09009, -0.37477, 0.29915, 0.34550, 0.61891, 0.10591, 0.12979, 0.21423, 0.44679, -0.33396, 0.38823, 0.15324, -0.20648, 0.12805, 0.27564, -0.31365, 0.45786, -0.13229, 0.02166, -0.27080, 0.04489, -0.39346, 0.21707, -0.00268, 0.13500, 0.21508, 0.29841, -0.78957, -0.13894, 0.00594, -0.44211, 0.09468, 0.24073, 0.26808, 0.68818, 0.25129, 0.02253, 0.39138, 0.16902, -0.31680, 0.86300, -0.45429, 0.03778, -0.43180, -0.41733, 0.18255, -0.57203, 0.74918, -0.34643, 0.17222, -0.12939, 0.03335, -0.09112, -0.47202, 0.05456, -0.32359, -0.36980, 0.03217, 0.65667, 0.29583, -0.12857, -0.66626, -0.32521, -0.63238, 0.39975, 0.12622, -0.45418, 0.06869, -0.01442, 0.31165, -0.39287, 0.24561, -0.44056, -0.07728, 0.10282, -0.40999, -0.49304, -0.30354, 0.27296, 0.48230, -0.25617, -0.51907, 0.45467, -0.10298], dtype=np.float32),
"dog": np.array([0.22407, 0.19778, -0.12492, 0.02943, -0.01023, 0.30840, 0.47067, -0.09805, -0.24328, 0.48196, -0.05229, 0.67571, 0.28807, 0.28325, -0.07169, 0.07745, -0.50867, 0.33654, -0.31787, 0.21052, -0.05775, -0.40865, 0.29858, 0.16141, -0.22317, -0.16911, -0.06106, 0.17131, -0.45845, -0.23939, 0.02698, 0.24134, 0.10973, -0.17744, -0.03557, -0.17743, 0.09569, 0.08071, 0.49728, -0.24340, -0.02540, 0.07004, 0.41042, -0.03981, 0.17039, -0.64987, 0.03488, 0.04470, 0.07692, 0.08220, 0.02484, -0.28589, 0.37340, -0.10491, -0.32158, -0.33000, -0.21721, -0.07934, 0.43070, 0.18908, 0.37931, -0.13734, 0.39928, -0.08460, -0.66089, 0.37540, 0.46753, 0.48962, -0.21238, 0.11802, 0.26598, 0.01917, 0.59764, -0.42530, 0.65706, -0.36850, 0.22453, 0.14608, -0.08320, -0.04758, 0.15790, -0.61434, 0.05107, -0.15172, 0.09765, -0.14117, -0.34464, 0.08492, -0.23829, -0.07973, -1.01947, 0.40416, -0.21399, -0.47815, -0.16949, -0.08467, 0.14683, -0.06763, 0.24079, 0.40548, -0.28201, 0.33629, -0.06622, -0.39812, 0.27318, 0.21343, 0.22216, 0.69422, 0.02870, 0.14304, -0.15980, 0.75023, -0.00934, -0.02005, -0.18365, 0.21365, 0.05600, -0.39873, 0.36346, 0.08854, 0.39175, -0.05878, -0.10227, 0.05216, 0.30803, -0.17737, 0.23768, 0.18228, -0.17957, -0.11890, -0.23524, -0.37678, 0.17571, -0.24524, -0.25780, -0.02954, 0.08782, -0.07055, 0.11858, 0.17058, 0.09321, 0.04601, 0.01212, -0.47209, -0.34194, 0.14436, -0.80984, 0.35469, 0.50964, -0.37567, 0.79713, 0.03317, -0.02606, -0.17922, -0.14554, -0.22435, 0.17604, -0.42443, -0.06444, 0.05082, -0.42109, -0.40434, -0.31260, 0.26361, -0.00811, 0.14489, 0.23344, 0.20476, 0.09684, -0.40022, -0.07501, -0.28960, -0.49280, -0.43434, 0.45920, 0.08183, 0.01575, -0.15572, -0.47815, -0.16507, 0.31122, 0.47231, -0.34021, -0.26883, 0.02138, 0.38776, 0.20281, -0.10099, -0.55102, -0.15751, 0.29602, -0.26543, -0.40633, -0.05946, -0.14480, -0.35555, 0.14056, -0.35140, 0.20284, 0.17339], dtype=np.float32),
"table": np.array([-0.24251, -0.02881, -0.04184, 0.11320, 0.07676, 0.31765, 0.04621, 0.27889, 0.08690, -0.43657, 0.32255, 0.05912, -0.05635, -0.20437, -0.13043, -0.19339, -0.05859, 0.63758, -0.51283, 0.17039, 0.11268, -0.09266, 0.09368, -0.25822, -0.08813, 0.00769, -0.36577, 0.60987, 0.13547, -0.30731, 0.10317, -0.39184, -0.13124, 0.07061, 0.08820, -0.38111, -0.17262, -0.10532, -0.14675, -0.21426, -0.32254, 0.16817, 0.04332, -0.00583, -0.05811, -0.72103, -0.35971, -0.07069, -0.20500, -0.53578, 0.06537, 0.01355, -0.03441, 0.16508, -0.41717, -0.48964, -0.02400, 0.08731, 0.23961, 0.56417, -0.06052, -0.19098, 0.24184, -0.14215, -0.08555, 0.35112, -0.05380, -0.16210, 0.18068, 0.14535, 0.34078, 0.81779, 0.13351, -0.44080, 0.17847, -0.33536, -0.02692, 0.23795, -0.48981, 0.27085, 0.21318, -0.26794, -0.47646, -0.47865, 0.43042, 0.12462, -0.23047, -0.00634, -0.44237, -0.54118, -0.42910, 0.51173, 0.15166, 0.06616, -0.00050, 0.04257, 0.38788, 0.20401, 0.14725, -0.34527, 0.26782, 0.41257, 0.26215, -0.13500, -0.25081, 0.19053, 0.11008, 0.48502, 0.03358, -0.07010, 0.21036, 0.02110, -0.75852, 0.17353, -0.04961, 0.08930, -0.22679, -0.26382, 0.08666, 0.02339, -0.04276, 0.14007, 0.20233, -0.22444, -0.13015, -0.26452, 0.22854, -0.19656, 0.08982, -0.11065, -0.01083, -0.14230, 0.55542, -0.40077, 0.43193, -0.01451, 0.42432, 0.15159, 0.09869, -0.31474, -0.09218, 0.34171, 0.00812, 0.04699, -0.58361, 0.44378, -1.12451, 0.19731, 0.13642, 0.32610, -0.03793, -0.12543, 0.33448, 0.08818, -0.39563, 0.49534, -0.14100, -0.56391, 0.18019, -0.09507, -0.28584, -0.23887, -0.15002, 0.24602, -0.17264, -0.82555, -0.39407, 0.01208, 0.13483, -0.50127, 0.43927, -0.47681, -0.05441, -0.22195, 0.20945, -0.32043, -0.16494, -0.12170, -0.05999, 0.00696, -0.03224, 0.15639, -0.50188, -0.17326, 0.19615, 0.53008, -0.08557, 0.25442, -0.23749, -0.42338, 0.22123, 0.12498, -0.11213, -0.41178, -0.18820, -0.14456, -0.05304, -0.37926, 0.32974, -0.46066], dtype=np.float32),
"actor": np.array([-0.34081, 0.29496, 0.06248, 0.02581, 0.28570, 0.21760, 0.21162, -0.15345, -0.47740, 0.71132, 0.16290, 0.47373, 0.12448, 0.37240, 0.18513, 0.40150, -0.50863, 0.52983, -0.31956, 0.19327, 0.35366, -0.28345, 0.34000, -0.19229, 0.34402, -0.67792, 0.04072, 0.53878, -0.02601, -0.50897, -0.09529, -0.11532, -0.27105, -0.36717, -0.15617, -0.63148, -0.05268, 0.25237, -0.25513, 0.79959, -0.33079, -0.21113, -0.04093, 0.24306, -0.24102, -0.49024, -0.31635, 0.51906, 0.08259, -0.07093, 0.05762, -0.31068, 0.19975, -0.29971, 0.24759, -0.54989, -0.47743, -0.41719, 0.47868, 0.89240, 0.28170, -0.01673, 0.02956, -0.14534, -0.33202, -0.18015, -0.02318, 0.19072, -0.12974, 0.05497, 0.18587, 0.23634, 0.20169, -0.04462, 0.04632, -0.17732, 0.27773, 0.26587, -0.70917, -0.15072, 0.20083, -0.30606, 0.12365, -0.91372, 0.38745, 0.44369, -0.45890, -0.62393, -0.87489, -0.12528, -0.84243, 0.71368, -0.13824, -0.30721, -0.45125, -0.02410, -0.08495, 0.14208, 0.21343, 0.11456, 0.57746, 0.51518, 0.74713, -0.44282, 0.46277, 0.26860, 0.26077, 0.23483, -0.19523, -0.11328, -0.28465, 0.57823, -0.35354, 0.14874, -0.03730, 0.06788, -0.03338, -0.14657, 0.87178, 0.28849, 0.43849, 0.27989, -0.17953, 0.43501, -0.30121, -0.32942, -0.35602, -0.24987, -0.28101, -0.39073, 0.21141, -0.08250, 0.44348, -0.14257, 0.11083, 0.42711, 0.48078, -0.40088, 0.46692, 0.35036, -0.07102, -0.33256, 0.15478, -0.21769, -0.43711, 0.06046, -0.09511, -0.13849, -0.45530, 0.62191, 0.30547, -0.24135, -0.22935, 0.14534, 0.24329, -0.02397, 0.53990, -0.10288, 0.24224, -0.39644, -0.42022, 0.20305, 0.10345, -0.29594, -0.10168, 0.30617, 0.27342, 0.73330, 0.33558, -0.46253, 0.52831, -0.53941, -0.71974, -0.17000, 0.33471, 0.31247, -0.39216, 0.04075, -0.16177, -0.56716, 0.34696, -0.04393, -0.14483, -0.43536, 0.20816, 0.60463, 0.20165, -0.21832, -0.35554, -0.41931, -0.18691, -0.31201, -0.52478, 0.25814, -0.37204, -0.06317, -0.00777, -0.26136, 0.12635, -0.34427], dtype=np.float32),
"actress": np.array([-0.03609, 0.48259, 0.10655, 0.15476, 0.74611, 0.35226, 0.26543, -0.09824, -0.74194, 0.80526, 0.48950, 0.66349, -0.04611, 0.19693, 0.01656, 0.38376, -0.47511, 0.58739, -0.42210, 0.36695, 0.23638, -0.13260, 0.11278, -0.07940, 0.65258, -0.78240, -0.19338, 0.22543, 0.03659, -0.34090, -0.35696, -0.05129, -0.15844, -0.55043, -0.11435, -0.78682, -0.05935, 0.09748, -0.36884, 0.66672, -0.33169, -0.22609, -0.27201, 0.42125, -0.33582, -0.50228, -0.41939, 0.29667, 0.25257, 0.07340, 0.04644, -0.31010, 0.50941, -0.25233, 0.04330, -0.33538, -0.34326, -0.35768, 0.31392, 0.94899, -0.00139, -0.22785, -0.27498, 0.01661, -0.42216, -0.27224, 0.08699, 0.09866, -0.15192, 0.05982, 0.09475, 0.45494, 0.09575, -0.27159, -0.07097, -0.53471, 0.38938, -0.05162, -0.69329, -0.03841, 0.18119, 0.01944, 0.06560, -1.02032, 0.21652, 0.43684, -0.66284, -0.68760, -0.74737, -0.10185, -0.65727, 0.47983, -0.04231, -0.61117, -0.52429, -0.08394, -0.11283, 0.16290, 0.03889, 0.29620, 0.63489, 0.61709, 0.62876, -0.45758, 0.09166, 0.04892, 0.30382, 0.02665, -0.12999, -0.27397, -0.28191, 0.82811, -0.17342, 0.15485, 0.21338, -0.16891, -0.08952, -0.11762, 0.96667, 0.28835, 0.29522, 0.17668, -0.52001, 0.31824, -0.65100, -0.52572, -0.23879, 0.13706, -0.26424, -0.37772, 0.33564, -0.39051, 0.37396, -0.41123, -0.19594, 0.23036, 0.26708, -0.66899, 0.37358, 0.33660, -0.41522, -0.12397, 0.26031, -0.21327, -0.54083, -0.04208, -0.34020, 0.04574, -0.43365, 1.02148, 0.61688, -0.26216, -0.53543, 0.11807, 0.18984, -0.05739, 0.32445, -0.13142, 0.56787, -0.50690, -0.35184, 0.04989, 0.07077, -0.01400, 0.07540, 0.44800, 0.13940, 0.76294, 0.13431, -0.64686, 0.21227, -0.18815, -0.90020, -0.28664, 0.35314, 0.40869, -0.25572, 0.25900, 0.17797, -0.81918, 0.12856, -0.18146, 0.12532, -0.24020, 0.36360, 0.87130, -0.06650, -0.24994, -0.14889, -0.34492, -0.04197, -0.29965, -0.32774, 0.17399, -0.36773, 0.09036, -0.06386, -0.28132, 0.17017, -0.33808], dtype=np.float32),
}
vocab = list(dict_string2dist.keys())
N = len(vocab)
# f: produces distributional embedding representations in R^D given localist one-hot representation in R^N
@jaxtyped(typechecker=beartype)
def f(x_N: Float[np.ndarray, "N"]) -> Float[np.ndarray, "D"]:
# E_DV = np.stack([dict_string2dist[i] for i in vocab], axis=1)
E_ND = np.stack([dict_string2dist[i] for i in vocab])
y_D = E_ND.T @ x_N
return y_D
xonehot_N = np.eye(N)[vocab.index("man")] # row-wise plucking basis vector from identity matrix I
yembedding_D = f(xonehot_N)
print("xonehot_N.shape", xonehot_N.shape)
print("xonehot_N.storage", xonehot_N)
print("f(xonehot_N) => yembedding_D.shape", yembedding_D.shape)
print("f(xonehot_N) => yembedding_D.storage", yembedding_D)
assert np.allclose(yembedding_D, dict_string2dist["man"])
# f_batched: produces batched distributional embedding representations in R^{BxD} given batched localist one-hot representation in R^{BxN}
@jaxtyped(typechecker=beartype)
def f_batched(X_BN: Float[np.ndarray, "B N"]) -> Float[np.ndarray, "B D"]:
E_ND = np.stack([dict_string2dist[i] for i in vocab])
y_BD = X_BN @ E_ND
return y_BD
Xonehot_BN = np.eye(N)[[vocab.index("king"), vocab.index("man"), vocab.index("woman")]]
Yembedding_BD = f_batched(Xonehot_BN)
print("Xonehot_BN.shape", Xonehot_BN.shape)
print("Xonehot_BN.storage", Xonehot_BN)
print("f_batched(Xonehot_BN) = Yembedding_3D.shape", Yembedding_BD.shape)
print("f_batched(Xonehot_BN) = Yembedding_3D.storage", Yembedding_BD)
assert np.allclose(Yembedding_BD[0], dict_string2dist["king"])
assert np.allclose(Yembedding_BD[1], dict_string2dist["man"])
assert np.allclose(Yembedding_BD[2], dict_string2dist["woman"])
Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King - Man + Woman” results in a vector very close to “Queen.”
following the mikolov paper..we can ask questions.. if x is a linear combination..
vector space vector space Grant Sanderson’s 3Blue1Brown Essence of Linear Algebra Chapter 16: Abstract Vector Spaces
abbrev VectorSpace (K V : Type) [Field K] [AddCommGroup V] := Module K V
linear combinationlinear combinationGrant Sanderson’s 3Blue1Brown Essence of Linear Algebra Chapter 2: Linear Combinations, Span, and Basis Vectors
def is_linear_combination (S : Set V) (x : V) : Prop :=
∃ (s : Finset V) (f : V → K), (↑s ⊆ S) ∧ (x = Finset.sum s (fun v => f v • v))
spanspan
linear independancelinear independance
basisbasis
standard basisstandard basis
rankrank
dimensiondimension
(show there is no exact linear combination)
(so we need to induce a geometry (length, distance, and angle)) with a norm norms in R^n can be induced with inner products (but not all) inner productinner product inner product spaceinner product space
class InnerProductSpace_v (V : Type) [AddCommGroup V] [VectorSpace ℂ V] where
inner : V → V → ℂ
inner_self_im_zero : ∀ (v : V), (inner v v).im = 0
inner_self_nonneg : ∀ (v : V), 0 ≤ (inner v v).re
inner_self_eq_zero : ∀ (v : V), inner v v = 0 ↔ v = 0
inner_add_left : ∀ (u v w : V), inner (u + v) w = inner u w + inner v w
inner_smul_left : ∀ (a : ℂ) (v w : V), inner (a • v) w = a * inner v w
inner_conj_symm : ∀ (v w : V), inner v w = conj (inner w v)
normnorm
lengthlength
distancedistance
matrix vector multiplicationmatrix multiplication
function matrixfunction matrix
2.2 Predicting Scalars by Linearly Modeling Regression
Similarly to learning distributions from data, learning functions from data roughly follows the three step process of selecting a model for the task , a performance measure , and optimizing such measure on experience . Let’s take a look at the linearly modeling regression with squared error loss, a problem which straddles tools from all three languages of probability theory, linear algebra, and calculus.
For the task of house price prediction, the input space is the size in square feet, and the output space is the price, meaning that the function which needs to be recovered has the type of . An assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is to assume that there exists some linear relationship between the data and parameters so that ends up being modeled as an inner product plus bias, parameterized as :
One small adjustment we can make to is to fold the final bias term into the vector as , increasing the total dimensionality so that , and adjusting accordingly so that and . Then, the equation of our line with bias is simply parameterized as
Semantically speaking, each property of the input house is weighted by some weight , adjusted accordingly during the learning process so that it accurately reflects the property’s influence on the output price when evaluating a prediction . We can evaluate on all with a randomly intialized to ensure we’ve wired everything up correctly.
However, while specifying the equation of the linear regression model on paper screen
and subsequently evaluating the computation manually might have been sufficient
for pre-computational mathematicians such as Gauss and Legendre,
the discipline of statistical learning is entirely predicated on computational methods.
Let’s continue to use PyTorch’s core n-dimensional array datastructure with torch.Tensor,
this time modeling a function rather than some distribution :
(todo: maybe change example to closure following torch.nn)
import torch
def f(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
return torch.dot(x, w)
in which we can update the names of our torch.Tensor variables with the ranks of their vector
spacesReferred to as Shape Suffixes, described by Noam Shazeer, an engineer from Google
to clarify what dimensions these operations are evaluated on:
import torch
def f_shapesuffixed(x_D: torch.Tensor, w_D: torch.Tensor) -> torch.Tensor:
return torch.dot(x_D, w_D)
and finally, actually enforce them via runtime typechecking with the jaxtyping package
(try swapping the return statements below to trigger a type error with the return type):
import torch
from jaxtyping import Float, jaxtyped
from beartype import beartype
@jaxtyped(typechecker=beartype)
def f_typechecked(
x_D: Float[torch.Tensor, "D"],
w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
# return X_ND
return torch.dot(x_D, w_D)
Let’s now sanity-check that f_typechecked is wired up correctly
by evaluating it on all with random parameter w_D.
(todo: use actual housing data)
import torch
if __name__ == "__main__":
X_ND = torch.tensor([
[1500, 10, 3, 0.8], # house 1
[2100, 2, 4, 0.9], # house 2
[800, 50, 2, 0.3] # house 3
], dtype=torch.float32)
Y_N = torch.tensor([500000, 800000, 250000], dtype=torch.float32)
w_D = torch.randn(4); print(f"random weight vector: {w_D}")
for (xi_D,yi_1) in zip(X_ND,Y_N):
yihat_1 = f_typechecked(xi_D,w_D)
print(f"expected: ${yi_1}, actual: ${yihat_1:.2f}")
random weight vector: tensor([-0.3343, 0.0768, 2.7828, -0.1331])
expected: $500000.0, actual: $-492.46
expected: $800000.0, actual: $-690.89
expected: $250000.0, actual: $-258.08
Clearly, these outputs are unintelligible, and will only become intelligible with a “good” choice of parameter .
Before we find such a parameter, let’s make one more modification to our function
to increase it’s performance.
Rather then evaluating f_typechecked on every input xi_D in the dataset zip(X_ND,Y_N) sequentially with a loop,
we can evaluate all outputs with a single matrix-vector multiplication by updating ’s function definition to the following
and the corresponding PyTorch updated to
import torch
@jaxtyped(typechecker=beartype)
def f_batched(
X_ND: Float[torch.Tensor, "N D"],
w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
return X_ND@w_D
Now that we’ve selected our model for the task , we can proceed with selecting our performance measure which the learner will improve on with experience .
2.3 Fitting Linear Regression by Directly Optimizing Squared Error
After the inductive bias on the family of functions has been made, the learning algorithm must find the function with a good fit. Since artificial learning algorithms don’t have visual cortex like biological humans[], the notion of “good fit” needs to defined in a systematic fashion. This is done by selecting the parameter which maximizes the likelihood of the data . Returning to the linear regression inductive bias we’ve selected to model the house price data, we assume there exists noise in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that where
prices are normally distributed conditioned on seeing the features with the mean being the equation of the line where , then we have that
TODO: generate prose/exposition by repaging lecture/text in ram
Returning to the linear regression model, we can solve this optimization with a direct method using normal equations. QR factorization, or SVD.
def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b
if __name__ == "__main__":
X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) # data
X_b = np.column_stack((np.ones_like(X), X)) # [1, x]
bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y) # w = [b, m]
yhats = fhatbatched(X, mhat, bhat) # yhat
for y, yhat in zip(Y, yhats):
print(f"expected: ${y:.0f}, actual: ${yhat:.2f}")
To summarize, we have selected and computed
- an inductive bias with the family of linear functions
- an inductive principle with the least squared loss
- the parameters which minimze the empirical risk, denoted as
Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.
2.4 Predicting Categories by Linearly Modeling Classification
Table of Contents logistic regression, cross entropy loss, gradient descent
# this is an SITP adapted bigram character-level language model taken from karpathy's zero to hero curriculum
# - syllabus: https://karpathy.ai/zero-to-hero.html
# - lecture: https://www.youtube.com/watch?v=PaCmpygFfXo
# - lecture notebook: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
# - makemore: https://github.com/karpathy/makemore/blob/master/makemore.py#L399
# We implement a bigram character-level language model,
# which we will further complexify in followup videos into a modern Transformer language model, like GPT.
# In the lecture above, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently
# evaluating neural networks and (2) the overall framework of language modeling that includes model training,
# sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
print("\n\n--- DATA ---")
import torch
import torch.nn.functional as F
dataset = open('./examples/data/names.txt', 'r').read().splitlines()
D = len(dataset) # D is the length of the dataset. N is reserved for the number of examples, in which each di can comprise many of
vocab = sorted(list(set(''.join(dataset)))) # construct vocab
c2i = {c:i+1 for i,c in enumerate(vocab)} # construct map<char,usize>
c2i['.'] = 0 # with . as the start token and end token, to remove counting freq of (<E>*) and (*<S>) which are all 0
V = len(c2i) # evaluate the vocab len V
xindicesraw, yindicesraw = [], []
for di in dataset: # change to dataset[:1] when debugging to limit the number of N examples
di_normalized = ['.'] + list(di) + ['.'] # normalize each word di
for x_char,y_char in zip(di_normalized, di_normalized[1:]): # loop through the (x,y) "self-supervised" bigrams of each word di with zip
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, usize> to map representation from discrete characters to discrete integers (ords)
xindicesraw.append(x_index), yindicesraw.append(y_index) # append
N = len(xindicesraw)
xindices_N, yindices_N = torch.tensor(xindicesraw), torch.tensor(yindicesraw) # then, convert python lists into torch tensors
print(f'inputs (usize): {xindices_N}'); print(f'outputs (usize): {yindices_N}')
# finally, map representation once again from discrete integers to continuous (but local) one hot vectors
x1hots_NV, y1hots_NV = F.one_hot(xindices_N,num_classes=V).float(), F.one_hot(yindices_N,num_classes=V).float()
print(f'inputs (1hot): {x1hots_NV.shape} {x1hots_NV}'); print(f'outputs (1hot): {x1hots_NV.shape} {x1hots_NV}')
print("\n\n--- ARCH ---")
g = torch.Generator().manual_seed(1337+1) # avgnll: 3.5 -> 4.9
W_VV = torch.randn((V, V), generator=g, requires_grad=True) # V neurons
print("\n\n--- TRAINING ---")
K = 100
lr = 50
print(f'{D=}, {N=}, {K=}')
for k in range(K): # gradient descent
# TRAINING FORWARD {k} --- (including softmax) with N examples from dataset D')
logits_NV = x1hots_NV @ W_VV # batch matmul print(f'logits_NV: {logits_NV.shape}, {logits_NV}');
counts_NV = logits_NV.exp() # equivalent to C_VV print(f'counts_NV: {counts_NV.shape}, {counts_NV}')
probsycondx_NV = counts_NV / counts_NV.sum(dim=1, keepdims=True) # normalize, completing the evaluation of softmax
# print(f'(forward) probsycondx_NV: {probsycondx_NV.shape}\n {probsycondx_NV}')
# vectorized evaluation of loss in TEST section below
indices_N = torch.arange(N) # in order to evaluate loss of first N examples, use torch.arange(N) NOT xindices_N
loss = -probsycondx_NV[indices_N, yindices_N].log().mean() # however we do use yindices_N to find the corresponding likelihood predictions of the *targets*
print(f'{k=}, {loss=}') # pluck the likelihoods, then eval the log, average it, and take the inverse
# ...to get negative log likelihood
# TRAINING BACKWARD {k} --- (including softmax) with N examples from dataset D')
W_VV.grad = None # same as zeroing gradients
loss.backward() # eval f'(x)
# print(f'(backward): {W_VV.grad.shape}, {W_VV.grad}')
# TRAINING STEP {k} ---\n')
W_VV.data += -lr*W_VV.grad
print("\n\n--- INFERENCE ---")
print("\n\n--- TEST ---")
i2c = {i:c for c,i in c2i.items()} # invert map<char, ord> to map<ord, char> for decoding
# logpD,n = 0.0, 0
# nlls = torch.zeros(N) # replace sum and count tracking for average with .mean() (we won't be pushing logpycondx though only -logpycondx)
# for i in range(N): # loop through the 5 input-outputs (x^(i), y^(i)) of the first word di in dataset
# xord, yord = xindices_N[i].item(), yindices_N[i].item() # map (x^(i), y^(i))'s index to ordinal
# xchar, ychar = i2c[xord], i2c[yord] # map (x^(i), y^(i))'s ordinal to char
# pYcondx_V = probsycondx_NV[i]
# pycondx_ = pYcondx_V[yord]
# logpycondx_ = torch.log(pycondx_)
# nll = -logpycondx_
# print(f'(x(^{i}),y(^{i})): ({xchar},{ychar}) ---> py(^{i})condx(^{i})hat: {pycondx_:.4f}, logpy(^{i})condx(^{i}): {logpycondx_:.4f}, nll: {nll:.4f}')
# nlls[i] = nll
# logpD += logpycondx
# n += 1
# nllD = nlls.sum()
# avgnllD = nlls.mean()
# print(f'{nllD=}, {avgnllD=}')
2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss
2.6 Generalized Linear Models
linearity depends on that the scalars are in a field (addition is associative) “scalars” in a field –> we are dealing with floating point representations “vectors” we saw coordinate systems in 1.4 “matrices” and we saw linear transformations in 2.1 mathematically, a vector is a coordinate in some basis of a vector space a matrix is a linear function, the linear, in numerical linear algebra, is aspirational.
2.7 Summary
2.8 Bibliographic Notes
Intermezzo Three: The Language of Matrix Calculus
Further Learning
- MIT 18.06 Linear Algebra (Strang)
- MIT 18.065 Matrix Methods in Machine Learning (Strang)
- UT Austin Linear Algebra: Foundations to Frontiers (van de Geijn)
3. From the Array of IPL to the Multidimensional Array of APL
3.1 From Abstract to Numerical Linear Algebra: Real to Floating Arithmetic
After SITP’s first two chapters, you now understand how to write software 2.0 programs which learn with numpy, and it’s core multidimensional np.ndarray data structure.
For data, instead of using locally discrete representations, software 2.0 programs use dimensionally stochastic representations.
— this is why in Chapter 1 you learned the technique of 1. Representing Data with High Dimensional Stochasticity in numpy
And for functions, instead of constructing maps with explicit instructions, software 2.0 programs recover such maps with differential optimization
— this is why in Chapter 2 you learned the technique of 2. Learning Functions from Data with Parameter Estimation in numpy.
Throughout the first two chapters, a question you may have been asking yourself is exactly how numpy as a library is implemented.
While there are many aspects to numpy (sidenote?), the primary components we care about is the core np.ndarray data structure
and it’s acceleration of basic linear algebra subroutines through the usage of low-level software known as the BLAS.
In this chapter, we will finally understand how numpy works under the hood by implementing our very own np.ndarray with teenygrad.Tensor, and it’s subset of the BLAS with teenygrad.eagkers.
In Part II. Neural Networks’s Chapter 5. Accelerating Sequence Models on GPU in teenygrad,
you will evolve teenygrad’s implementation by adding neural network primitives, gradient-based optimization, and gpu acceleration with cuda cores
in order to support the various inductive biases of deep neural networks explored in the “era of research”, which roughly spans from 2012-2020.
However, before we begin with teenygrad’s implementation,
there is another transition we must make moving from programming machine learning algorithms to programming machine learning systems themselves.
Namely, the understanding that comes with the shift from abstract linear algebra to numerical linear algebra,
which is that the “linear” in computational linear algebra is moreso aspirational.
The first place to understand that is in the transition from real number arithmetic to floating point arithmetic.
Although the discipline of machine learning heavily relies on the language of linear algebra, as you now know from the previous two chapters, the primary vector spaces that are used are d-dimensional coordinate spaces in which we can construct random vectors and their high dimensional joint distributions to represent data. Recall the set-theoretic definitions of , , and so on, up until . Because these definition are constructive, they translate quite easily to code:
from typing import Self
class Rd():
def __init__(self: Self, data: list[T]) -> None:
self.data = data
def __add__(self: Self, other: Self) -> Self:
return Rd([xi + yi for xi, yi in zip(self.data, other.data)])
But then that begs the question, what is the definition of the set ?
class Real():
def __init__(self: Self) -> None: raise NotImplementedError("todo")
def __add__(self: Self, other: Self) -> Self: raise NotImplementedError("todo")
3.2 Virtualizing Logical Tensor.shape to Physical Tensor.storage with Tensor.stride
from typing import Self
import array, math
import teenygrad
class InterpretedTensor:
@classmethod
def arange(cls, end: int, requires_grad: bool=False) -> Self: return InterpretedTensor((end,), list(range(end)), requires_grad=requires_grad)
@classmethod
def zeros(cls, shape: tuple[int, ...]) -> Self:
numel = math.prod(shape)
tensor = InterpretedTensor((numel,), [0.0]*numel).reshape(shape)
return tensor
@classmethod
def ones(cls, shape: tuple[int, ...]) -> Self:
numel = math.prod(shape)
tensor = InterpretedTensor((numel,), [1.0]*numel).reshape(shape)
return tensor
def __init__(self, shape: tuple[int, ...], storage: list[float], inputs: tuple[Self, ...]=(), requires_grad: bool=False) -> None:
self.shape: tuple[int, ...] = shape
self.stride: tuple[int, ...] = [math.prod(shape[i+1:]) for i in range(len(shape))] # row major, and math.prod([]) produces 1
self.storage: list[float] = storage
self.inputs: tuple[Self, ...] = inputs
self._backward = lambda: None # callers override after init with self captured in closure
self.grad: InterpretedTensor = InterpretedTensor.zeros(shape) if requires_grad else None # python can recursively type (no need for Box<_>) bc everything is a heap-allocated reference
@property
def numel(self): return math.prod(self.shape) # np (and thus jax) call this .size
@property
def ndim(self): return len(self.shape)
@property
def T(self) -> Self:
assert self.ndim == 2
m, n = self.shape
t = InterpretedTensor((n, m), self.storage)
t.stride = [self.stride[1], self.stride[0]]
return t
def reshape(self, shape: tuple[int, ...]) -> Self:
self.shape = shape
self.stride = [math.prod(shape[i+1:]) for i in range(len(shape))] # math.prod([]) produces 1
return self
def __repr__(self) -> str:
return f"InterpretedTensor({self.chunk(self.storage, self.shape)})"
@staticmethod
def chunk(flat, shape):
if len(shape) == 1: return flat[:shape[0]]
size = len(flat) // shape[0]
return [InterpretedTensor.chunk(flat[i*size:(i+1)*size], shape[1:]) for i in range(shape[0])]
# backward f'(x)
@staticmethod
def topo(node: InterpretedTensor, seen: set[InterpretedTensor], output: list[InterpretedTensor]) -> None:
if node in seen: return
seen.add(node)
for input in node.inputs: InterpretedTensor.topo(input, seen, output)
output.append(node)
def backward(self) -> None:
seen, topologically_sorted_expression_graph = set(), []
InterpretedTensor.topo(self, seen, topologically_sorted_expression_graph)
self.grad = InterpretedTensor.ones(self.shape) # base case
for tensor in reversed(topologically_sorted_expression_graph): tensor._backward()
# forwards f(x)
def __radd__(self, other: Self) -> Self: return self.__add__(other)
def __add__(self, other: Self) -> Self:
n, alpha = self.numel, 1
x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*(n))
teenygrad.eagkers.cpu.saxpy(n, alpha, x, y) # y=axpy
requires_grad = self.grad is not None or other.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad
other.grad += output_tensor.grad
output_tensor._backward = _backward
return output_tensor
def __rmul__(self, other: Self) -> Self: return self.__mul__(other)
def __mul__(self, other: Self) -> Self:
n = self.numel
x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*n)
teenygrad.eagkers.cpu.smul(n, x, y, z)
requires_grad = self.grad is not None or other.grad is not None
output_tensor = InterpretedTensor(self.shape, list(z), (self, other), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad * other
other.grad += output_tensor.grad * self
output_tensor._backward = _backward
return output_tensor
def __neg__(self) -> Self:
n = self.numel
x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
teenygrad.eagkers.cpu.saxpy(n, -1, x, y)
requires_grad = self.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
def _backward():
self.grad += -output_tensor.grad
output_tensor._backward = _backward
return output_tensor
def __sub__(self, other: Self) -> Self:
n = self.numel
x, y = array.array('f', other.storage), array.array('f', self.storage)
teenygrad.eagkers.cpu.saxpy(n, -1, x, y)
requires_grad = self.grad is not None or other.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad
other.grad += -output_tensor.grad
output_tensor._backward = _backward
return output_tensor
def tanh(self) -> Self:
n = self.numel
x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
teenygrad.eagkers.cpu.stanh(n, x, y)
requires_grad = self.grad is not None
output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
def _backward():
self.grad += output_tensor.grad * (InterpretedTensor.ones(self.shape) - output_tensor * output_tensor) # f(x) = tanh(x) ==> f'(x) = 1 - tanh(x)^2
output_tensor._backward = _backward
return output_tensor
def __rmatmul__(self, other: Self) -> Self: return other.__matmul__(self) # GEMM does not commute: AB != BA
def __matmul__(self, other: Self) -> Self:
3.3 Implementing Basic Linear Algebra with Tensor.__add__, and Tensor.__matmul__
3.4 From Virtual to Physical Machines: Rust Translation and RISC-V Evaluation
While we now have a correct implementation of numpy’s multidimensional np.ndarray with teenygrad.Tensor,
there is still exists a large difference between the two, primarily boiling down to performance.
Let’s take the example of solving a linear system where
, , and ,
with the the number of equations (the size of the dataset) and
the number of unknowns (the number of dimensions) being large.
(todo, or linear least squares?):
import numpy as np
import teenygrad as tg
The running time with np.ndarray is X, whereas with tg.Tensor it’s Y.
This is because although you are programming numpy with Python,
underneath the hood it’s core basic linear algebra subroutines are accelerated using native software,
from a library called (for lack of a better term), the basic linear algebra subroutines*
(abbreviated here on in as the BLAS).
The BLAS has a longstanding history dating back
That is, programs whose instructions are executed by the physical machine, rather than a virtual machine.
In order to accelerate tg.Tensor, we must dive deeper in gaining a mechanical empathy for the soul of the machine.
This is where programmers differ from mathematical cousins — not only do we concern ourselves about correctness, but also that of performance —
making us
You have to be kind to the
carcomputer. You feel the poor thing groaning underneath you. If you’re going to push a piece of machinery to the limit, and expect it to hold together, you have to have some sense of where that limit is. Out there, is theperfect lapspeed of light. No mistakes: everygear changeload, everycornerstore. Perfect. Do you see it? Most people don’t.
To build SITP’s capstone framework project of teenygrad, you will need to accelerate the
training and inference of the capstone model project nanochat with many core parallel processors known as GPUs
in Part II’s Chapter 5. Accelerating Sequence Models on GPU in teenygrad with CUDA Rust and PTX.
To prepare you for such massively parallel programming, in the remainder of Chaoter 3 you will
accelerate the basic linear algebra subroutines of teenygrad’s tg.Tensor in order deeply understand the following question:
Q: How do high performance software libraries accelerate their speeds?
A: By accelerating the computation of pipelines and the communication of hierarchies.
Historically speaking, the way in which biological humans first started communicating with digital computers was by writing programs directly in a machine language.
machine programming language: assemblers high level programming languages: fortran, cobol, algol systems programming language: C (only survivor) virtual machine and jits: euluer/pascal, java, javascript, python
The core idea behind the borrow checker is that variables have three kinds of permissions on their data:
- Read (R): data can be copied to another location.
- Write (W): data can be mutated in-place.
- Own (O): data can be moved or dropped.
3.5 Accelerating AXPY and GEMV on CPU via Rooflines
/// SGEMM with the classic BLAS signature (row-major, no transposes):
/// C = alpha * A * B + beta * C
fn sgemm(
m: usize, n: usize, k: usize,
alpha: f32, a: &[f32], lda: usize,
b: &[f32], ldb: usize,
beta: f32, c: &mut [f32], ldc: usize) {
assert!(m > 0 && n > 0 && k > 0, "mat dims must be non-zero");
assert!(lda >= k && a.len() >= m * lda);
assert!(ldb >= n && b.len() >= k * ldb);
assert!(ldc >= n && c.len() >= m * ldc);
for i in 0..m {
for j in 0..n {
let mut acc = 0.0f32;
for p in 0..k { acc += a[i * lda + p] * b[p * ldb + j]; }
let idx = i * ldc + j;
c[idx] = alpha * acc + beta * c[idx];
}
}
}
fn main() {
use std::time::Instant;
for &n in &[16usize, 32, 64, 128, 256] {
let (m, k) = (n, n);
let (a, b, mut c) = (vec![1.0f32; m * k], vec![1.0f32; k * n], vec![0.0f32; m * n]);
let t0 = Instant::now();
sgemm(m, n, k, 1.0, &a, k, &b, n, 0.0, &mut c, n);
let secs = t0.elapsed().as_secs_f64().max(std::f64::MIN_POSITIVE);
let gflop = 2.0 * (m as f64) * (n as f64) * (k as f64) / 1e9;
let gflops = gflop / secs;
println!("m=n=k={n:4} | {:7.3} ms | {:6.2} GFLOP/s", secs * 1e3, gflops);
}
}
Thus far in our journey we’ve successfully made the transition
from programming software 1.0 which involves specifying of discrete data structures — like sets, associations, lists, trees, graphs, — with the determinism of types
to programming software 2.0 which involves reovering continous data structures — like vectors, matrices, and tensors — with the stochasticity of probabilities.
If you showed the mathematical equations for your models, loss functions, and optimizers to our mathematical cousinsIn particular, the physicists whom realized describing reality as minimizing free-energy was fruitful, such as Helmholtz with energy, Gibbs with enthalpy, and Boltzmann with entropy. who also made the same transition,
they would understand the mathematical equations and even algorithmsas until recently, algorithms were understood to be a sequence of instructions to be carried out by a human. i.e Egyptian Multiplication, Euclid’s Greatest Common Divisor.
But they wouldn’t understand how the code we’ve programmed thus far with Python and Numpy is being automatically calculated by the mechanical machine we call computers.
For that we need to transition from users of the numpy framework to implementors our own framework, which we’ll call teenygrad.
This requires us moving from the beautiful abstraction of mathematics to the assembly heart and soul of our machines.
This book uses RustTo backtrack to familiarize yourself with Rust, take a look at the experimental version of The Rust Programming Language, by from Will Crichton and Shriram Krishnamurthi, originally written by Steve Klabnik, Carol Nichols, and Chris Krycho., but feel free to follow along with C/C++In that case, take a look at “The C Programming Language by Brian Kernighan and Dennis Ritchie. — what’s fundamental to the purpose of accelerating said basic linear algebra subroutines on the CPU is
is choosing a language that 1. forgoes the dynamic memory management overhead of garbage collection and
2. has a compiled implementation which allows us to analyze and tune the actual instructions which an underlying processor executes.
Using Rust will allow us to start uncovering what is really happening under the hood of accelerated Tensors
To sidetrack to familiarize yourself with systems programming in the context of software 1.0 , read the classic of Computer Systems A Programmer’s Perspectivewith instruction set architectures, microarchitecture, memory hierarchies
- instruction set architecture
- computation: control path, data path
- communication: memory, input/output
In the last chapter we developed teenygrad.Tensor by virtualizing physical 1D storage buffers into
arbitrarily-shaped logical ND arrays by specifying an iteration space with strides,
and started our journey in accelerating the basic linear algebra subroutines defined on the Tensor with Rust,
which speeds up the throughput performance of SGEMM from 10MFLOP/S to 1GFLOP/S.
Simply executing a processor’s native code — referred to as assembly code — rather than Python’s bytecode
results in an increase of two orders of magnitude.
Now that the code that is being executed is native assembly,
we can dive deeper into the architecture of the machine in order to reason and improve
the performance of teenygrad’s BLAS.
3.6 Accelerating Communication with Hierarchies via Loop Reordering, Register and Cache Blocking
3.7 Accelerating Computation of GEMM with Instruction Level Parallelism via Loop Unrolling

3.8 Summary
One quick way to summarize the milestones in high performance computing, compilers and architecture is to list the Turing Award winners: Alan Perlis (1966) for his influence on advanced programming techniques and compiler construction, including his role in designing ALGOL and establishing the discipline of programming languages as a field; John Backus (1977) for designing FORTRAN — the first high-level language to achieve widespread practical adoption — and formalizing language syntax through Backus-Naur Form; Tony Hoare (1980) for axiomatic semantics, giving programmers a formal logical framework for reasoning about program correctness; Niklaus Wirth (1984) for designing a sequence of clean, teachable languages — EULER, ALGOL-W, Pascal, and Modula — that shaped how programming languages are structured and implemented; John Cocke (1987) for pioneering optimizing compilers and the Reduced Instruction Set Computer (RISC) architecture, showing that simpler instruction sets allow faster hardware; William Kahan (1989) for fundamental contributions to numerical analysis, most consequentially the IEEE 754 floating-point standard that made reliable numerical computation reproducible across hardware; Frederick Brooks (1999) for landmark contributions to computer architecture — most notably the IBM System/360 — and for articulating the enduring lessons of large-scale software engineering; Frances Allen (2006) for foundational contributions to the theory and practice of optimizing compilers, including dataflow analysis and the program dependence graph; John Hennessy and David Patterson (2017) for a systematic, quantitative approach to designing and evaluating computer architectures, whose RISC principles underpin billions of processors and the open RISC-V standard; and Alfred Aho and Jeffrey Ullman (2020) for foundational contributions to programming language theory and compiler construction, most durably codified in the Dragon Book; and finally,Jack Dongarra (2021) for pioneering the numerical libraries — BLAS, LAPACK, and MPI — that became the substrate of high-performance scientific computing and modern deep learning accelerators;
3.9 Bibliographic Notes
Intermezzo Four: The Language of Rust and RISC-V
The Creation of Adam, Michelangelo 1508-1512.
You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.
II. Neural Networks
In part one of The Structure and Intepretation of Tensor Programs you have developed a solid foundation in the mathematical preliminaries and statistical models used throughout the machine learning approach to artificial intelligence. It’s amazing how close you are to of deep learning without you even knowing. By the end of Part II. Neural Networks, we will be one step closer in achieving our quest of building our own ChatGPT by reproducing GPT2Presented in Language Models are Unsupervised Multitask Learners (Raford et al. 2019), following Andrej Karpathy’s nanogpt. But before we get there, there is some more work for us to do.
So in Chapter 4. Learning Sequences via Deep Neural Networks with teenygrad,
you will increase the expressivity of the linear models implemented in Part I by non-linearities to get a class of models known as deep neural networks.
In order to build ourselves up to nanogpt, we will hold the training goal of learning sequences constant, and incrementally implement more expressive neural networks architectures
following the nets in Andrej Karpathy’s makemore,
starting from feedforward neural networks (FNNs), to convolutional neural networks (CNNs), to recurrent neural networks (RNNs), and finally, transformer neural networks (GPTs).
These various neural network architectures implement different inductive biases,
which were all explored during the 2012-2019 time period of what is coloqially known as the age of research.
Then, in Chapter 5. Accelerating Sequence Models on GPU in teenygrad,
you will evolve teenygrad from a numerical linear algebra library implemented in Part I to a full blown batteries-included deep learning framework like PyTorch.
This means implementing the optimizers for neural networks
whose evaluations are accelerated with massively parallel processors and
whose gradients are automatically evaluated with an automatic differentiation engine.
After completing part two, you will be ready for Part III. Scaling Networks of the book where we finally achieve our quest of building our own ChatGPT. Part three follows the 2020-2025 time period of what is colloquially as the age of scaling where researchers focused on scaling up the generality of generative pretrained transformers by adding assistant-like behavior in a midtraining phase with reinforcement learning with human feedbackOriginally presented in Introducing ChatGPT (OpenAI 2022), and reproduced by open source in Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al. 2023) and by adding reasoning-like behavior in a posttraining phase with foobarbazOriginally presented in Introducing OpenAI o1 (OpenAI 2024), and reproduced by open source in DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
II. Neural Networks
- 4. Automated Sequence Learning via Deep Neural Networks with
teenygrad- 4.1 From Supervised to Self-Supervised Learning with Sequences
- 4.2 Learning Sequences with Linear Models
- 4.3 From Linear to Non-Linear Learning with Deep Neural Networks
- 4.4 Learning Sequences with Feedforward Neural Networks (FNNs)
- 4.5 Learning Sequences with Convolutional Neural Networks (CNNs)
- 4.6 Learning Sequences with Recurrent Neural Networks (RNNs)
- 4.7 Learning Sequences with Transformer Neural Networks (GPTs)
- 5. Accelerating Deep Neural Networks on GPUs in
teenygrad- 5.1 From Numerical Linear Algebra Libraries to Deep Learning Frameworks
- 5.2 Network Primitives with
teenygrad.nn - 5.3 Automatic Differentiation with
Tensor.forward()andTensor.backward() - 5.4 Gradient-Based Optimization with
teenygrad.optim - 5.4 From
SIMDtoSIMT: From Multi Core to Many Core Processors - 5.5 Accelerating
GEMVonGPUwithCUDA RustandPTXvia Rooflines - 5.6 Accelerating
GEMMonGPUwith Data Reuse - 5.7 Accelerating
GEMMonGPUwith Scheduling
4. Learning Sequences from Data with Deep Neural Networks in torch
(todo, some explanation) on learning sequences
4.1 From Supervised to Self-Supervised Learning with Sequences
4.2 Learning Sequences with Linear Models
# this is an SITP adapted bigram character-level language model taken from karpathy's zero to hero curriculum
# - syllabus: https://karpathy.ai/zero-to-hero.html
# - lecture: https://www.youtube.com/watch?v=PaCmpygFfXo
# - lecture notebook: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
# - makemore: https://github.com/karpathy/makemore/blob/master/makemore.py#L399
# We implement a bigram character-level language model,
# which we will further complexify in followup videos into a modern Transformer language model, like GPT.
# In the lecture above, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently
# evaluating neural networks and (2) the overall framework of language modeling that includes model training,
# sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
print("\n\n--- DATA ---")
import torch
import torch.nn.functional as F
dataset = open('./examples/data/names.txt', 'r').read().splitlines()
D = len(dataset) # D is the length of the dataset. N is reserved for the number of examples, in which each di can comprise many of
vocab = sorted(list(set(''.join(dataset)))) # construct vocab
c2i = {c:i+1 for i,c in enumerate(vocab)} # construct map<char,usize>
c2i['.'] = 0 # with . as the start token and end token, to remove counting freq of (<E>*) and (*<S>) which are all 0
V = len(c2i) # evaluate the vocab len V
xindicesraw, yindicesraw = [], []
for di in dataset: # change to dataset[:1] when debugging to limit the number of N examples
di_normalized = ['.'] + list(di) + ['.'] # normalize each word di
for x_char,y_char in zip(di_normalized, di_normalized[1:]): # loop through the (x,y) "self-supervised" bigrams of each word di with zip
x_index, y_index = c2i[x_char], c2i[y_char] # use map<char, usize> to map representation from discrete characters to discrete integers (ords)
xindicesraw.append(x_index), yindicesraw.append(y_index) # append
N = len(xindicesraw)
xindices_N, yindices_N = torch.tensor(xindicesraw), torch.tensor(yindicesraw) # then, convert python lists into torch tensors
print(f'inputs (usize): {xindices_N}'); print(f'outputs (usize): {yindices_N}')
# finally, map representation once again from discrete integers to continuous (but local) one hot vectors
x1hots_NV, y1hots_NV = F.one_hot(xindices_N,num_classes=V).float(), F.one_hot(yindices_N,num_classes=V).float()
print(f'inputs (1hot): {x1hots_NV.shape} {x1hots_NV}'); print(f'outputs (1hot): {x1hots_NV.shape} {x1hots_NV}')
print("\n\n--- ARCH ---")
g = torch.Generator().manual_seed(1337+1) # avgnll: 3.5 -> 4.9
W_VV = torch.randn((V, V), generator=g, requires_grad=True) # V neurons
print("\n\n--- TRAINING ---")
K = 100
lr = 50
print(f'{D=}, {N=}, {K=}')
for k in range(K): # gradient descent
# TRAINING FORWARD {k} --- (including softmax) with N examples from dataset D')
logits_NV = x1hots_NV @ W_VV # batch matmul print(f'logits_NV: {logits_NV.shape}, {logits_NV}');
counts_NV = logits_NV.exp() # equivalent to C_VV print(f'counts_NV: {counts_NV.shape}, {counts_NV}')
probsycondx_NV = counts_NV / counts_NV.sum(dim=1, keepdims=True) # normalize, completing the evaluation of softmax
# print(f'(forward) probsycondx_NV: {probsycondx_NV.shape}\n {probsycondx_NV}')
# vectorized evaluation of loss in TEST section below
indices_N = torch.arange(N) # in order to evaluate loss of first N examples, use torch.arange(N) NOT xindices_N
loss = -probsycondx_NV[indices_N, yindices_N].log().mean() # however we do use yindices_N to find the corresponding likelihood predictions of the *targets*
print(f'{k=}, {loss=}') # pluck the likelihoods, then eval the log, average it, and take the inverse
# ...to get negative log likelihood
# TRAINING BACKWARD {k} --- (including softmax) with N examples from dataset D')
W_VV.grad = None # same as zeroing gradients
loss.backward() # eval f'(x)
# print(f'(backward): {W_VV.grad.shape}, {W_VV.grad}')
# TRAINING STEP {k} ---\n')
W_VV.data += -lr*W_VV.grad
print("\n\n--- INFERENCE ---")
print("\n\n--- TEST ---")
i2c = {i:c for c,i in c2i.items()} # invert map<char, ord> to map<ord, char> for decoding
# logpD,n = 0.0, 0
# nlls = torch.zeros(N) # replace sum and count tracking for average with .mean() (we won't be pushing logpycondx though only -logpycondx)
# for i in range(N): # loop through the 5 input-outputs (x^(i), y^(i)) of the first word di in dataset
# xord, yord = xindices_N[i].item(), yindices_N[i].item() # map (x^(i), y^(i))'s index to ordinal
# xchar, ychar = i2c[xord], i2c[yord] # map (x^(i), y^(i))'s ordinal to char
# pYcondx_V = probsycondx_NV[i]
# pycondx_ = pYcondx_V[yord]
# logpycondx_ = torch.log(pycondx_)
# nll = -logpycondx_
# print(f'(x(^{i}),y(^{i})): ({xchar},{ychar}) ---> py(^{i})condx(^{i})hat: {pycondx_:.4f}, logpy(^{i})condx(^{i}): {logpycondx_:.4f}, nll: {nll:.4f}')
# nlls[i] = nll
# logpD += logpycondx
# n += 1
# nllD = nlls.sum()
# avgnllD = nlls.mean()
# print(f'{nllD=}, {avgnllD=}')
4.3 From Linear to Non-Linear Learning with Deep Neural Networks
Recall that the task of house price prediction and sentiment classification which can be modelled by functions of the form and respectively. The simplest inductive bias was made, in a which a linear relationship was assumed to hold between the input and output spaces, and where the output was subsequently modeled as an inner product between an input vector and a weight vector. For the case of regression, we have , and for the case of classification, we have , todo:glm,exp. The key entry point into the function class of deep neural networks is that of logistic regression, because the log odds produced by the inner product (which are indeed affine) required a mapping into a valid probability via sigmoid function , where which is in fact not linear nor affine.
The next natural question to ask then is whether the logistic regression model is considered a deep neural network? The answer is that technically yes, it can be considered a degenerative deep neural network with hidden layersIn the same way that a list can be considered a degenerative binary tree or graph. These so-called hidden layers automate the construction of the representation through learning, so that the model not only discovers the mapping from representation to output, but also the representation itselfIn the same way that for certain computations the positional representation of arabic numerals are more suitable compared to that of roman numerals, and polar coordinates over cartesian coordinates. See [A Representational Analysis of Numeration Systems (Zhang, Norman 1995)](A representational analysis of numeration systems Author links open overlay panel). The functions of deep neural networks will take the form of with being a linear classifier on feature extractor , where is the number of compositional layers, and each intermediate function has the form of . Each hidden layer successively and graduallyIn the same way of Grothendieck’s preferred style of mathematics described in Récoltes et Semailles: I can illustrate the second approach with the same image of a nut to be opened. The first analogy that came to my mind is of immersing the nut in some softening liquid, and why not simply water? From time to time you rub so the liquid penetrates better, and otherwise you let time pass. The shell becomes more flexible through weeks and months—when the time is ripe, a touch of the hand is enough, and the shell opens like a perfectly ripened avocado! A different image came to me a few weeks ago. The unknown thing to be known appeared to me as some stretch of earth or hard marl, resisting penetration. One can go at it with pickaxes or crowbars or even jackhammers: this is the first approach, that of the “chisel” (with or without a hammer). The other is the sea. The sea advances insensibly and in silence, nothing seems to happen, nothing moves, the water is so far off you hardly hear it… yet it finally surrounds the resistant substance. lifts the complexity and abstraction of the data’s representationChris Olah, cofounder of Anthropic and the lead of it’s interpretability research wrote an excellent article on how software 2.0’s representation learning loosely correspond to software 1.0’s types in Neural Networks, Types, and Functional Programming.
Together, these two aspects of learning non-linear, representations form the essence of deep learning.
Let’s now turn out attention to the function bodies of these ’s with a deep neural netork of hidden layer, carrying out the task of price regression and sentiment classification so that has the form . With the statistical learning foundation from part one, we will simply present the forward pass , the loss function , and the backward pass .
Forward Pass
The functions of deep neural networks will take the form of with being a linear classifier on feature extractor
where is the number of compositional layers, and each intermediate function has the form of .
Loss Function
Backward Pass
TODO
- figure/diagram/lecun circuits ->algebraic/symbolic equations->torch code
- first train net for price regression and classification
- intuition of automating feature engineering
- XOR: playground.tensorflow
- change code below to XOR
- mention that 2.1 treats backward pass as a black box, which is the magic of the abstraction
- mention to readers that if they want to, they can read 2.1.3, 2.2, and then back to 2.1.4
- first train net for price regression and classification
TODO: generate prose/exposition by repaging lecture/text in ram
class MLP(nn.Module):
"""
takes the previous block_size tokens, encodes them with a lookup table,
concatenates the vectors and predicts the next token with an MLP.
Reference:
Bengio et al. 2003 https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
"""
def __init__(self, config):
super().__init__()
self.block_size = config.block_size
self.vocab_size = config.vocab_size
self.wte = nn.Embedding(config.vocab_size + 1, config.n_embd) # token embeddings table
# +1 in the line above for a special <BLANK> token that gets inserted if encoding a token
# before the beginning of the input sequence
self.mlp = nn.Sequential(
nn.Linear(self.block_size * config.n_embd, config.n_embd2),
nn.Tanh(),
nn.Linear(config.n_embd2, self.vocab_size)
)
def get_block_size(self):
return self.block_size
def forward(self, idx, targets=None):
# gather the word embeddings of the previous 3 words
embs = []
for k in range(self.block_size):
tok_emb = self.wte(idx) # token embeddings of shape (b, t, n_embd)
idx = torch.roll(idx, 1, 1)
idx[:, 0] = self.vocab_size # special <BLANK> token
embs.append(tok_emb)
# concat all of the embeddings together and pass through an MLP
x = torch.cat(embs, -1) # (b, t, n_embd * block_size)
logits = self.mlp(x)
# if we are given some desired targets also calculate the loss
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
return logits, loss
4.4 Learning Sequences with Feedforward Neural Networks
class MLP(nn.Module):
"""
takes the previous block_size tokens, encodes them with a lookup table,
concatenates the vectors and predicts the next token with an MLP.
Reference:
Bengio et al. 2003 https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
"""
def __init__(self, config):
super().__init__()
self.block_size = config.block_size
self.vocab_size = config.vocab_size
self.wte = nn.Embedding(config.vocab_size + 1, config.n_embd) # token embeddings table
# +1 in the line above for a special <BLANK> token that gets inserted if encoding a token
# before the beginning of the input sequence
self.mlp = nn.Sequential(
nn.Linear(self.block_size * config.n_embd, config.n_embd2),
nn.Tanh(),
nn.Linear(config.n_embd2, self.vocab_size)
)
def get_block_size(self):
return self.block_size
def forward(self, idx, targets=None):
# gather the word embeddings of the previous 3 words
embs = []
for k in range(self.block_size):
tok_emb = self.wte(idx) # token embeddings of shape (b, t, n_embd)
idx = torch.roll(idx, 1, 1)
idx[:, 0] = self.vocab_size # special <BLANK> token
embs.append(tok_emb)
# concat all of the embeddings together and pass through an MLP
x = torch.cat(embs, -1) # (b, t, n_embd * block_size)
logits = self.mlp(x)
# if we are given some desired targets also calculate the loss
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
return logits, loss
4.5 Learning Sequences with Convolutional Neural Networks
# Near copy paste of the layers we have developed in Part 3
# -----------------------------------------------------------------------------------------------
class Linear:
def __init__(self, fan_in, fan_out, bias=True):
self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5 # note: kaiming init
self.bias = torch.zeros(fan_out) if bias else None
def __call__(self, x):
self.out = x @ self.weight
if self.bias is not None:
self.out += self.bias
return self.out
def parameters(self):
return [self.weight] + ([] if self.bias is None else [self.bias])
# -----------------------------------------------------------------------------------------------
class BatchNorm1d:
def __init__(self, dim, eps=1e-5, momentum=0.1):
self.eps = eps
self.momentum = momentum
self.training = True
# parameters (trained with backprop)
self.gamma = torch.ones(dim)
self.beta = torch.zeros(dim)
# buffers (trained with a running 'momentum update')
self.running_mean = torch.zeros(dim)
self.running_var = torch.ones(dim)
def __call__(self, x):
# calculate the forward pass
if self.training:
if x.ndim == 2:
dim = 0
elif x.ndim == 3:
dim = (0,1)
xmean = x.mean(dim, keepdim=True) # batch mean
xvar = x.var(dim, keepdim=True) # batch variance
else:
xmean = self.running_mean
xvar = self.running_var
xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
self.out = self.gamma * xhat + self.beta
# update the buffers
if self.training:
with torch.no_grad():
self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
return self.out
def parameters(self):
return [self.gamma, self.beta]
# -----------------------------------------------------------------------------------------------
class Tanh:
def __call__(self, x):
self.out = torch.tanh(x)
return self.out
def parameters(self):
return []
# -----------------------------------------------------------------------------------------------
class Embedding:
def __init__(self, num_embeddings, embedding_dim):
self.weight = torch.randn((num_embeddings, embedding_dim))
def __call__(self, IX):
self.out = self.weight[IX]
return self.out
def parameters(self):
return [self.weight]
# -----------------------------------------------------------------------------------------------
class FlattenConsecutive:
def __init__(self, n):
self.n = n
def __call__(self, x):
B, T, C = x.shape
x = x.view(B, T//self.n, C*self.n)
if x.shape[1] == 1:
x = x.squeeze(1)
self.out = x
return self.out
def parameters(self):
return []
# -----------------------------------------------------------------------------------------------
class Sequential:
def __init__(self, layers):
self.layers = layers
def __call__(self, x):
for layer in self.layers:
x = layer(x)
self.out = x
return self.out
def parameters(self):
# get parameters of all layers and stretch them out into one list
return [p for layer in self.layers for p in layer.parameters()]
# original network
# n_embd = 10 # the dimensionality of the character embedding vectors
# n_hidden = 300 # the number of neurons in the hidden layer of the MLP
# model = Sequential([
# Embedding(vocab_size, n_embd),
# FlattenConsecutive(8), Linear(n_embd * 8, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
# Linear(n_hidden, vocab_size),
# ])
# hierarchical network
n_embd = 24 # the dimensionality of the character embedding vectors
n_hidden = 128 # the number of neurons in the hidden layer of the MLP
model = Sequential([
Embedding(vocab_size, n_embd),
FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
FlattenConsecutive(2), Linear(n_hidden*2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
FlattenConsecutive(2), Linear(n_hidden*2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
Linear(n_hidden, vocab_size),
])
# parameter init
with torch.no_grad():
model.layers[-1].weight *= 0.1 # last layer make less confident
parameters = model.parameters()
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
p.requires_grad = True
4.6 Learning Sequences with Recurrent Neural Networks
class RNNCell(nn.Module):
"""
the job of a 'Cell' is to:
take input at current time step x_{t} and the hidden state at the
previous time step h_{t-1} and return the resulting hidden state
h_{t} at the current timestep
"""
def __init__(self, config):
super().__init__()
self.xh_to_h = nn.Linear(config.n_embd + config.n_embd2, config.n_embd2)
def forward(self, xt, hprev):
xh = torch.cat([xt, hprev], dim=1)
ht = F.tanh(self.xh_to_h(xh))
return ht
class RNN(nn.Module):
def __init__(self, config, cell_type):
super().__init__()
self.block_size = config.block_size
self.vocab_size = config.vocab_size
self.start = nn.Parameter(torch.zeros(1, config.n_embd2)) # the starting hidden state
self.wte = nn.Embedding(config.vocab_size, config.n_embd) # token embeddings table
if cell_type == 'rnn':
self.cell = RNNCell(config)
elif cell_type == 'gru':
self.cell = GRUCell(config)
self.lm_head = nn.Linear(config.n_embd2, self.vocab_size)
def get_block_size(self):
return self.block_size
def forward(self, idx, targets=None):
device = idx.device
b, t = idx.size()
# embed all the integers up front and all at once for efficiency
emb = self.wte(idx) # (b, t, n_embd)
# sequentially iterate over the inputs and update the RNN state each tick
hprev = self.start.expand((b, -1)) # expand out the batch dimension
hiddens = []
for i in range(t):
xt = emb[:, i, :] # (b, n_embd)
ht = self.cell(xt, hprev) # (b, n_embd2)
hprev = ht
hiddens.append(ht)
# decode the outputs
hidden = torch.stack(hiddens, 1) # (b, t, n_embd2)
logits = self.lm_head(hidden)
# if we are given some desired targets also calculate the loss
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
return logits, loss
4.7 Learning Sequences with Generative Pretrained Transformers
class NewGELU(nn.Module):
"""
Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
"""
def forward(self, x):
return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
class CausalSelfAttention(nn.Module):
"""
A vanilla multi-head masked self-attention layer with a projection at the end.
It is possible to use torch.nn.MultiheadAttention here but I am including an
explicit implementation here to show that there is nothing too scary here.
"""
def __init__(self, config):
super().__init__()
assert config.n_embd % config.n_head == 0
# key, query, value projections for all heads, but in a batch
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
# output projection
self.c_proj = nn.Linear(config.n_embd, config.n_embd)
# causal mask to ensure that attention is only applied to the left in the input sequence
self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size))
self.n_head = config.n_head
self.n_embd = config.n_embd
def forward(self, x):
B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
# calculate query, key, values for all heads in batch and move head forward to be the batch dim
q, k ,v = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
# causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
# output projection
y = self.c_proj(y)
return y
class Block(nn.Module):
""" an unassuming Transformer block """
def __init__(self, config):
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd)
self.attn = CausalSelfAttention(config)
self.ln_2 = nn.LayerNorm(config.n_embd)
self.mlp = nn.ModuleDict(dict(
c_fc = nn.Linear(config.n_embd, 4 * config.n_embd),
c_proj = nn.Linear(4 * config.n_embd, config.n_embd),
act = NewGELU(),
))
m = self.mlp
self.mlpf = lambda x: m.c_proj(m.act(m.c_fc(x))) # MLP forward
def forward(self, x):
x = x + self.attn(self.ln_1(x))
x = x + self.mlpf(self.ln_2(x))
return x
class Transformer(nn.Module):
""" Transformer Language Model, exactly as seen in GPT-2 """
def __init__(self, config):
super().__init__()
self.block_size = config.block_size
self.transformer = nn.ModuleDict(dict(
wte = nn.Embedding(config.vocab_size, config.n_embd),
wpe = nn.Embedding(config.block_size, config.n_embd),
h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
ln_f = nn.LayerNorm(config.n_embd),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
# report number of parameters (note we don't count the decoder parameters in lm_head)
n_params = sum(p.numel() for p in self.transformer.parameters())
print("number of parameters: %.2fM" % (n_params/1e6,))
def get_block_size(self):
return self.block_size
def forward(self, idx, targets=None):
device = idx.device
b, t = idx.size()
assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)
# forward the GPT model itself
tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
x = tok_emb + pos_emb
for block in self.transformer.h:
x = block(x)
x = self.transformer.ln_f(x)
logits = self.lm_head(x)
# if we are given some desired targets also calculate the loss
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
return logits, loss
5. Accelerating Sequence Models on GPU in teenygrad with CUDA Rust
5.1 From Numerical Linear Algebra to Deep Learning Frameworks
5.2 Network Primitives and Optimizers with teenygrad.nn and teenygrad.optim
Consider the function where ,
and translate it to it’s computational counterpart in python with one-dimensional Tensors:
import picograd as pg
def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
a = pg.exp(x1)
b = pg.sin(x2)
c = b**2
d = a*c
return d
Figure 1. Python source for the function where
Here we’ve broken up the function to render the subexpressions more clearly.
But this isn’t necessary — automatic differentiation will work if the function was expressed in one line.
In part one, the development of picograd followed that of numpy —
an array programming language similar to Matlab but embedded in the host language of Python,
that could evaluate functions of the form
where Tensor objects stored their values with the value: field
and the function types that produced their values with Op.
For instance, evaluating the specified function f from above with 9 and 10
if __name__ == "__main__":
print(f(9, 10))
populates the Tensor.value fields.
In part one of the book we verified this with a REPL-interface,
but we can also represent the entire expression being evaluated with a
graph of vertices and edges where the vertices are Tensors
(along with their Ops and values) and the edges are their data dependencies:
Here you can see that even if the function was specified in one line,
the graph of the expression always parses into Tensor vertices, and data dependency edges.
You may have noticed the Tensor.grad fields, which supposedly store the values of derivatives .
The question now remains in how to populate these fields.
Taking a step back to differential calculus, deriving the derivative of involves the application of the chain rule where . Evaluating the derivative of the function with respect to its inputs and results in
symbolic and numeric differentiattion symbolic differentiation has performance issues since a large unrolled expression must be constructed in order to differentiate[^0], whereas numerical differentiation has correctness issues since evaluating finite differences requires evaluating functions to a precision point resulting in numerical instability. (trace through EXAMPLE for both. talking nets widrow)
To populate the Tensor.grad fields, the simplest idea would be
to literally translate the manual derivation of the derivative into code.
The translation from math to code involves a design decision:
should we evaluate from outputs to inputs (symbolically outside-in, graphically right-to-left) or from inputs to outputs (symbolically inside-out, graphically left-to-right)?
Although the former order seems more natural with symbolic expressions,
there’s nothing illegal about the latter.
import picograd as pg
def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
a = pg.exp(x1)
b = pg.sin(x2)
c = b**2
d = a*c
return d
# dict[f(x), f'(x)] of local derivatives (adjoints)
dd_da, dd_dc = [c, a] # d(a,c):=a*c ==> d'(a)=c, d'(c)=a
da_dx1 = pg.exp(x1) # a(x1):=exp(x1) ==> a'(x1)=exp(x1)
dc_db = 2*b # c(b):=b^2 ==> c'(b)=2b
db_dx2 = pg.cos(x2) # b(x2):=sin(x2) ==> b'(x2)=cos(x2)
# outputs to inputs: outside-in symbolically, right-to-left graphically
dd_dd = pg.Tensor(1) # base case
dd_da, dd_dc = [dd_dd*dd_da, dd_dd*dd_dc]
dd_dx1 = dd_da*da_dx1 # DONE for the x1->d path
dd_db = dd_dc*dc_db
dd_dx1 = dd_db*db_dx2 # DONE for x2->path
# inputs to outputs: inside-out symbolically, left-to-right graphically
dx1_dx1, dx2_dx2 = [pg.Tensor(1), pg.Tensor(1)] # base case
da_dx1 = da_dx1*dx1_dx1
dd_dx1 = dd_da*da_dx1 # DONE for the x1->d path
db_dx2 = db_dx2*dx2_dx2
dc_dx2 = dc_dc*db_dx2
dd_dx2 = dd_dc*dc_dx_2 # DONE for the x2->d path
Do you notice any difference in the number of evaluations between the two orders?
The outputs-to-input ordering takes 6 arithmetic operations (including the destructuring),
whereas the input-to-output ordering take 7 arithmetic operations.
This is because the former can reuse dd_dd
as a dynamic programming solution to a subproblem for the two inputs, whereas the latter cannot.
And taking a step back, we only want to reuse the output because the shape of the function
is of . Alternatively, if had type ,
then the input-to-output ordering would be able to reuse results.
This distinction is referred to as “forward-mode” vs “reverse-mode”, and reflects
the fact that for some function
the time complexity of forward-mode differentiation is proportional to ,
whereas that of forward-mode differentiation is proportional to .
If the expression graph fans-in so that , reverse-mode is preferred.
If the expression graph fans-out so that , forward-mode is preferred.
However, if we take a step with a graph-theory lens, we can see that the derivative is the sum of paths,
where each path is a product of local derivatives from the input source to the output sink.
From a combinatorics perspective, we are calculating all the possible (ors) ways (ands)
on how the inputs perturb the output. That is:
and as long as the operations along this path are associative — — then we can choose the order in how we perform these path products to minimize the number of operations. Finding the optimal ordering is an NP-hard problem because ____. For instance, if the expression graph is diamond-shaped, evaluating the derivative with forward-mode for the left-half and reverse-mode for the right-half would be more performant. In practice, we use reverse-mode as a heuristic, since most of the functions that are differentiated (so they can be optimized) in the field of machine learning are neural networks of the form
How can we generalize this into an algorithm?
All we need are 1. mappings from and 2. a topological sort
For the derivative rules, the same way that optimizing compilers implement an optimization “manually” once which then gets reused many times, the authors of deep learning frameworks also implement derivatives manually which then become reused many times through automatic differentiation. In theory, we can differentiate any expression with f’(x) with only a few derivative rules for addition and multiplication, but in practice most frameworks provide sugar for complex derivatives.
For topological sort, we can simply reversed the ordering produced by a depth-first-search:
def toposort(self):
order: list[Op] = []
visited: set[Op] = set()
def dfs(node: Op) -> None:
if node in visited: return
visited.add(node)
for src in node.src: dfs(src)
order.append(node)
dfs(self)
return order
class Tensor():
def backward():
for t in reversed(topo):
t.backward()
We will now use this idea to modify the interpretation of our deep learning framework to not only evaluate , but as well. This is done by dynamically overloading the operators at runtime[^0] to trace the expression graph
chain_rules = PatternMatcher([
(Pattern(OpCode.MATMUL, name="input"), lambda output_grad, input: (_____,)),
(Pattern(OpCode.MATVEC, name="input"), lambda output_grad, input: (_____,)),
(Pattern(OpCode.RECIPROCAL, name="input"), lambda output_grad, input: (-output_grad * input * input,)),
(Pattern(OpCode.SIN, name="input"), lambda output_grad, input: ((math.pi/2 - input.src[0]).sin() * output_grad,)),
(Pattern(OpCode.LOG2, name="input"), lambda output_grad, input: (output_grad / (input.src[0] * math.log(2)),)),
(Pattern(OpCode.EXP2, name="input"), lambda output_grad, input: (input * output_grad * math.log(2),)),
(Pattern(OpCode.SQRT, name="input"), lambda output_grad, input: (output_grad / (input*2),)),
(Pattern(OpCode.ADD), lambda output_grad: (1.0*output_grad, 1.0*output_grad)),
(Pattern(OpCode.MUL, name="input"), lambda output_grad, input: (input.src[1]*output_grad, input.src[0]*output_grad)),
])
class Tensor:
def _forward(self, f:Callable, *other:Tensor) -> Tensor: #extra_args=(), **kwargs)
out_tensor = evaluator.eval_uop([self, other], out_uop)
def backward(self, grad:Tensor|None=None) -> Tensor:
"""
backward performs by collecting tensors, computing gradients with automatic differentiation, and updating said tensors.
"""
# 1. collect all tensors that requires grad by topologically sorting the graph of uops and filter
all_uops = self.uop.toposort()
tensors_require_grad: list[Tensor] = [t for tref in all_tensors if (t:=tref()) is not None and t.uop in all_uops and t.requires_grad]
uops_require_grad = [t.uop for t in tensors_require_grad]
assert grad is not None or self.shape == tuple(), "when no gradient is provided, backward must be called on a scalar tensor"
if not (self.is_floating_point() and all(t.is_floating_point() for t in tensors_require_grad)): raise RuntimeError("only float Tensors have gradient")
# 2. compute the gradient with a map of tensors to partials
if grad is None: grad = Tensor(1.0, dtype=self.dtype, device=self.device, requires_grad=False) # base case is 1.0
tens2grads = Tensor._automatically_differentiate(self.uop, grad.uop, set(uops_require_grad)) # skipping materializing zerod grads for now
grads = [Tensor(g, device=t.device) for t,g in zip(tens2grads.keys, tens2grads.values)] # initialize tensor grads on device
# 3. update the tensors that require grad with the gradient's partials
for t,g in zip(tensors_require_grad, grads):
assert g.shape == t.shape, f"grad shape must match tensor shape, {g.shape!r} != {t.shape!r}"
t.grad = g if t.grad is None else (t.grad + g) # accumulate if t.grad exists
return self
@staticmethod
def _automatically_differentiate(root:Op, root_grad:Op, targets:set[Op]) -> dict[Op, Op]:
"""
_differentiate backpropagates partials on a topologically sorted expression graph with the chain rule
and produces the gradient in the form of a map of ops to their partials (which, in turn, are ops)
"""
tens2grads = {root: root_grad}
# 1. topological sort
in_target_path: dict[Op, bool] = {}
for u in root.toposort(): in_target_path[u] = any(x in targets or in_target_path[x] for x in u.src)
dfs = list(root.toposort()) # lambda node: node.op not in {OpCode.DETACH, OpCode.ASSIGN} and in_target_path[node])) # don't flow through DETACH/ASSIGN or anything not in target path
# 2. backpropagation with the chain rule
for tensor in reversed(dfs):
if tensor not in tens2grads: continue
local_grads: tuple[Op|None, ...]|None = cast(tuple[Op, ...]|None, chain_rules.rewrite(tensor, ctx=tens2grads[tensor]))
if local_grads is None: raise RuntimeError(f"failed to compute gradient for {tensor.op}\n\nin {str(tensor)[0:1000]}...")
assert len(local_grads) == len(tensor.src), f"got {len(local_grads)} gradient, expected {len(tensor.src)}"
for tensor,local_grad in zip(tensor.src, local_grads): # <--------------------- MOOOSE: why are we accumulating inside ad()? don't we do it in backward()??
if local_grad is None: continue
if tensor in tens2grads: tens2grads[tensor] = tens2grads[tensor] + local_grad # accumulate if tensor exists
else: tens2grads[tensor] = local_grad # o/w initialize
To implement automatic differentiation with Tensor.backward(),
there is a design decision to be made
— the choice of implementing it dynamically or just-in-time[^3],
similar to the decision of how to implement types for general programming languages[^4].
This stands in contrast to the alternative of performing a just-in-time, source-to-source transformation.
Let’s now move onto automatically differentiating the functions of neural networks, specifically the FFN language model from earlier. (johnson/ryan adams ordering) n^2 vs n^3
5.3 Automatic Differentiation with Tensor.forward() and Tensor.backward()
5.4 From SIMD of Multi-Core Latency-Oriented Processors to SIMT of Many-Core Throughput-Oriented Processors with PTX
5.5 Accelerating GEMV on GPU with CUDA Rust via Rooflines
#![allow(unused)]
fn main() {
// gpu_host.rs
use cudarc::{driver::{self, PushKernelArg}, nvrtc};
use src_device::T; // shared type with device code
static PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/gpu_device.ptx")); // Embed the PTX code as a static string.
pub fn cudars_helloworld() -> Result<(), Box<dyn std::error::Error>> {
// initialize device context and stream via driver api
let process = driver::CudaContext::new(0)?; // device 0
let queue = process.default_stream();
// load ptx via nvrtc
let dylib = process.load_module(nvrtc::Ptx::from_src(PTX))?;
let add_kernel = dylib.load_function("add")?;
// allocate on device
let (a, b): ([T; _], [T; _]) = ([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]);
let (a_gpu, b_gpu, mut c_gpu) = (queue.clone_htod(&a)?, queue.clone_htod(&b)?, queue.alloc_zeros::<T>(a.len())?);
let (a_len, b_len) = (a_gpu.len(), b_gpu.len());
let cfg = driver::LaunchConfig { grid_dim: (1, 1, 1), block_dim: (4, 1, 1), shared_mem_bytes: 0, };
unsafe {
queue
.launch_builder(&add_kernel).arg(&a_gpu).arg(&a_len).arg(&b_gpu).arg(&b_len).arg(&mut c_gpu)
.launch(cfg)?;
}
queue.synchronize()?;
let c = queue.clone_dtoh(&c_gpu)?;
println!("c from cuda is = {:?}", c);
Ok(())
}
}
#![allow(unused)]
fn main() {
// gpu_device.rs
use cuda_std::kernel;
use crate::T;
#[allow(improper_ctypes_definitions)]
#[kernel] pub unsafe fn add(a: &[T], b: &[T], c: *mut T) {
let i = cuda_std::thread::index_1d() as usize;
if i < a.len() {
let elem = unsafe { &mut *c.add(i) };
*elem = a[i] + b[i];
}
}
#[allow(improper_ctypes_definitions)]
#[kernel] pub unsafe fn saxpy(a: &[T], b: &[T], c: *mut T) {
let i = cuda_std::thread::index_1d() as usize;
todo!()
}
#[allow(improper_ctypes_definitions)]
#[kernel] pub unsafe fn smul(a: &[T], b: &[T], c: *mut T) {
let i = cuda_std::thread::index_1d() as usize;
todo!()
}
#[allow(improper_ctypes_definitions)]
#[kernel] pub unsafe fn stanh(a: &[T], b: &[T], c: *mut T) {
let i = cuda_std::thread::index_1d() as usize;
todo!()
}
}
5.6 Accelerating _ of GEMM on GPU with Data Reuse
5.7 Accelerating _ of GEMM on GPU with Scheduling

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.
III. Scaling Networks
Part 3 covers the age of scaling by building a distributed tensor compiler using tinygrad IR
Table of Contents
Epilogue
To continue deepening your knowledge on deep learning and it’s systems, the following resources are a good next step. You might find this book complementary to your reading, since the three streams outlined below were woven into a single narrative for the book. Once you feel comfortable, you should graduate towards contributing to larger deep learning systems. The SITP book obviously sets you up with tinygrad, which itself is a good bridge to pytorch and jax, but at the end of the day the world is your oyster.
Good luck on your journey.
I’ll see you at work.
References
Abelson, H., & Gerald Jay Sussman. (1996). Structure and Interpretation of Computer Programs. MIT Press. https://mitp-content-server.mit.edu/books/content/sectbyfn/books_pres_0/6515/sicp.zip/index.html
Amanzhol Salykov. (2025, January 12). Advanced Matrix Multiplication Optimization on NVIDIA GPUs. Salykova. https://salykova.github.io/sgemm-gpu
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Evgeni Burovski, Chauhan, G., Anjali Chourdia, Constable, W., Alban Desmaison, DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., & Hirsh, B. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ACM, ASPLOS’2024. https://doi.org/10.1145/3620665.3640366
Bach, F. (2024). Learning Theory from First Principles. MIT Press. https://www.di.ens.fr/~fbach/ltfp_book.pdf
Bakhvalov, D. (2024). Performance Analysis and Tuning on Modern CPUs. https://github.com/dendibakh/perf-book
Bertsekas, D. P., & Tsitsiklis, J. N. (2008). Introduction to Probability. Athena Scientific.
Bryant, R. E., & O’hallaron, D. R. (2016). Computer Systems : a Programmer’s Perspective. Pearson.
Boehm, S. (2022, December 31). How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog. Siboehm.com. https://siboehm.com/articles/22/CUDA-MMM
Bright, P., Edelman, A., & Johnson, S. G. (2025). Matrix Calculus (for Machine Learning and Beyond). ArXiv.org. https://arxiv.org/abs/2501.14787
Chan, S. H. (2021). Introduction to Probability for Data Science. Michigan Publishing. https://probability4datascience.com/
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., & Krishnamurthy, A. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. https://doi.org/10.48550/arxiv.1802.04799
Dao, T. (2023, July 17). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ArXiv.org. https://arxiv.org/abs/2307.08691
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022, June 23). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. ArXiv.org. https://doi.org/10.48550/arXiv.2205.14135
Darve, E., & Wootters, M. (2021). Numerical Linear Algebra with Julia. SIAM. https://ericdarve.github.io/NLA
Demmel, J. W. (1997). Applied numerical linear algebra. Society For Industrial And Applied Mathematics.
Dongarra, J., J. Du Croz, Sven Hammarling, & Duff, I. S. (1990). A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1), 1–17. https://doi.org/10.1145/77626.79170
Dongarra, J., J. Du Croz, Sven Hammarling, & Hanson, R. J. (1988a). Algorithm 656: An Extended Set of Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Transactions on Mathematical Software, 14(1), 18–32. https://doi.org/10.1145/42288.42292
Dongarra, J., J. Du Croz, Sven Hammarling, & Hanson, R. W. (1988b). An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14(1), 1–17. https://doi.org/10.1145/42288.42291
Fisler, K., Krishnamurthi, S., Lerner, B. S., & Politz, J. G. (2025). A Data-Centric Introduction to Computing. Dcic-World.org. https://dcic-world.org/
Fog, A. (n.d.). Software Optimization: C++ and Assembly. Agner.org. https://agner.org/optimize/
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. The MIT Press. https://www.deeplearningbook.org/
Gordić, A. (2025, October 29). Inside NVIDIA GPUs: Anatomy of high performance matmul kernels - Aleksa Gordić. Aleksagordic.com. https://www.aleksagordic.com/blog/matmul
Goto, K., & Geijn, R. A. van de. (2008). Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software, 34(3), 1–25. https://doi.org/10.1145/1356052.1356053
Goto, K., & Van De Geijn, R. (2008). High-Performance Implementation of the Level-3 BLAS. ACM Transactions on Mathematical Software, 35(1), 1–14. https://doi.org/10.1145/1377603.1377607
Güneş Baydin, A., Pearlmutter, B., Siskind, J., Baydin, G., Radul, A., & Mark, J. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18, 1–43. https://www.jmlr.org/papers/volume18/17-468/17-468.pdf
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical learning, Second Edition: Data mining, inference, and Prediction (2nd ed.). Springer. https://hastie.su.domains/ElemStatLearn/
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., & Gérard-Marchant, P. (2020). Array Programming with numpy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
Hennessy, J. L., Patterson, D. A., & Christos Kozyrakis. (2025). Computer Architecture. Morgan Kaufmann.
Hwu, W.-M. W., Kirk, D. B., & Hajj, I. E. (2022). Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann.
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning. Springer. https://www.statlearning.com/
Jurafsky, D., & H. Martin, J. (2026). Speech and Language Processing. Stanford.edu. https://web.stanford.edu/~jurafsky/slp3/
Klein, P. N. (2013). Coding the Matrix : Linear Algebra Through Applications to Computer Science. Newtonian Press. https://codingthematrix.com/
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Cody Hao Yu, Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. https://doi.org/10.1145/3600006.3613165
Lambert, N. (2026). RLHF Book. Rlhfbook.com. https://rlhfbook.com/
Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308–323. https://doi.org/10.1145/355841.355847
Marc Peter Deisenroth, A Aldo Faisal, & Cheng Soon Ong. (2020). Mathematics for Machine Learning. Cambridge University Press. https://mml-book.github.io/book/mml-book.pdf
Minsky, M., Papert, S., & Léon Bottou. (2017). Perceptrons : An Introduction to Computational Geometry. The MIT Press.
Matthias Felleisen, Robert Bruce Findler, Flatt, M., & Shriram Krishnamurthi. (2018). How to Design Programs: An Introduction to Programming and Computing. The MIT Press. https://htdp.org/
Mitchell, T. M. (1997). Machine learning. Mcgraw-Hill. https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
Myers, M., Van De Geijn, P., & Van De Geijn, R. (2021). Linear Algebra: Foundations to Frontiers. https://www.cs.utexas.edu/~flame/laff/laff/LAFF-2.00M.pdf
Nakatsukasa, Y. (n.d.). Numerical Linear Algebra. Retrieved March 17, 2026, from https://courses.maths.ox.ac.uk/pluginfile.php/105965/mod_resource/content/35/NLA_lecture_notes.pdf
Ng, A., & Ma, T. (2023). CS229 Lecture Notes. https://cs229.stanford.edu/main_notes.pdf
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., & Bai, J. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Neural Information Processing Systems. https://papers.nips.cc/paper_files/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., & Amarasinghe, S. (2013). Halide. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation. https://doi.org/10.1145/2491956.2462176
Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.
Raschka, S. (2026). Build a Reasoning Model (From Scratch). Simon and Manning.
Roberts, D. A., Yaida, S., & Hanin, B. (2022). The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge University Press. https://deeplearningtheory.com/
Russell, S., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Prentice Hall. https://aima.cs.berkeley.edu/
Slotin, S. (n.d.). Algorithms for Modern Hardware - Algorithmica. En.algorithmica.org. https://en.algorithmica.org/hpc/
seb-v. (2025, January 20). Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS. Seb-v. https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024, July 24). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. ArXiv.org. https://arxiv.org/abs/2407.08608
Shalizi, C. R. (n.d.). Advanced Data Analysis from an Elementary Point of View. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
Shalizi, C. R. (2015). Modern Regression Lecture Notes. Cmu.edu. https://www.stat.cmu.edu/~cshalizi/mreg/15/
Shankhdhar, P. (2024, November 29). Outperforming cuBLAS on H100: a Worklog. Substack.com. https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog
Smith, T. M., Geijn, R. van de, Smelyanskiy, M., Hammond, J. R., & Zee, F. G. V. (2014, May 1). Anatomy of High-Performance Many-Threaded Matrix Multiplication. IEEE Xplore. https://doi.org/10.1109/IPDPS.2014.110
Spector, B., Singhal, A., Arora, S., & Re, C. (2024, May 12). GPUs Go Brrr. Stanford.edu. https://hazyresearch.stanford.edu/blog/2024-05-12-tk
Spector, B. F., Arora, S., Singhal, A., Fu, D. Y., & Ré, C. (2024, October 27). ThunderKittens: Simple, Fast, and Adorable AI Kernels. ArXiv.org. https://arxiv.org/abs/2410.20399
Sul, S., & Ré, C. (2026). ThunderKittens 2.0: Even Faster Kernels for Your GPUs. Stanford.edu. https://hazyresearch.stanford.edu/blog/2026-02-19-tk-2
Spector, B., Juravsky, J., Sul, S., Dugan, O., Lim, D., Fu, D., Arora, S., & Ré, C. (2025). Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B. Stanford.edu. https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles
Spector, B., Juravsky, J., Sul, S., Lim, D., Dugan, O., Arora, S., & Ré, C. (2025). We Bought the Whole GPU, So We’re Damn Well Going to Use the Whole GPU. Stanford.edu. https://hazyresearch.stanford.edu/blog/2025-09-28-tp-llama-main
Stillwell, J. (2010). Mathematics and its history. Springer New York.
Strang, G. (2023). Introduction to Linear Algebra. Wellesley-Cambridge Press. https://math.mit.edu/~gs/linearalgebra/
Sutton, R. S., & Barto, A. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Tillet, P., Kung, H.-T., & Cox, D. G. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. https://doi.org/10.1145/3315508.3329973
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. Society For Industrial And Applied Mathematics.
Valiant, L. (2014). Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books, A Member Of The Perseus Books Group.
Wasserman, L. (2010). All of Statistics. Springer.
Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2023). SGLang: Efficient Execution of Structured Language Model Programs. ArXiv.org. https://arxiv.org/abs/2312.07104
Zadouri, T., Hoehnerbach, M., Shah, J., Liu, T., Thakkar, V., & Dao, T. (2026). FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. ArXiv.org. https://arxiv.org/abs/2603.05451