Euclid Instructing His Pupils in The School of Athens, Raffaello Sanzio da Urbino 1509-1511.

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.

I. Elements of Networks

Although separated by over 2000 years, the discipline of programming originally practiced in Silicon Valley shares the same dream of mathematics in Ancient Greece: using discrete and deterministic descriptions to model reality. Although mathematicians prove properties about mathematical objects with theorems whereas programmers evaluate the values of said objects with programs, what the two traditions share in common is the activity of modeling discrete objects with deterministic descriptions which in turn, are created for them by toolsmiths referred to as logicians and language implementors — the mathematician describes the world with a logic, and the programmer does so with a programming language.

However, for the programmer interested in learning the skills necessary for the statistical learning approach to artificial intelligence, they must make a transition to programming from discrete data usinand deterministic functions to that of continuous data $x$ and stochastic functions $f (x)$ . Although you will encounter many new concepts in your journey up ahead, there are also many similarities with the art of programming that you already know.

So in 1 Continuously Stochastic Data $x$ with numpy, we will first learn about programming data $x$ of software 2.0 by representing types continuously with coordinates, distributing truth stochastically with probability, and then recovering said distributions with maximum likelihood estimation. Then, in 1.2 Statistical Learning of Continuously Stochastic Functions with numpy, we will learn how to recover stochastic functions $y = f (x)$ by automatically evaluating their derivatives $f^{'} (x)$ with HIPS/autograd HIPS/autograd along with torch-autograd were the first numerical libraries to implement automatic differentiation in the ecosystems of productive programming languages like Python's numpy and Lua's torch, which ultimately inspired PyTorch. HIPS/autograd is used in part one of the book to emphasize the automatic differentiation abstraction was implemented before the heavy GPU acceleration PyTorch provided. For more information Soumith's PyTorch's design origins is an excellent read. and using it's direction of steepest descent to iteratively optimize the loss with stochastic gradient descent. From there, readers will unravel the mystery in how the internals of numpy and torch are implemented by building teenygrad from scratch. So in 1.3 Programming CPU Accelerated Basic Linear Algebra Subroutines with teenygrad.Tensor, teenygrad.Tensor is implemented in Python with strides that virtualize arbitrary n-dimensinal shapes to a 1-dimensional physical array. Then, the basic linear algebra operations on the Tensor are accelerated on the CPU by changing the host language from Python to Rust in order to translate programs into native machine code called assembly with the rustc compiler which enables us to analyze and tune the performance for the underlying processor's pipelines and memory hierarchiesBerkeley's opensource educational RISC-V cores sodor and BOOM cores along with Tenstorrent's RVV extension fork ocelot will be used to introduce instruction sets and microarchitecture, but x86 machines like Intel's Xeon and AMD's EPYC will be used in order to achieve server-class performance.

From there, part two of the SITP book takes you on the next step in the journey by increasing the expressivity of the models that are trained from generalized linear models to deep neural networks with torchIn part three, teenygrad will be updated once again for SITP's ultimate capstone project which modifies the eager mode interpreter implementation to a graph mode compiler in order to support the nanogpt and nanochat models. This final version of teenygrad will share 80% of the abstractions with the pure fusion compiler called tinygrad, providing a bridge to the frontier of deep learning systems.. and then evolving teenygrad's implementation to support the same neural network primitives, with GPU acceleration.

WeThe following is a software2.0-adapted excerpt from The Structure and Interpretation of Computer Programs Chapter 1: Building Abstractions with Procedures: are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of ~~rules called a program~~ probabilities called a model. People ~~create programs to direct processes~~ recover models to direct processes. In effect, we conjure the spirits of the computer with our spells.

A computational process is indeed much like a sorcerer's idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer's spells. They are carefully composed from ~~symbolic~~ numerical expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform. A computational process, in a correctly working computer, ~~executes programs precisely and accurately~~ recovers models stochastically and noisily. Thus, like the sorcerer's apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring. Even small errors (usually called bugs or glitches) in programs can have complex and unanticipated consequences.

I. Table of Contents

1. Representing Data with High Dimensional Stochasticity in `torch`

1.1 From Software 1.0 to Software 2.0

$↩$ Table of Contents

Historically speaking, the first ways in which artificially intelligent systems (like ELIZAPresented in A Computer Program For the Study of Natural Language Communication Between Man and Machine. (Weizenbaum 1966)) initially represented the meaning of words (referred to as semantics) was to simply use strings of letters (or an index into a list of strings of letters), which are discrete representationlocalist representationdiscrete localist representations. The classic and illustrative example from linguist Barbara Partee (1967) is that the meaning of life is (represented with) "LIFE". There is something left to be said about this representation of word meaning, namely, that it not ameanble to encoding multiple meanings (referred to as concepts or sensesenses) with a single word (referred to as lemmalemma). The classic example is that of WordNet, available through the Natural Language Toolkit Python package nltk, in which we can evaluate the many senses of the lemma "life":

# use wasm-based pyodide.http to fetch wordnet instead of nltk.download() which uses urllib
# see: https://pyodide.org/en/stable/usage/api/python-api/http.html
import os, pyodide.http
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip"
path = "/nltk_data/corpora"
os.makedirs(path, exist_ok=True)
response = await pyodide.http.pyfetch(url)
with open(path+"/wordnet.zip", "wb") as f: f.write(await response.bytes())

from nltk.corpus import wordnet as wn
for sense in wn.synsets("life"):
    print(f"{sense.name()}: {sense.definition()}")

In addition to the key-value dictionary-like capability of WordNet, another desiderata for a useful representation of word meaning is to provide thesaurus-like capability of evaluating linguistic relationships between word senses such as equivalance relationships like synonymsynonymyMoreso a rough approximate equivalence, since any difference in linguistic form implies difference in meaning, stated more formally as the principle of contrast (Girard 1718, Bréal 1897, Clark 1987)., and IS-A relationships like hypernymhypernymy. Again through nltk, we can evaluate WordNet's list of synonyms and hypernyms for the word "life":

# use wasm-based pyodide.http to fetch wordnet instead of nltk.download() which uses urllib
# see: https://pyodide.org/en/stable/usage/api/python-api/http.html
import os, pyodide.http
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip"
path = "/nltk_data/corpora"
os.makedirs(path, exist_ok=True)
response = await pyodide.http.pyfetch(url)
with open(path+"/wordnet.zip", "wb") as f: f.write(await response.bytes())

from nltk.corpus import wordnet as wn
life = wn.synsets("life")
hypernyms_of_life = [h for s in life for h in s.hypernyms()]
print("synonyms of life:", wn.synonyms("life"))
print("hypernyms of life:", [h.lemma_names() for h in hypernyms_of_life])

Another notion developed in the field of lexical semantics is that of similaritysimilarity, which is defined between words rather than word senses. For instance, even though the senses of "life" and "love" are very different, they are somewhat related as words. Once more through nltk, we can evaluate how similar "life" and love" with u.path_similarity(v), which evaluates the distance of the shortest path between u and v in their hypernym taxonomy graphs:

# use wasm-based pyodide.http to fetch wordnet instead of nltk.download() which uses urllib
# see: https://pyodide.org/en/stable/usage/api/python-api/http.html
import os, pyodide.http
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip"
path = "/nltk_data/corpora"
os.makedirs(path, exist_ok=True)
response = await pyodide.http.pyfetch(url)
with open(path+"/wordnet.zip", "wb") as f: f.write(await response.bytes())

from nltk.corpus import wordnet as wn
life = wn.synset('life.n.01')
love = wn.synset('love.n.01')
print("similarity between life and love", life.path_similarity(love))

Although representations such as WordNet are a major improvement over those such as string of characters like "LIFE", overall the localist symbolic representations used by rules-based expert systems like ELIZA (known as good old fashioned artificially intelligence GOFAIGOFAI) are brittle and do not generalizationgeneralize well.

distributional semanticsdistributional semantics by (Harris 1954), and (Firth 1957) "You shall know a word by the company it keeps" vector semanticsvector semantics (Osgood 1957) using a basis of valence, arousal, dominance

To build SITP's capstone model projectWhich you will train once with torch and a second with teenygrad, SITP's capstone framework project of the large language model nanochat created by Andrej Karpathy, the model represents and learns word data with high dimensional stochasticity. To prepare you for such representation learning in Part 2, in Chapter 1 you will construct the mathematical machinery to understand the following fundamental question:

Q: How do language models represent the meaning of words?
APredicted by Philosophical Investigations (Wittgenstein 1953).: As high dimensional random vectors $X = x$ in coordinate spaces $R^{d}$

and use such mathematical machinery to reproduce Tensorflow's Embedding Projector below by programming a learning algorithm for dimensionality reduction with subspace learning via principal component analysis

So in the first three subchapters proper (notwithstanding the current chapter) we place our focus on probability theory, which is used to represent uncertainty around non-deterministic — stochastic — data following the principle of distributional semantics by (Harris 1954) and (Firth 1957):

thereafter, the next four chapters introduces linear algebra, which is used to represent data in higher dimensions following the principle of vector semantics by (Osgood et al. 1957), and programming our first learning algorithm of dimensionality reduction

1.5 Learning Probabilities from Data with Parameter Estimation via Maximum Likelihood

$↩$ Table of Contents

1.6 Representing Data Dimensionally with Random Vectors of Coordinate Spaces

$↩$ Table of Contents

Q: How do language models represent the meaning of words?
A: As high dimensional random vectors $X = x$ in coordinate spaces $R^{d}$

Recall Chapter 1's guiding question from above. After the previous three chapters, with ngram language models, you now have a better understanding of the first difference and required transition between software 1.0 and software 2.0 which is to move from deterministic localist representations of ELIZA to the stochastic distributional representations of ChatGPT, following the principle of distributional semantics by (Harris 1954) and (Firth 1957).

In the next three chapters, you will come to have a better understanding of the second difference and transition which is to move from the discrete symbolic representations of ELIZA (illustrated by WordNet) strings, dictionaries, and graphs to the high dimensional representations of ChatGPT, following the principle of vector semantics by (Osgood et al. 1957).

For this, we introduce the word meaning representations learned by word2vec, in which the Tensorflow Embedding Projector below projects the coordinates from a 200-dimensional space down to 3-dimensional space for our cortex to visualize. In the remaining part of Chapter 1, our focus is on reproducing the learning algorithm that the embedding projector is performing, which is the unsupervised learning of a subspace using the dimensionality reduction of principal component analysis — for this, we introduce the wonderful language of linear algebra.

Are you ready to begin?

vectorvector $x$ Grant Sanderson's 3Blue1Brown Essence of Linear Algebra Chapter 1: Vectors

euclidean spaceeuclidean space $R^{d}$

import numpy as np
king_D = np.array([-0.60581, -0.00148, 0.33867, 0.34192, 0.14056, 0.12235, 0.29698, 0.01794, -0.32272, 0.30459, 0.46147, 0.42833, 0.35232, 0.19600, 0.07191, -0.31273, -0.35806, 0.17491, 0.15752, -0.38974, 0.15785, -0.11237, 0.16996, 0.07034, -0.29624, -0.20062, -0.11649, 0.23958, 0.12123, -0.05922, -0.25320, -0.15877, -0.04878, -0.17475, 0.17761, -0.17073, -0.04392, 0.49176, 0.20456, 0.71964, -0.24501, 0.23090, -0.20057, 0.29144, 0.34997, -0.47062, 0.13777, 0.08265, 0.38853, 0.02435, -0.15429, -0.17934, -0.06365, -0.12415, -0.10394, -0.18429, -0.35483, -0.21525, 0.05942, 0.19793, 0.09521, -0.24977, -0.33427, 0.04912, -0.24623, 0.43333, -0.12799, 0.28146, 0.04764, 0.21654, -0.27160, 0.61729, -0.05107, -0.30553, 0.29249, -0.15464, -0.14522, 0.24393, -0.18919, -0.43948, 0.52454, 0.22005, -0.44827, -0.39906, 0.40726, -0.21335, -0.07971, -0.28354, 0.11108, 0.01481, -0.37292, 0.17809, 0.03488, -0.13215, 0.17253, -0.11791, -0.13976, -0.23828, 0.22592, -0.41355, 0.12601, 0.83641, 0.18459, -0.00347, 0.28769, 0.10095, 0.19476, 0.50792, 0.04113, 0.16390, -0.23309, 0.42694, -0.06147, 0.07271, 0.07544, -0.04256, -0.04545, -0.27834, 0.24494, -0.16883, 0.15000, -0.08949, -0.05867, -0.00903, 0.14529, 0.24702, -0.18937, -0.28886, -0.24008, -0.04227, -0.37717, -0.28747, 0.26174, -0.35527, 0.03815, -0.00708, 0.09371, 0.29560, 0.41579, 0.13900, 0.04153, -0.12024, -0.06554, -0.11634, -0.07988, 0.20049, -0.17647, -0.00582, 0.42376, -0.15636, 0.27413, -0.11051, 0.00478, -0.30303, 0.10362, -0.27428, 0.08752, -0.15209, 0.41466, -0.04700, -0.21136, 0.11834, -0.16985, 0.20846, 0.20791, -0.12102, 0.00275, 0.24892, 0.11430, -0.20873, -0.07238, -0.05436, -0.41899, -0.62990, 0.23796, -0.06399, -0.49256, 0.19141, -0.37416, -0.34462, 0.25972, -0.51707, -0.46258, 0.10827, 0.16233, 0.08140, 0.22514, -0.26821, -0.59703, -0.03896, 0.04064, -0.24537, -0.12389, 0.16034, -0.30174, -0.15281, -0.33583, -0.25088, 0.15511, 0.08964], dtype=np.float32)
print("king_D.shape", king_D.shape)
print("king_D.storage", king_D)

matrix vector multiplicationmatrix multiplication $y = Ax$

data matrixdata matrix $X$

import numpy as np
try:
    from jaxtyping import Float, jaxtyped
    from beartype import beartype
except ImportError:
    import typing
    class _Subscriptable:
        def __class_getitem__(cls, _): return typing.Any
    Float = _Subscriptable
    jaxtyped = lambda typechecker=None: (lambda fn: fn)
    beartype = lambda fn: fn

# indices    (N=10000): https://raw.githubusercontent.com/tensorflow/embedding-projector-standalone/master/oss_data/word2vec_10000_200d_labels.tsv
# embeddings (D=200): https://raw.githubusercontent.com/tensorflow/embedding-projector-standalone/master/oss_data/word2vec_10000_200d_tensors.bytes
dict_string2dist: dict[str, np.ndarray] = {
  "king":    np.array([-0.60581, -0.00148, 0.33867, 0.34192, 0.14056, 0.12235, 0.29698, 0.01794, -0.32272, 0.30459, 0.46147, 0.42833, 0.35232, 0.19600, 0.07191, -0.31273, -0.35806, 0.17491, 0.15752, -0.38974, 0.15785, -0.11237, 0.16996, 0.07034, -0.29624, -0.20062, -0.11649, 0.23958, 0.12123, -0.05922, -0.25320, -0.15877, -0.04878, -0.17475, 0.17761, -0.17073, -0.04392, 0.49176, 0.20456, 0.71964, -0.24501, 0.23090, -0.20057, 0.29144, 0.34997, -0.47062, 0.13777, 0.08265, 0.38853, 0.02435, -0.15429, -0.17934, -0.06365, -0.12415, -0.10394, -0.18429, -0.35483, -0.21525, 0.05942, 0.19793, 0.09521, -0.24977, -0.33427, 0.04912, -0.24623, 0.43333, -0.12799, 0.28146, 0.04764, 0.21654, -0.27160, 0.61729, -0.05107, -0.30553, 0.29249, -0.15464, -0.14522, 0.24393, -0.18919, -0.43948, 0.52454, 0.22005, -0.44827, -0.39906, 0.40726, -0.21335, -0.07971, -0.28354, 0.11108, 0.01481, -0.37292, 0.17809, 0.03488, -0.13215, 0.17253, -0.11791, -0.13976, -0.23828, 0.22592, -0.41355, 0.12601, 0.83641, 0.18459, -0.00347, 0.28769, 0.10095, 0.19476, 0.50792, 0.04113, 0.16390, -0.23309, 0.42694, -0.06147, 0.07271, 0.07544, -0.04256, -0.04545, -0.27834, 0.24494, -0.16883, 0.15000, -0.08949, -0.05867, -0.00903, 0.14529, 0.24702, -0.18937, -0.28886, -0.24008, -0.04227, -0.37717, -0.28747, 0.26174, -0.35527, 0.03815, -0.00708, 0.09371, 0.29560, 0.41579, 0.13900, 0.04153, -0.12024, -0.06554, -0.11634, -0.07988, 0.20049, -0.17647, -0.00582, 0.42376, -0.15636, 0.27413, -0.11051, 0.00478, -0.30303, 0.10362, -0.27428, 0.08752, -0.15209, 0.41466, -0.04700, -0.21136, 0.11834, -0.16985, 0.20846, 0.20791, -0.12102, 0.00275, 0.24892, 0.11430, -0.20873, -0.07238, -0.05436, -0.41899, -0.62990, 0.23796, -0.06399, -0.49256, 0.19141, -0.37416, -0.34462, 0.25972, -0.51707, -0.46258, 0.10827, 0.16233, 0.08140, 0.22514, -0.26821, -0.59703, -0.03896, 0.04064, -0.24537, -0.12389, 0.16034, -0.30174, -0.15281, -0.33583, -0.25088, 0.15511, 0.08964], dtype=np.float32),
  "queen":   np.array([-0.30776, 0.03691, -0.18558, 0.31012, 0.11950, 0.27827, 0.55683, 0.18181, -0.58425, 0.17935, 0.64690, 0.77510, 0.28613, 0.41206, -0.37302, -0.00877, -0.04482, 0.09054, -0.34902, -0.08987, 0.21297, -0.54038, 0.03983, -0.09468, -0.39881, -0.14433, -0.32549, 0.23944, 0.03030, -0.25032, -0.08188, -0.02524, 0.01776, -0.07530, -0.03132, -0.37621, -0.01467, 0.04420, 0.18076, 0.49967, -0.46510, 0.38056, -0.03662, 0.31360, -0.12388, -0.69273, 0.45250, 0.39414, 0.10382, -0.18083, -0.29114, -0.19337, -0.04503, 0.25394, -0.22558, -0.23089, -0.25173, -0.13796, 0.10233, 0.20276, -0.04811, -0.43686, -0.44864, -0.26512, -0.06243, 0.07716, -0.12606, 0.41493, -0.30450, -0.14465, 0.21208, 0.39148, 0.03268, -0.13662, 0.19926, -0.33510, 0.21867, -0.35820, -0.18970, -0.09537, 0.12245, 0.38925, -0.38757, -0.30637, 0.54951, 0.01394, -0.08508, -0.24813, -0.10594, 0.12171, -0.50957, 0.44566, -0.08242, -0.02203, 0.21989, -0.14438, -0.05283, 0.14035, 0.26419, -0.24030, 0.22532, 1.33754, 0.37092, 0.05798, 0.30040, 0.07502, 0.34760, 0.33096, -0.04410, 0.12008, -0.04259, 0.40471, 0.02082, 0.25745, 0.17993, -0.24099, -0.06470, -0.05714, 0.27449, -0.05952, 0.18110, 0.20618, -0.27499, 0.17484, 0.28749, 0.00435, -0.16284, -0.12278, -0.19078, -0.58445, -0.14534, -0.21705, -0.09056, -0.42018, -0.43452, 0.23021, 0.12230, -0.09652, 0.26512, -0.12037, -0.07013, 0.12415, 0.07769, -0.25398, -0.01356, 0.31983, 0.00162, 0.13570, 0.63584, 0.36664, 0.58859, -0.20397, -0.03051, -0.16674, 0.16519, -0.06899, 0.28424, -0.17868, 0.24528, -0.11615, -0.05568, 0.06143, -0.24224, 0.40372, 0.27728, -0.11411, -0.19426, 0.55504, 0.20043, -0.56549, -0.28225, 0.40233, -0.52753, -0.51510, 0.23187, 0.02527, -0.11301, 0.37002, -0.23588, -0.39991, -0.04561, -0.57695, 0.13713, -0.09015, 0.20224, 0.47651, -0.11756, -0.06060, -0.42495, -0.03794, 0.19471, -0.48927, 0.05144, 0.07654, -0.36900, -0.04806, -0.13519, -0.14509, 0.39243, -0.11192], dtype=np.float32),
  "man":     np.array([-0.11405, 0.08982, 0.15057, 0.08777, -0.21986, -0.04806, 0.27677, -0.35154, -0.29647, 0.22280, 0.39584, 0.45867, 0.03449, 0.18667, 0.11873, -0.03747, -0.18381, 0.03401, 0.06344, 0.11479, 0.15629, -0.28236, 0.18581, 0.16097, 0.02858, -0.25373, -0.15130, 0.16892, -0.04158, -0.50097, 0.41814, 0.04884, 0.03952, 0.02794, 0.06206, -0.39863, 0.00382, 0.09441, 0.04298, 0.09295, -0.06042, -0.09560, -0.00657, -0.19870, -0.09326, -0.55155, -0.07719, 0.00306, 0.03371, 0.17530, 0.00693, 0.04368, 0.14678, -0.07584, -0.39679, -0.40624, 0.08573, -0.13649, 0.65182, -0.02761, 0.38170, 0.15837, 0.10637, -0.04350, -0.36671, 0.31567, 0.18769, 0.34040, -0.14083, 0.33893, -0.22524, 0.40489, -0.02898, -0.50082, 0.31294, -0.66268, 0.19765, 0.06656, -0.01488, -0.01797, 0.41898, -0.12429, -0.15202, -0.48976, 0.22793, 0.15720, -0.21968, 0.15553, 0.00309, 0.06632, -0.41443, 0.32485, -0.23725, -0.16549, -0.11508, 0.02983, 0.21747, 0.06538, 0.01130, -0.24656, -0.06608, 0.32183, 0.06266, -0.13059, 0.10060, -0.06931, 0.12669, 0.41218, 0.27707, 0.18196, 0.21397, 0.60289, -0.04913, 0.18614, -0.04532, 0.02118, 0.00050, -0.06390, -0.01631, 0.16685, 0.40851, -0.10562, -0.00391, -0.14963, 0.23022, 0.05503, 0.13441, -0.35875, -0.32551, -0.00056, 0.17382, -0.29965, 0.01298, -0.23947, -0.08002, -0.00086, 0.16424, 0.05363, 0.01317, 0.13681, 0.04181, -0.13414, -0.00148, -0.18891, 0.22703, 0.34665, -0.06275, 0.01458, -0.00501, 0.32213, 0.16952, 0.01258, -0.05818, 0.02961, -0.06529, -0.31424, -0.21646, -0.05419, 0.08787, -0.03145, 0.07413, 0.07875, -0.05121, -0.06535, -0.24616, 0.00340, 0.04033, 0.21205, 0.16022, -0.05507, 0.20916, 0.02536, -0.19208, -0.02454, 0.13160, 0.02777, -0.16148, -0.11947, -0.05418, -0.24062, 0.03667, 0.23928, -0.08993, -0.28687, -0.00904, 0.31650, 0.47483, -0.04892, -0.26943, -0.17197, 0.16030, -0.12785, -0.46808, 0.06121, -0.16568, -0.02462, -0.04194, -0.05533, -0.05638, -0.14601], dtype=np.float32),
  "woman":   np.array([-0.20223, 0.40869, -0.01241, -0.06522, 0.01981, 0.53251, 0.51102, -0.28303, -0.10660, 0.38697, 0.10596, 0.68116, 0.10824, 0.41259, -0.02591, -0.00141, 0.24525, 0.30996, 0.17783, 0.37466, 0.17955, -0.21467, 0.19712, 0.00547, 0.18816, -0.32502, -0.15192, 0.21070, -0.33975, -0.25808, 0.22532, 0.02078, -0.11179, -0.41784, 0.13709, -0.52777, -0.27931, -0.35331, -0.10013, 0.42053, -0.25775, 0.17448, 0.03469, 0.30621, -0.11002, -0.46523, -0.07408, -0.04986, 0.28117, 0.17014, -0.09855, 0.29852, 0.16785, -0.08486, -0.55263, -0.82880, 0.02791, -0.15857, 0.42570, 0.07836, 0.09784, -0.03557, 0.22153, -0.22919, -0.29406, 0.14334, 0.23074, 0.31887, -0.48970, 0.04008, 0.37743, 0.22198, -0.22764, -0.31246, 0.42123, -0.68858, 0.35948, 0.05508, -0.18357, -0.39036, 0.44232, 0.04077, 0.05094, -0.50568, 0.49748, 0.24088, -0.37117, 0.10632, -0.21735, 0.07567, -0.44182, 0.70356, -0.13529, -0.15779, -0.11706, -0.30445, 0.23941, 0.06640, -0.04049, 0.04920, 0.10792, 0.52791, 0.31310, -0.37483, 0.15181, 0.10099, 0.06929, 0.23287, 0.57277, -0.17379, 0.61427, 0.43647, 0.12352, 0.08537, 0.24576, -0.13813, -0.11501, -0.19876, -0.23390, 0.38711, 0.45066, -0.04466, -0.02871, 0.03768, 0.14445, -0.13899, 0.44499, -0.32080, -0.43134, -0.11004, -0.13632, -0.13255, 0.06400, -0.21071, -0.33964, 0.35397, 0.39092, -0.04286, -0.12524, 0.02801, -0.23665, -0.24559, 0.09938, -0.01927, -0.15342, 0.30425, -0.10204, -0.20258, -0.09698, 0.46590, 0.23120, -0.10868, -0.08905, -0.05100, 0.29452, -0.26041, -0.17525, 0.10899, 0.21647, -0.24245, 0.26683, 0.08385, 0.04004, 0.13849, -0.40952, 0.18126, -0.25566, 0.37063, 0.04051, -0.28076, -0.05131, 0.46176, -0.26021, -0.38150, 0.12434, -0.18730, -0.06230, 0.04172, 0.02004, -0.18781, 0.01769, 0.27388, -0.12347, -0.17349, -0.23045, 0.34026, -0.12734, -0.09767, -0.22604, -0.12463, 0.02584, -0.28284, -0.19144, -0.03827, -0.26933, 0.18254, -0.15987, -0.52381, 0.10156, -0.43945], dtype=np.float32),
  "paris":   np.array([-0.18519, -0.01360, -0.32597, 0.45632, 0.30088, 0.20468, -0.27163, 0.25304, -0.39749, 0.07425, 0.63972, 0.69581, -0.19967, 0.19906, -0.05836, 0.01783, -0.19658, 0.49463, -0.43111, -0.59413, 0.34516, 0.14377, -0.13093, 0.10133, 0.00088, -0.35633, 0.20741, 0.06952, -0.16522, 0.18396, 0.03026, 0.04270, -0.38351, -0.28466, 0.31205, -0.18801, -0.06391, -0.38938, -0.20132, 0.33786, -0.26304, 0.00999, 0.29849, 0.04288, -0.32341, -0.31335, -0.00479, -0.14309, 0.11750, -0.11457, 0.36283, -0.18785, 0.12942, 0.13227, -0.31007, -0.05142, 0.05190, -0.05691, 0.26242, 0.39579, 0.51209, 0.16048, -0.13767, -0.36448, -0.34478, 0.41142, -0.21112, 0.59158, -0.17962, 0.25776, 0.38184, 0.70665, 0.33705, -0.03129, 0.56412, -0.38605, -0.17437, 0.07264, -0.32980, -0.30586, 0.46573, 0.41541, -0.04550, -0.79480, 0.36952, -0.05179, -0.02684, -0.26586, -0.21004, -0.24198, -0.66758, 0.51680, 0.00700, -0.47347, -0.63521, -0.46611, 0.24545, 0.24413, -0.33283, -0.29844, 0.10242, 0.71519, 0.20355, -0.09181, 0.43851, 0.31301, -0.21880, 0.11714, 0.13327, -0.08256, 0.13364, -0.25063, -0.47738, -0.08893, 0.09649, -0.57574, 0.19138, -0.14103, 0.51915, 0.18851, 0.03010, 0.09958, -0.48569, 0.13306, 0.02912, 0.01384, 0.25923, 0.10225, -0.09040, -0.32389, -0.24441, -0.09959, 0.15180, -0.23231, 0.05933, 0.13196, 0.57919, -0.37415, 0.15076, -0.00473, -0.00915, -0.21773, 0.00185, -0.31442, -0.05701, -0.16205, -0.20571, -0.30739, 0.34166, 0.32831, 0.24891, -0.10780, 0.35190, 0.49997, -0.07121, -0.06739, 0.23996, -0.51162, 0.35696, -0.19288, -0.06150, -0.00211, -0.10440, -0.20337, 0.21916, -0.24917, -0.10438, 0.46726, 0.09295, -0.61664, -0.01790, -0.40113, -0.32674, -0.45269, -0.05442, -0.18958, -0.17513, -0.12574, 0.05589, -0.12903, 0.15018, -0.24348, 0.12981, -0.05724, 0.02806, 0.42176, -0.12478, -0.03598, -0.29495, 0.05280, 0.30015, -0.06785, -0.03343, -0.06424, 0.15256, -0.49415, -0.15664, -0.56440, -0.16290, -0.42638], dtype=np.float32),
  "france":  np.array([-0.18219, 0.06323, 0.13155, 0.19178, 0.45262, 0.07333, 0.00745, 0.13176, -0.04027, 0.18680, 0.58815, 0.72726, -0.20907, 0.29821, 0.09369, -0.04996, -0.12140, 0.24599, -0.28084, -0.37325, 0.61308, 0.02491, 0.15377, 0.35447, -0.27714, -0.10280, 0.42167, 0.28323, 0.08624, 0.04554, -0.11203, -0.09141, -0.15559, -0.12945, 0.17985, 0.05397, -0.20674, -0.00157, 0.02078, 0.64305, 0.10637, 0.12813, -0.03183, 0.26590, -0.00545, -0.20655, 0.19704, -0.04977, -0.16935, 0.06854, -0.15001, -0.07362, 0.09009, 0.10963, -0.45723, -0.00408, 0.08126, -0.33670, 0.11713, 0.31598, 0.20151, 0.11635, -0.11566, -0.40446, -0.30366, 0.06843, 0.03220, 0.33058, 0.09078, 0.13582, 0.28371, 0.38705, 0.27897, 0.09391, 0.26882, -0.26058, 0.00064, -0.21219, -0.12738, -0.40399, 0.40183, 0.39339, 0.08117, -0.32321, 0.22676, 0.13104, -0.10246, -0.52784, -0.25873, -0.09247, -0.37017, 0.41755, -0.14285, 0.15307, -0.35519, -0.03849, 0.13401, 0.13112, -0.35291, -0.36664, 0.03136, 0.80553, 0.41274, 0.01832, 0.22198, 0.05979, -0.00397, 0.11959, -0.02017, 0.50051, 0.07584, -0.12657, -0.01628, -0.00025, 0.09829, -0.26941, 0.02597, -0.12949, 0.27935, -0.23689, -0.01785, 0.19605, -0.41837, 0.07977, -0.05644, 0.01962, 0.20559, -0.07866, -0.28765, -0.26882, -0.10859, 0.15512, -0.00888, -0.09430, -0.03187, 0.10582, 0.19566, 0.09776, 0.06356, 0.08710, -0.15624, 0.04377, 0.07377, -0.17253, 0.03722, -0.02852, -0.15565, -0.19175, 0.32213, 0.44915, 0.32993, -0.63005, 0.23579, 0.14376, 0.12855, 0.14385, 0.16940, -0.47164, 0.24567, -0.18858, -0.03186, -0.09289, -0.61278, -0.16884, 0.22779, -0.02604, 0.23900, 0.04567, 0.08428, -0.32738, -0.19242, -0.10240, -0.27779, -0.24567, 0.24114, -0.39621, -0.44449, 0.20193, -0.14436, 0.03190, 0.38711, -0.41903, -0.11760, 0.08490, 0.23199, 0.61604, -0.13850, -0.01056, -0.50419, 0.36371, 0.59154, -0.23931, -0.13358, 0.17867, -0.08431, -0.26031, 0.18438, -0.27460, -0.15105, -0.13703], dtype=np.float32),
  "rome":    np.array([-0.11204, 0.22966, -0.13668, 0.42749, 0.39183, 0.47309, 0.25666, 0.07822, -0.24798, -0.13575, 0.46827, 0.06815, -0.17023, 0.39090, -0.08079, -0.29234, -0.32003, 0.23166, -0.12039, -0.17116, 0.21441, -0.35751, -0.18170, -0.29657, -0.07359, -0.20686, 0.12989, 0.12463, -0.10202, -0.23102, -0.04400, -0.12517, -0.03459, -0.12223, 0.31041, -0.14974, -0.39004, -0.13628, 0.06586, 0.34412, 0.04909, 0.02732, 0.04259, -0.23218, 0.16646, -0.36989, -0.24263, -0.07604, 0.09067, 0.02604, -0.23639, -0.04283, 0.42932, 0.21634, -0.40271, 0.07764, 0.13056, -0.06083, 0.19618, 0.54013, 0.14888, -0.24765, -0.25302, -0.30739, 0.01029, 0.29831, 0.07945, 0.45129, 0.22313, 0.07088, 0.33073, 0.53051, -0.02538, -0.34967, 0.07828, 0.00247, -0.39456, -0.02917, -0.15832, -0.11131, 0.80995, 0.23170, 0.13627, -0.43837, 0.42694, -0.17925, -0.19091, -0.15370, -0.09075, 0.00463, -0.25978, 0.44907, 0.12163, -0.10187, -0.29649, -0.16680, -0.16898, -0.18889, -0.05334, -0.68051, -0.21924, 1.08901, -0.08938, -0.01084, 0.64356, 0.07706, -0.38139, 0.38721, 0.38537, 0.06449, 0.02829, -0.11856, -0.18428, 0.23894, 0.32629, -0.79727, -0.02919, -0.08177, 0.06304, -0.30370, 0.06447, -0.09200, -0.16051, 0.16706, 0.28596, 0.24376, 0.01903, -0.24568, -0.15003, 0.03472, -0.18781, -0.06927, 0.09267, 0.14936, -0.16940, -0.02855, 0.08061, -0.12080, 0.02909, 0.04295, 0.25802, -0.37922, 0.32467, -0.24179, 0.10241, -0.42759, -0.26248, -0.39167, -0.29641, 0.28513, 0.25339, -0.46392, 0.38545, 0.01698, -0.02567, -0.01231, -0.10522, -0.01214, 0.57213, -0.07360, -0.17799, 0.11318, -0.23048, 0.37151, -0.08626, 0.04074, 0.04089, 0.59131, 0.11168, -0.08926, 0.06659, -0.22881, -0.31912, -0.39018, -0.18094, -0.27141, -0.33712, -0.07491, -0.19420, -0.13975, 0.37040, 0.00748, -0.23212, -0.27160, 0.00709, 0.19954, -0.19594, -0.06819, -0.33766, 0.28512, 0.11063, -0.17811, -0.39012, -0.00868, 0.18033, 0.10791, -0.16924, -0.43922, -0.42524, -0.02648], dtype=np.float32),
  "italy":   np.array([-0.15340, 0.01927, 0.16926, 0.41839, 0.53047, 0.17460, 0.39970, -0.03224, -0.20870, 0.22605, 0.59484, 0.17090, -0.38235, 0.31133, 0.20538, -0.04493, -0.27729, 0.32064, -0.05734, -0.34299, 0.50433, 0.03479, 0.20769, 0.13809, 0.01233, -0.25943, 0.10879, 0.46709, -0.00929, -0.26202, -0.30844, -0.18392, -0.03692, 0.00350, -0.09822, -0.10232, -0.51924, -0.31376, -0.10353, 0.38236, 0.19212, -0.14145, 0.08053, 0.05962, -0.30530, -0.51329, -0.32229, -0.07294, -0.26528, 0.22517, -0.12624, 0.25401, 0.08901, 0.04630, -0.25246, -0.51830, 0.02983, -0.28021, 0.20391, 0.49642, 0.14568, 0.01941, -0.07164, -0.17974, -0.27320, 0.35721, 0.07217, 0.40795, 0.07221, 0.00986, 0.13876, 0.50403, 0.01540, 0.03218, 0.25567, -0.27005, -0.25822, -0.18744, -0.27059, -0.21146, 0.45264, 0.23391, 0.03439, -0.15279, 0.33551, 0.13907, -0.30575, -0.41350, -0.33525, -0.19169, -0.60787, 0.42002, 0.04735, 0.07948, -0.24537, 0.16873, -0.01630, -0.21978, -0.19681, -0.31645, 0.14211, 0.90986, 0.27263, 0.08888, 0.58419, 0.08633, 0.18484, -0.02439, 0.22421, 0.22149, -0.06583, -0.32640, -0.54394, 0.03884, 0.01773, -0.72140, 0.05312, 0.15851, 0.34632, -0.33199, 0.07043, 0.31847, -0.33012, 0.02117, -0.03411, 0.11081, 0.16380, 0.03609, -0.28953, -0.04240, -0.17554, 0.11706, -0.06782, 0.06808, -0.40821, -0.01016, -0.00167, -0.04228, 0.16646, 0.24710, -0.04438, 0.01156, 0.25375, -0.14640, 0.04143, -0.34428, -0.22431, -0.47259, 0.16204, 0.67493, 0.23886, -0.45043, 0.26658, 0.04719, -0.21568, -0.09716, 0.27766, -0.14342, 0.48150, -0.17313, -0.15145, 0.20802, -0.69512, -0.02889, -0.02191, 0.02746, 0.05462, 0.20904, -0.04408, -0.51693, -0.09057, -0.19239, -0.23878, -0.43717, 0.11679, -0.19186, -0.32573, 0.14074, -0.16839, 0.16045, 0.46155, -0.07977, -0.41751, -0.02728, 0.00373, 0.53923, -0.16084, 0.02926, -0.49980, 0.07807, 0.39652, -0.47138, -0.22931, -0.16463, -0.12133, -0.06451, 0.08685, -0.50419, -0.06190, -0.04058], dtype=np.float32),
  "cat":     np.array([0.26959, 0.28885, -0.25441, 0.29609, -0.17498, 0.61828, 0.42658, -0.33028, -0.55416, 0.24162, 0.11672, 0.56325, 0.49905, 0.27891, 0.03885, -0.33034, -0.24380, 0.55970, -0.33500, 0.15236, 0.19751, -0.33009, -0.08106, 0.41318, 0.11463, -0.15276, -0.18670, 0.41262, -0.41940, -0.57349, 0.51915, -0.35298, 0.08269, -0.37676, -0.10860, -0.47629, -0.02687, 0.15058, 0.06812, -0.08056, -0.11488, -0.29507, 0.26964, 0.18912, 0.12296, -0.73252, 0.54527, 0.03410, -0.13766, -0.18625, -0.02463, -0.15198, -0.15954, 0.22369, -0.21203, -0.32467, 0.21283, -0.24604, 0.23843, 0.30972, 0.29833, -0.47957, 0.36598, 0.08849, -0.81814, 0.47101, 0.33824, 0.43682, -0.43284, 0.14764, 0.50940, 0.14663, 0.08716, -0.53612, 0.69195, -0.67798, 0.12627, 0.19669, -0.23032, 0.12126, -0.05029, -0.23783, -0.24113, 0.08685, 0.41477, 0.21425, -0.01235, -0.14322, -0.22114, -0.06646, -0.58032, 0.73164, -0.04452, -0.47726, -0.40563, 0.44273, 0.02877, 0.12494, 0.27107, -0.00345, 0.08363, 0.24003, 0.48798, -0.19554, 0.28059, -0.34323, -0.09547, 0.27878, -0.00391, 0.07340, -0.01744, 0.31574, -0.04113, 0.39985, -0.29742, -0.32761, 0.09009, -0.37477, 0.29915, 0.34550, 0.61891, 0.10591, 0.12979, 0.21423, 0.44679, -0.33396, 0.38823, 0.15324, -0.20648, 0.12805, 0.27564, -0.31365, 0.45786, -0.13229, 0.02166, -0.27080, 0.04489, -0.39346, 0.21707, -0.00268, 0.13500, 0.21508, 0.29841, -0.78957, -0.13894, 0.00594, -0.44211, 0.09468, 0.24073, 0.26808, 0.68818, 0.25129, 0.02253, 0.39138, 0.16902, -0.31680, 0.86300, -0.45429, 0.03778, -0.43180, -0.41733, 0.18255, -0.57203, 0.74918, -0.34643, 0.17222, -0.12939, 0.03335, -0.09112, -0.47202, 0.05456, -0.32359, -0.36980, 0.03217, 0.65667, 0.29583, -0.12857, -0.66626, -0.32521, -0.63238, 0.39975, 0.12622, -0.45418, 0.06869, -0.01442, 0.31165, -0.39287, 0.24561, -0.44056, -0.07728, 0.10282, -0.40999, -0.49304, -0.30354, 0.27296, 0.48230, -0.25617, -0.51907, 0.45467, -0.10298], dtype=np.float32),
  "dog":     np.array([0.22407, 0.19778, -0.12492, 0.02943, -0.01023, 0.30840, 0.47067, -0.09805, -0.24328, 0.48196, -0.05229, 0.67571, 0.28807, 0.28325, -0.07169, 0.07745, -0.50867, 0.33654, -0.31787, 0.21052, -0.05775, -0.40865, 0.29858, 0.16141, -0.22317, -0.16911, -0.06106, 0.17131, -0.45845, -0.23939, 0.02698, 0.24134, 0.10973, -0.17744, -0.03557, -0.17743, 0.09569, 0.08071, 0.49728, -0.24340, -0.02540, 0.07004, 0.41042, -0.03981, 0.17039, -0.64987, 0.03488, 0.04470, 0.07692, 0.08220, 0.02484, -0.28589, 0.37340, -0.10491, -0.32158, -0.33000, -0.21721, -0.07934, 0.43070, 0.18908, 0.37931, -0.13734, 0.39928, -0.08460, -0.66089, 0.37540, 0.46753, 0.48962, -0.21238, 0.11802, 0.26598, 0.01917, 0.59764, -0.42530, 0.65706, -0.36850, 0.22453, 0.14608, -0.08320, -0.04758, 0.15790, -0.61434, 0.05107, -0.15172, 0.09765, -0.14117, -0.34464, 0.08492, -0.23829, -0.07973, -1.01947, 0.40416, -0.21399, -0.47815, -0.16949, -0.08467, 0.14683, -0.06763, 0.24079, 0.40548, -0.28201, 0.33629, -0.06622, -0.39812, 0.27318, 0.21343, 0.22216, 0.69422, 0.02870, 0.14304, -0.15980, 0.75023, -0.00934, -0.02005, -0.18365, 0.21365, 0.05600, -0.39873, 0.36346, 0.08854, 0.39175, -0.05878, -0.10227, 0.05216, 0.30803, -0.17737, 0.23768, 0.18228, -0.17957, -0.11890, -0.23524, -0.37678, 0.17571, -0.24524, -0.25780, -0.02954, 0.08782, -0.07055, 0.11858, 0.17058, 0.09321, 0.04601, 0.01212, -0.47209, -0.34194, 0.14436, -0.80984, 0.35469, 0.50964, -0.37567, 0.79713, 0.03317, -0.02606, -0.17922, -0.14554, -0.22435, 0.17604, -0.42443, -0.06444, 0.05082, -0.42109, -0.40434, -0.31260, 0.26361, -0.00811, 0.14489, 0.23344, 0.20476, 0.09684, -0.40022, -0.07501, -0.28960, -0.49280, -0.43434, 0.45920, 0.08183, 0.01575, -0.15572, -0.47815, -0.16507, 0.31122, 0.47231, -0.34021, -0.26883, 0.02138, 0.38776, 0.20281, -0.10099, -0.55102, -0.15751, 0.29602, -0.26543, -0.40633, -0.05946, -0.14480, -0.35555, 0.14056, -0.35140, 0.20284, 0.17339], dtype=np.float32),
  "table":   np.array([-0.24251, -0.02881, -0.04184, 0.11320, 0.07676, 0.31765, 0.04621, 0.27889, 0.08690, -0.43657, 0.32255, 0.05912, -0.05635, -0.20437, -0.13043, -0.19339, -0.05859, 0.63758, -0.51283, 0.17039, 0.11268, -0.09266, 0.09368, -0.25822, -0.08813, 0.00769, -0.36577, 0.60987, 0.13547, -0.30731, 0.10317, -0.39184, -0.13124, 0.07061, 0.08820, -0.38111, -0.17262, -0.10532, -0.14675, -0.21426, -0.32254, 0.16817, 0.04332, -0.00583, -0.05811, -0.72103, -0.35971, -0.07069, -0.20500, -0.53578, 0.06537, 0.01355, -0.03441, 0.16508, -0.41717, -0.48964, -0.02400, 0.08731, 0.23961, 0.56417, -0.06052, -0.19098, 0.24184, -0.14215, -0.08555, 0.35112, -0.05380, -0.16210, 0.18068, 0.14535, 0.34078, 0.81779, 0.13351, -0.44080, 0.17847, -0.33536, -0.02692, 0.23795, -0.48981, 0.27085, 0.21318, -0.26794, -0.47646, -0.47865, 0.43042, 0.12462, -0.23047, -0.00634, -0.44237, -0.54118, -0.42910, 0.51173, 0.15166, 0.06616, -0.00050, 0.04257, 0.38788, 0.20401, 0.14725, -0.34527, 0.26782, 0.41257, 0.26215, -0.13500, -0.25081, 0.19053, 0.11008, 0.48502, 0.03358, -0.07010, 0.21036, 0.02110, -0.75852, 0.17353, -0.04961, 0.08930, -0.22679, -0.26382, 0.08666, 0.02339, -0.04276, 0.14007, 0.20233, -0.22444, -0.13015, -0.26452, 0.22854, -0.19656, 0.08982, -0.11065, -0.01083, -0.14230, 0.55542, -0.40077, 0.43193, -0.01451, 0.42432, 0.15159, 0.09869, -0.31474, -0.09218, 0.34171, 0.00812, 0.04699, -0.58361, 0.44378, -1.12451, 0.19731, 0.13642, 0.32610, -0.03793, -0.12543, 0.33448, 0.08818, -0.39563, 0.49534, -0.14100, -0.56391, 0.18019, -0.09507, -0.28584, -0.23887, -0.15002, 0.24602, -0.17264, -0.82555, -0.39407, 0.01208, 0.13483, -0.50127, 0.43927, -0.47681, -0.05441, -0.22195, 0.20945, -0.32043, -0.16494, -0.12170, -0.05999, 0.00696, -0.03224, 0.15639, -0.50188, -0.17326, 0.19615, 0.53008, -0.08557, 0.25442, -0.23749, -0.42338, 0.22123, 0.12498, -0.11213, -0.41178, -0.18820, -0.14456, -0.05304, -0.37926, 0.32974, -0.46066], dtype=np.float32),
  "actor":   np.array([-0.34081, 0.29496, 0.06248, 0.02581, 0.28570, 0.21760, 0.21162, -0.15345, -0.47740, 0.71132, 0.16290, 0.47373, 0.12448, 0.37240, 0.18513, 0.40150, -0.50863, 0.52983, -0.31956, 0.19327, 0.35366, -0.28345, 0.34000, -0.19229, 0.34402, -0.67792, 0.04072, 0.53878, -0.02601, -0.50897, -0.09529, -0.11532, -0.27105, -0.36717, -0.15617, -0.63148, -0.05268, 0.25237, -0.25513, 0.79959, -0.33079, -0.21113, -0.04093, 0.24306, -0.24102, -0.49024, -0.31635, 0.51906, 0.08259, -0.07093, 0.05762, -0.31068, 0.19975, -0.29971, 0.24759, -0.54989, -0.47743, -0.41719, 0.47868, 0.89240, 0.28170, -0.01673, 0.02956, -0.14534, -0.33202, -0.18015, -0.02318, 0.19072, -0.12974, 0.05497, 0.18587, 0.23634, 0.20169, -0.04462, 0.04632, -0.17732, 0.27773, 0.26587, -0.70917, -0.15072, 0.20083, -0.30606, 0.12365, -0.91372, 0.38745, 0.44369, -0.45890, -0.62393, -0.87489, -0.12528, -0.84243, 0.71368, -0.13824, -0.30721, -0.45125, -0.02410, -0.08495, 0.14208, 0.21343, 0.11456, 0.57746, 0.51518, 0.74713, -0.44282, 0.46277, 0.26860, 0.26077, 0.23483, -0.19523, -0.11328, -0.28465, 0.57823, -0.35354, 0.14874, -0.03730, 0.06788, -0.03338, -0.14657, 0.87178, 0.28849, 0.43849, 0.27989, -0.17953, 0.43501, -0.30121, -0.32942, -0.35602, -0.24987, -0.28101, -0.39073, 0.21141, -0.08250, 0.44348, -0.14257, 0.11083, 0.42711, 0.48078, -0.40088, 0.46692, 0.35036, -0.07102, -0.33256, 0.15478, -0.21769, -0.43711, 0.06046, -0.09511, -0.13849, -0.45530, 0.62191, 0.30547, -0.24135, -0.22935, 0.14534, 0.24329, -0.02397, 0.53990, -0.10288, 0.24224, -0.39644, -0.42022, 0.20305, 0.10345, -0.29594, -0.10168, 0.30617, 0.27342, 0.73330, 0.33558, -0.46253, 0.52831, -0.53941, -0.71974, -0.17000, 0.33471, 0.31247, -0.39216, 0.04075, -0.16177, -0.56716, 0.34696, -0.04393, -0.14483, -0.43536, 0.20816, 0.60463, 0.20165, -0.21832, -0.35554, -0.41931, -0.18691, -0.31201, -0.52478, 0.25814, -0.37204, -0.06317, -0.00777, -0.26136, 0.12635, -0.34427], dtype=np.float32),
  "actress": np.array([-0.03609, 0.48259, 0.10655, 0.15476, 0.74611, 0.35226, 0.26543, -0.09824, -0.74194, 0.80526, 0.48950, 0.66349, -0.04611, 0.19693, 0.01656, 0.38376, -0.47511, 0.58739, -0.42210, 0.36695, 0.23638, -0.13260, 0.11278, -0.07940, 0.65258, -0.78240, -0.19338, 0.22543, 0.03659, -0.34090, -0.35696, -0.05129, -0.15844, -0.55043, -0.11435, -0.78682, -0.05935, 0.09748, -0.36884, 0.66672, -0.33169, -0.22609, -0.27201, 0.42125, -0.33582, -0.50228, -0.41939, 0.29667, 0.25257, 0.07340, 0.04644, -0.31010, 0.50941, -0.25233, 0.04330, -0.33538, -0.34326, -0.35768, 0.31392, 0.94899, -0.00139, -0.22785, -0.27498, 0.01661, -0.42216, -0.27224, 0.08699, 0.09866, -0.15192, 0.05982, 0.09475, 0.45494, 0.09575, -0.27159, -0.07097, -0.53471, 0.38938, -0.05162, -0.69329, -0.03841, 0.18119, 0.01944, 0.06560, -1.02032, 0.21652, 0.43684, -0.66284, -0.68760, -0.74737, -0.10185, -0.65727, 0.47983, -0.04231, -0.61117, -0.52429, -0.08394, -0.11283, 0.16290, 0.03889, 0.29620, 0.63489, 0.61709, 0.62876, -0.45758, 0.09166, 0.04892, 0.30382, 0.02665, -0.12999, -0.27397, -0.28191, 0.82811, -0.17342, 0.15485, 0.21338, -0.16891, -0.08952, -0.11762, 0.96667, 0.28835, 0.29522, 0.17668, -0.52001, 0.31824, -0.65100, -0.52572, -0.23879, 0.13706, -0.26424, -0.37772, 0.33564, -0.39051, 0.37396, -0.41123, -0.19594, 0.23036, 0.26708, -0.66899, 0.37358, 0.33660, -0.41522, -0.12397, 0.26031, -0.21327, -0.54083, -0.04208, -0.34020, 0.04574, -0.43365, 1.02148, 0.61688, -0.26216, -0.53543, 0.11807, 0.18984, -0.05739, 0.32445, -0.13142, 0.56787, -0.50690, -0.35184, 0.04989, 0.07077, -0.01400, 0.07540, 0.44800, 0.13940, 0.76294, 0.13431, -0.64686, 0.21227, -0.18815, -0.90020, -0.28664, 0.35314, 0.40869, -0.25572, 0.25900, 0.17797, -0.81918, 0.12856, -0.18146, 0.12532, -0.24020, 0.36360, 0.87130, -0.06650, -0.24994, -0.14889, -0.34492, -0.04197, -0.29965, -0.32774, 0.17399, -0.36773, 0.09036, -0.06386, -0.28132, 0.17017, -0.33808], dtype=np.float32),
}

vocab = list(dict_string2dist.keys())
N = len(vocab)

# f: produces distributional embedding representations in R^D given localist one-hot representation in R^N
@jaxtyped(typechecker=beartype)
def f(x_N: Float[np.ndarray, "N"]) -> Float[np.ndarray, "D"]:
  # E_DV = np.stack([dict_string2dist[i] for i in vocab], axis=1)
  E_ND = np.stack([dict_string2dist[i] for i in vocab])
  y_D = E_ND.T @ x_N
  return y_D

xonehot_N = np.eye(N)[vocab.index("man")] # row-wise plucking basis vector from identity matrix I
yembedding_D = f(xonehot_N)
print("xonehot_N.shape", xonehot_N.shape)
print("xonehot_N.storage", xonehot_N)
print("f(xonehot_N) => yembedding_D.shape", yembedding_D.shape)
print("f(xonehot_N) => yembedding_D.storage", yembedding_D)
assert np.allclose(yembedding_D, dict_string2dist["man"])

# f_batched: produces batched distributional embedding representations in R^{BxD} given batched localist one-hot representation in R^{BxN}
@jaxtyped(typechecker=beartype)
def f_batched(X_BN: Float[np.ndarray, "B N"]) -> Float[np.ndarray, "B D"]:
  E_ND = np.stack([dict_string2dist[i] for i in vocab])
  y_BD = X_BN @ E_ND 
  return y_BD

Xonehot_BN = np.eye(N)[[vocab.index("king"), vocab.index("man"), vocab.index("woman")]]
Yembedding_BD = f_batched(Xonehot_BN)

print("Xonehot_BN.shape", Xonehot_BN.shape)
print("Xonehot_BN.storage", Xonehot_BN)
print("f_batched(Xonehot_BN) = Yembedding_3D.shape", Yembedding_BD.shape)
print("f_batched(Xonehot_BN) = Yembedding_3D.storage", Yembedding_BD)
assert np.allclose(Yembedding_BD[0], dict_string2dist["king"])
assert np.allclose(Yembedding_BD[1], dict_string2dist["man"])
assert np.allclose(Yembedding_BD[2], dict_string2dist["woman"])

Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King - Man + Woman” results in a vector very close to “Queen.”

following the mikolov paper..we can ask questions.. if x is a linear combination..

vector space $V$ vector space $V$ Grant Sanderson's 3Blue1Brown Essence of Linear Algebra Chapter 16: Abstract Vector Spaces

abbrev VectorSpace (K V : Type) [Field K] [AddCommGroup V] := Module K V

linear combinationlinear combinationGrant Sanderson's 3Blue1Brown Essence of Linear Algebra Chapter 2: Linear Combinations, Span, and Basis Vectors

def is_linear_combination (S : Set V) (x : V) : Prop :=
  ∃ (s : Finset V) (f : V → K), (↑s ⊆ S) ∧ (x = Finset.sum s (fun v => f v • v))

spanspan

linear independancelinear independance

basisbasis

standard basisstandard basis

rankrank

dimensiondimension

(show there is no exact linear combination)

(so we need to induce a geometry (length, distance, and angle)) with a norm norms in R^n can be induced with inner products (but not all) inner productinner product inner product spaceinner product space $(V, < \overset{,}{˙} \cdot >)$

class InnerProductSpace_v (V : Type) [AddCommGroup V] [VectorSpace ℂ V] where
  inner : V → V → ℂ
  inner_self_im_zero : ∀ (v : V), (inner v v).im = 0
  inner_self_nonneg : ∀ (v : V), 0 ≤ (inner v v).re
  inner_self_eq_zero : ∀ (v : V), inner v v = 0 ↔ v = 0
  inner_add_left : ∀ (u v w : V), inner (u + v) w = inner u w + inner v w
  inner_smul_left : ∀ (a : ℂ) (v w : V), inner (a • v) w = a * inner v w
  inner_conj_symm : ∀ (v w : V), inner v w = conj (inner w v)

normnorm

lengthlength

distancedistance

matrix vector multiplicationmatrix multiplication $y = Ax$

function matrixfunction matrix $y = Wx$

print("nearest", nearest(xqueen_D, exclude=("king", "man", "woman")))  # ['queen', ...]
print(nearest(xrome_D, exclude=("paris", "france", "italy")))  # ['rome', ...]
# --- linear combinations ---
# any weighted sum of vectors is a linear combination: c0*v0 + c1*v1 + c2*v2
c_3 = np.array([1.0, -1.0, 1.0])
X_3D = np.stack([dict_string2dist["king"], dict_string2dist["man"], dict_string2dist["woman"]])  # (3, 50)
xqueen_D = c_3 @ X_3D  # (3,) @ (3, 50) → (50,)
print("xqueen_D.shape", xqueen_D.shape)
print("linear combination nearest:", nearest(xqueen_D, exclude=("king", "man", "woman")))  # queen

# midpoint: equal weight on man and woman → "generic human", equidistant from both
xhuman_D = 0.5 * dict_string2dist["man"] + 0.5 * dict_string2dist["woman"]
print("midpoint nearest:", nearest(xhuman_D))  # man and woman should top the list



# --- span ---
# is queen in the span of {king, man, woman}?
# find least-squares coefficients: queen ≈ c0*king + c1*man + c2*woman
A_D3 = np.stack([dict_string2dist["king"], dict_string2dist["man"], dict_string2dist["woman"]]).T  # (50, 3)
cqueen_3, _, _, _ = np.linalg.lstsq(A_D3, dict_string2dist["queen"], rcond=None)
xqueenapprox_D = A_D3 @ cqueen_3
print("cqueen_3.shape", cqueen_3.shape, "cqueen_3", cqueen_3)  # near (+1, -1, +1)
print("queen residual norm:", np.linalg.norm(dict_string2dist["queen"] - xqueenapprox_D))  # small

# is table in the span of {cat, dog}? (unrelated subspace — residual should be large)
A_D2 = np.stack([dict_string2dist["cat"], dict_string2dist["dog"]]).T  # (50, 2)
ctable_2, _, _, _ = np.linalg.lstsq(A_D2, dict_string2dist["table"], rcond=None)
xtableapprox_D = A_D2 @ ctable_2
print("table residual norm:", np.linalg.norm(dict_string2dist["table"] - xtableapprox_D))  # large

assert np.linalg.norm(dict_string2dist["queen"] - xqueenapprox_D) < np.linalg.norm(dict_string2dist["table"] - xtableapprox_D)

1.7 Matrix Factorizations: Eigenvalues, Singular Values, and their Decompositions

$↩$ Table of Contents

angleangle projectionprojection orthogonalityorthogonality

1.8 Learning Subspaces from Data with Principal Component Analysis via Maximum Variance...

$↩$ Table of Contents

2. Learning Functions from Data with Optimization in `torch`

In chapter 1, you've initiated your transition from deductive software 1.0 to inductive software 2.0 by:

using random variables instead of deterministic variables by distributing the truth of types onto a weighted set of events and representing such types in high dimensional coordinate systems
recovering programs from data rather than specifying programs with instructions by estimating the parameters of distributions with maximum likelihood and maximum a posteriori estimation

And so now, the formulation of a computer program learning from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ starts to make more sense with respect to learning distributions from data. In chapter 2, we extend the notion of experience $E$ and tasks $T$ to not just learn distributions $X$ from data, but that of learning functions $f : X \to Y$ . So instead of recover a single random variable $X$ from data like in chapter 1, we will recover two random variables $X$ and $Y$ to construct linear functions $f : X \to Y$ (todo, maybe rephrase. depending on how the link between ERM of squared error and MLE is presented).

More formally, given observations of input-outputWe stick to the domain neutral terminology of inputs and outputs. However the input-output pairs $(x^{(i)}, y^{(i)})$ are also referred to as predictors/response and features/labels pairs comprising some dataset $D = {(x^{(i)}, y^{(i)})}_{i = 1}^{n}$ , with function approximation, learn some mapping from input space to output space $f : X \to Y$ , and evaluate such learned mapping $\overset{y}{^} = f (x^{n + 1})$ in order to predict new outputs given unseen input well, referred to as generalization. When historical data of the expected output $y^{(i)}$ given some input $x^{(i)}$ is available, we refer to this regime as supervised learning, for the learning algorithm is being "supervised" with the correct answerAs opposed to unsupervised learning which we'll cover in part II.

However, roughly speakingOrdinal data can also exist in addition to numerical and categorical data(todo), there are two settings of supervised learning depending on the type of the output space $Y$ that the learned function $f$ must map to (with the input space $X$ usually being the vector space of $R^{d}$ ). If the phenomena being modeled admits quantitative outputs, then the function that needs to be learned takes the form $f : R^{d} \to R$ and is referred to as regressionThe origin of the name is instructive (Stigler, 1986). It comes from 19th century investigations into the relationship between the attributes of parents and their children. People who are taller (heavier, faster,...) than average tend to have children who are also taller than average, but not quite as tall., whereas if it admits qualitative outputs, then the function that needs to be learned takes the form $f : R^{d} \to {1, \dots, k}$ and is referred to as classification.

In the next two chapters of Chapter 2. Learning Functions From Data with pytorch will implement linear models for both cases of regression and classification with torch, and Chapter 3. Accelerating Basic Linear Algebra Subroutines with teenygrad.Tensor will re-implement them with our own deep learning framework teenygrad.

2.1 From Recovering Data to Recovering Functions

2.2 Predicting Scalars by Linearly Modeling Regression

$↩$ Table of Contents

Similarly to learning distributions $X$ from data, learning functions $f : X \to Y$ from data roughly follows the three step process of selecting a model for the task $T$ , a performance measure $P$ , and optimizing such measure on experience $E$ . Let's take a look at the linearly modeling regression with squared error loss, a problem which straddles tools from all three languages of probability theory, linear algebra, and calculus.

For the task of house price prediction, the input space $X = R^{d}$ is the size in square feet, and the output space $Y = R$ is the price, meaning that the function which needs to be recovered has the type of $f : R^{d} \to R$ . An assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is to assume that there exists some linear relationship between the data $x$ and parameters $w$ so that $f$ ends up being modeled as an inner product plus bias, parameterized as $f (x; w, b)$ : $f f (x; w, b) : R^{d} \to R : = (w_{1} x_{1} + w_{2} x_{2} + \dots + w_{n} x_{n}) + b = (i = 1 \sum d w_{i} x_{i}) + b = w^{⊤} x + b$

One small adjustment we can make to $f$ is to fold the final bias term $b$ into the vector $w$ as $w_{1}$ , increasing the total dimensionality so that $w \in R^{d + 1}$ , and adjusting $x$ accordingly so that $x_{1} := 1$ and $x \in R^{d + 1}$ . Then, the equation of our line with bias is simply parameterized as $f (x; w)$ $f f (x; w) : R^{d + 1} \to R : = w^{⊤} x$

Semantically speaking, each property of the input house $x_{i} \in R^{d}$ is weighted by some weight $w_{i}$ , adjusted accordingly during the learning process so that it accurately reflects the property's influence on the output price $y \in R$ when evaluating a prediction $\overset{y}{^}$ . We can evaluate $f$ on all $x^{(i)} \in D$ with a randomly intialized $w$ to ensure we've wired everything up correctly.

However, while specifying the equation of the linear regression model on ~~paper~~ screen and subsequently evaluating the computation manually might have been sufficient for pre-computational mathematicians such as Gauss and Legendre, the discipline of statistical learning is entirely predicated on computational methods. Let's continue to use PyTorch's core n-dimensional array datastructure with torch.Tensor, this time modeling a function $f : X \to Y$ rather than some distribution $X$ :

(todo: maybe change example to closure following torch.nn)

import torch
def f(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
  return torch.dot(x, w)

in which we can update the names of our torch.Tensor variables with the ranks of their vector spacesReferred to as Shape Suffixes, described by Noam Shazeer, an engineer from Google to clarify what dimensions these operations are evaluated on:

import torch
def f_shapesuffixed(x_D: torch.Tensor, w_D: torch.Tensor) -> torch.Tensor:
  return torch.dot(x_D, w_D)

and finally, actually enforce them via runtime typechecking with the jaxtyping package (try swapping the return statements below to trigger a type error with the return type):

import torch
from jaxtyping import Float, jaxtyped
from beartype import beartype

@jaxtyped(typechecker=beartype)
def f_typechecked(
  x_D: Float[torch.Tensor, "D"],
  w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
  # return X_ND
  return torch.dot(x_D, w_D)

Let's now sanity-check that f_typechecked is wired up correctly by evaluating it on all $x^{(i)} \in D$ with random parameter w_D. (todo: use actual housing data)

import torch
if __name__ == "__main__":
  X_ND = torch.tensor([
    [1500, 10, 3, 0.8],  # house 1
    [2100, 2,  4, 0.9],  # house 2
    [800,  50, 2, 0.3]   # house 3
  ], dtype=torch.float32)
  Y_N = torch.tensor([500000, 800000, 250000], dtype=torch.float32)

  w_D = torch.randn(4); print(f"random weight vector: {w_D}")
  for (xi_D,yi_1) in zip(X_ND,Y_N):
    yihat_1 = f_typechecked(xi_D,w_D)
    print(f"expected: ${yi_1}, actual: ${yihat_1:.2f}")

random weight vector: tensor([-0.3343,  0.0768,  2.7828, -0.1331])
expected: $500000.0, actual: $-492.46
expected: $800000.0, actual: $-690.89
expected: $250000.0, actual: $-258.08

Clearly, these outputs are unintelligible, and will only become intelligible with a "good" choice of parameter $w$ . Before we find such a parameter, let's make one more modification to our function $f$ to increase it's performance. Rather then evaluating f_typechecked on every input xi_D in the dataset zip(X_ND,Y_N) sequentially with a loop, we can evaluate all outputs with a single matrix-vector multiplication by updating $f$ 's function definition to the following

$f_{batched} f_{batched} (X; w) : R^{n \times d} \to R^{n} : = Xw = x_{1}^{⊤} w x_{2}^{⊤} w ⋮ x_{n}^{⊤} w$

and the corresponding PyTorch updated to

import torch
@jaxtyped(typechecker=beartype)
def f_batched(
  X_ND: Float[torch.Tensor, "N D"],
  w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
  return X_ND@w_D

Now that we've selected our model for the task $T$ , we can proceed with selecting our performance measure $P$ which the learner will improve on with experience $E$ .

2.3 Fitting Linear Regression by Directly Optimizing Squared Error

$↩$ Table of Contents

After the inductive bias on the family of functions has been made, the learning algorithm must find the function $\hat{f}$ with a good fit. Since artificial learning algorithms don't have visual cortex like biological humans[], the notion of "good fit" needs to defined in a systematic fashion. This is done by selecting the parameter $θ \in R^{m}$ which maximizes the likelihood of the data $p (d; θ)$ . Returning to the linear regression inductive bias we've selected to model the house price data, we assume there exists noise $ϵ^{(i)}$ in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that $y^{(i)} = θ^{⊤} x^{(i)} + ϵ^{i}$ where $ϵ^{(i)} \sim N (μ, σ^{2})$

prices $y^{i}$ are normally distributed conditioned on seeing the features $x^{i}$ with the mean being the equation of the line $θ^{⊤} x^{(i)}$ where $y^{(i)} ∣ x^{(i)} \sim N (μ = θ^{⊤} x^{(i)}, σ^{2})$ , then we have that

$p (y^{(i)} ∣ x^{(i)}; θ) = N (θ^{⊤} x^{(i)}, σ^{2}) = \frac{1}{2 π σ ^{2}} exp (- \frac{( y ^{(i)} - θ ^{⊤} x ^{(i)} ) ^{2}}{2 σ ^{2}})$ TODO: generate prose/exposition by repaging lecture/text in ram

Returning to the linear regression model, we can solve this optimization with a direct method using normal equations. QR factorization, or SVD.

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
  X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) #  data

  X_b = np.column_stack((np.ones_like(X), X))              # [1, x]
  bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y)   # w = [b, m]

  yhats = fhatbatched(X, mhat, bhat) # yhat
  for y, yhat in zip(Y, yhats):
    print(f"expected: ${y:.0f}, actual: ${yhat:.2f}")

To summarize, we have selected and computed

an inductive bias with the family of linear functions $f (X; θ) : = X @ θ$
an inductive principle with the least squared loss $L (θ) : = \sum_{i = 0}^{n} (y^{(i)} - X @ θ)^{2}$
the parameters which minimze the empirical risk, denoted as $(\hat{θ}) = arg min L (θ)$

Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.

2.4 Predicting Categories by Linearly Modeling Classification

logistic regression, cross entropy loss, gradient descent

2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss

2.6

linearity depends on that the scalars are in a field (addition is associative) "scalars" in a field --> we are dealing with floating point representations "vectors" we saw coordinate systems in 1.4 "matrices" and we saw linear transformations in 2.1 mathematically, a vector is a coordinate in some basis of a vector space a matrix is a linear function, the linear, in numerical linear algebra, is aspirational.

3. Accelerating Functions and Data with Basic Linear Algebra Subroutines in `teenygrad`

3.1 From Abstract Linear Algebra to Numerical Linear Algebra

3.2 Virtualizing ND Logical Shapes to 1D Physical Storages with Strides

from typing import Self
import array, math
import teenygrad

class InterpretedTensor:
  @classmethod
  def arange(cls, end: int, requires_grad: bool=False) -> Self: return InterpretedTensor((end,), list(range(end)), requires_grad=requires_grad)
  @classmethod
  def zeros(cls, shape: tuple[int, ...]) -> Self:
    numel = math.prod(shape)
    tensor = InterpretedTensor((numel,), [0.0]*numel).reshape(shape)
    return tensor
  @classmethod
  def ones(cls, shape: tuple[int, ...]) -> Self:
    numel = math.prod(shape)
    tensor = InterpretedTensor((numel,), [1.0]*numel).reshape(shape)
    return tensor
  
  def __init__(self, shape: tuple[int, ...], storage: list[float], inputs: tuple[Self, ...]=(), requires_grad: bool=False) -> None:
    self.shape: tuple[int, ...] = shape
    self.stride: tuple[int, ...] = [math.prod(shape[i+1:]) for i in range(len(shape))] # row major, and math.prod([]) produces 1
    self.storage: list[float] = storage

    self.inputs: tuple[Self, ...] = inputs
    self._backward = lambda: None # callers override after init with self captured in closure
    self.grad: InterpretedTensor = InterpretedTensor.zeros(shape) if requires_grad else None # python can recursively type (no need for Box<_>) bc everything is a heap-allocated reference
  @property
  def numel(self): return math.prod(self.shape) # np (and thus jax) call this .size
  @property
  def ndim(self): return len(self.shape)
  @property
  def T(self) -> Self:
    assert self.ndim == 2
    m, n = self.shape
    t = InterpretedTensor((n, m), self.storage)
    t.stride = [self.stride[1], self.stride[0]]
    return t

  def reshape(self, shape: tuple[int, ...]) -> Self:
    self.shape = shape
    self.stride = [math.prod(shape[i+1:]) for i in range(len(shape))] # math.prod([]) produces 1
    return self
  
  def __repr__(self) -> str:
    return f"InterpretedTensor({self.chunk(self.storage, self.shape)})"
  @staticmethod
  def chunk(flat, shape):
    if len(shape) == 1: return flat[:shape[0]]
    size = len(flat) // shape[0]
    return [InterpretedTensor.chunk(flat[i*size:(i+1)*size], shape[1:]) for i in range(shape[0])]
  
  # backward f'(x)
  @staticmethod
  def topo(node: InterpretedTensor, seen: set[InterpretedTensor], output: list[InterpretedTensor]) -> None:
    if node in seen: return
    seen.add(node)
    for input in node.inputs: InterpretedTensor.topo(input, seen, output)
    output.append(node)

  def backward(self) -> None:
    seen, topologically_sorted_expression_graph = set(), []
    InterpretedTensor.topo(self, seen, topologically_sorted_expression_graph)

    self.grad = InterpretedTensor.ones(self.shape) # base case
    for tensor in reversed(topologically_sorted_expression_graph): tensor._backward()
  
  # forwards f(x)
  def __radd__(self, other: Self) -> Self: return self.__add__(other)
  def __add__(self, other: Self) -> Self:
    n, alpha = self.numel, 1
    x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*(n))
    teenygrad.eagkers.cpu.saxpy(n, alpha, x, y) # y=axpy
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad
      other.grad += output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def __rmul__(self, other: Self) -> Self: return  self.__mul__(other)
  def __mul__(self, other: Self) -> Self:
    n = self.numel
    x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*n)
    teenygrad.eagkers.cpu.smul(n, x, y, z)
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(z), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad * other
      other.grad += output_tensor.grad * self
    output_tensor._backward = _backward
    return output_tensor

  def __neg__(self) -> Self:
    n = self.numel
    x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
    teenygrad.eagkers.cpu.saxpy(n, -1, x, y)
    requires_grad = self.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
    def _backward():
      self.grad += -output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def __sub__(self, other: Self) -> Self:
    n = self.numel
    x, y = array.array('f', other.storage), array.array('f', self.storage)
    teenygrad.eagkers.cpu.saxpy(n, -1, x, y)
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad
      other.grad += -output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def tanh(self) -> Self:
    n = self.numel
    x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
    teenygrad.eagkers.cpu.stanh(n, x, y)
    requires_grad = self.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad * (InterpretedTensor.ones(self.shape) - output_tensor * output_tensor) # f(x) = tanh(x) ==> f'(x) = 1 - tanh(x)^2
    output_tensor._backward = _backward
    return output_tensor

  def __rmatmul__(self, other: Self) -> Self: return other.__matmul__(self) # GEMM does not commute: AB != BA
  def __matmul__(self, other: Self) -> Self:

3.3 Implementing Basic Linear Algebra Subroutines in Python: `AXPY`, `GEMV` and `GEMM`

3.4 Accelerating the `BLAS` on `CPU`s with the `SISD` Programming Model: Instruction Sets, Pipelines and Hierarchies

The core idea behind the borrow checker is that variables have three kinds of permissions on their data:

Read (R): data can be copied to another location.
Write (W): data can be mutated in-place.
Own (O): data can be moved or dropped.

/// SGEMM with the classic BLAS signature (row-major, no transposes):
/// C = alpha * A * B + beta * C
fn sgemm(
  m: usize, n: usize, k: usize,
  alpha: f32, a: &[f32], lda: usize,
  b: &[f32], ldb: usize,
  beta: f32, c: &mut [f32], ldc: usize) {
  assert!(m > 0 && n > 0 && k > 0, "mat dims must be non-zero");
  assert!(lda >= k && a.len() >= m * lda);
  assert!(ldb >= n && b.len() >= k * ldb);
  assert!(ldc >= n && c.len() >= m * ldc);

  for i in 0..m {
    for j in 0..n {
      let mut acc = 0.0f32;
      for p in 0..k { acc += a[i * lda + p] * b[p * ldb + j]; }
      let idx = i * ldc + j;
      c[idx] = alpha * acc + beta * c[idx];
    }
  }
}

fn main() {
  use std::time::Instant;

  for &n in &[16usize, 32, 64, 128, 256] {
    let (m, k) = (n, n);
    let (a, b, mut c) = (vec![1.0f32; m * k], vec![1.0f32; k * n], vec![0.0f32; m * n]);

    let t0 = Instant::now();
    sgemm(m, n, k, 1.0, &a, k, &b, n, 0.0, &mut c, n);
    let secs = t0.elapsed().as_secs_f64().max(std::f64::MIN_POSITIVE);
    let gflop = 2.0 * (m as f64) * (n as f64) * (k as f64) / 1e9;
    let gflops = gflop / secs;

    println!("m=n=k={n:4} | {:7.3} ms | {:6.2} GFLOP/s", secs * 1e3, gflops);
  }
}

*AMD Zen5 Core Die Shot by Fritzchens Fritz (Hover for annotation by Nemez)*

Thus far in our journey we've successfully made the transition from programming software 1.0 which involves specifying of discrete data structures — like sets, associations, lists, trees, graphs, — with the determinism of types to programming software 2.0 which involves reovering continous data structures — like vectors, matrices, and tensors — with the stochasticity of probabilities. If you showed the mathematical equations for your models, loss functions, and optimizers to our mathematical cousinsIn particular, the physicists whom realized describing reality as minimizing free-energy was fruitful, such as Helmholtz with energy, Gibbs with enthalpy, and Boltzmann with entropy. who also made the same transition, they would understand the mathematical equations and even algorithmsas until recently, algorithms were understood to be a sequence of instructions to be carried out by a human. i.e Egyptian Multiplication, Euclid's Greatest Common Divisor. But they wouldn't understand how the code we've programmed thus far with Python and Numpy is being automatically calculated by the mechanical machine we call computers. For that we need to transition from users of the numpy framework to implementors our own framework, which we'll call teenygrad. This requires us moving from the beautiful abstraction of mathematics to the assembly heart and soul of our machines. This book uses RustTo backtrack to familiarize yourself with Rust, take a look at the experimental version of The Rust Programming Language, by from Will Crichton and Shriram Krishnamurthi, originally written by Steve Klabnik, Carol Nichols, and Chris Krycho., but feel free to follow along with C/C++In that case, take a look at "The C Programming Language by Brian Kernighan and Dennis Ritchie. — what's fundamental to the purpose of accelerating said basic linear algebra subroutines on the CPU is is choosing a language that 1. forgoes the dynamic memory management overhead of garbage collection and 2. has a compiled implementation which allows us to analyze and tune the actual instructions which an underlying processor executes. Using Rust will allow us to start uncovering what is really happening under the hood of accelerated Tensors To sidetrack to familiarize yourself with systems programming in the context of software 1.0 , read the classic of Computer Systems A Programmer's Perspectivewith instruction set architectures, microarchitecture, memory hierarchies

instruction set architecture
computation: control path, data path
communication: memory, input/output

In the last chapter we developed teenygrad.Tensor by virtualizing physical 1D storage buffers into arbitrarily-shaped logical ND arrays by specifying an iteration space with strides, and started our journey in accelerating the basic linear algebra subroutines defined on the Tensor with Rust, which speeds up the throughput performance of SGEMM from 10MFLOP/S to 1GFLOP/S. Simply executing a processor's native code — referred to as assembly code — rather than Python's bytecode results in an increase of two orders of magnitude.

Now that the code that is being executed is native assembly, we can dive deeper into the architecture of the machine in order to reason and improve the performance of teenygrad's BLAS.

SITP

I. Elements of Networks

I. Table of Contents

1. Representing Data with High Dimensional Stochasticity in `torch`

1.1 From Software 1.0 to Software 2.0

1.2 Distributing Words Stochastically with Probabilities

1.3 Composing Probabilities with Rules: Sum Rule, Product Rule, and Bayes Rule

1.4 Random Variables and their Distributions, Expectations, and Variance

1.5 Learning Probabilities from Data with Parameter Estimation via Maximum Likelihood

1.6 Representing Data Dimensionally with Random Vectors of Coordinate Spaces

1.7 Matrix Factorizations: Eigenvalues, Singular Values, and their Decompositions

1.8 Learning Subspaces from Data with Principal Component Analysis via Maximum Variance...

2. Learning Functions from Data with Optimization in `torch`

2.1 From Recovering Data to Recovering Functions

2.2 Predicting Scalars by Linearly Modeling Regression

2.3 Fitting Linear Regression by Directly Optimizing Squared Error

2.4 Predicting Categories by Linearly Modeling Classification

2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss

2.6

3. Accelerating Functions and Data with Basic Linear Algebra Subroutines in `teenygrad`

3.1 From Abstract Linear Algebra to Numerical Linear Algebra

3.2 Virtualizing ND Logical Shapes to 1D Physical Storages with Strides

3.3 Implementing Basic Linear Algebra Subroutines in Python: `AXPY`, `GEMV` and `GEMM`

3.4 Accelerating the `BLAS` on `CPU`s with the `SISD` Programming Model: Instruction Sets, Pipelines and Hierarchies

3.5 Accelerating Computation with Instruction Level Parallelism via Loop Unrolling

3.6 Accelerating Computation with Data Level Parallelism via Vector/Matrix Extensions

3.7 Accelerating Communication with Memory Hierarchies via Register and Cache Blocking

3.8 Accelerating Computation with Thread Level Parallelism

Keyboard shortcuts

SITP