Pandemic Diary

11 October 2023: SD XL interpolations and large language model sampling

(I'm still calling this a 'pandemic diary', because the world is still sick. —CMS)

1. Puzzle Box XL

Stable Diffusion XL is a pretty heavy model, but since I love my Puzzle Box LDMs based on the earlier generation SDs, you bet your last dollar I'm training a Puzzle Box XL based on the new architecture.

Stability recommends that most users stick to low-rank adaptations rather than full trains of the model. LoRAs are much, much easier to do with this architecture, for sure, but I have literally tens of thousands of new concepts to add to the model, many of them abstract things, revealed or implied preferences, and I do have the compute locally to train the U-Net, at least, thoroughly. So time for more pretraining. LoRAs are for 70 billion parameter large language models, not these small beans!

This model doesn't just have a pipeline of two different denoisers, trained for the earlier and later stages of the process; it also uses an ensemble of text encoders. Because ensembles and mixtures are in, you understand. Your brain is messy and haphazard and modular, why wouldn't a computer brain be? -- or so the thinking goes.

Past Stable Diffusion models have used either CLIP/ViT (1.x) or OpenCLIP (2.x) for the text encoder, and they each had different strengths. The 1.x text encoder came with really superior semantic embeddings for various art styles (generic and particular), while OpenCLIP captures spatial relations and composes features together much better than the earlier model. SD XL said "to hell with making the text encoder better, let's just use them both simultaneously and try to leverage the good bits of both models". Lazy, but it seems to work well enough, so maybe it's good lazy. You can do some fun things like conditioning the two text encoders with different inputs, using a different guidance scale on each, doing different kinds of weighted averages, etc. So there are a few more knobs for playing with things. That can't be bad, right?
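
A minimal sketch of that dual-prompt knob, using the diffusers pipeline (recent versions expose the second encoder via prompt_2; the model id, prompts, and settings here are just placeholders):

```python
# A minimal sketch: feed a different prompt to each of SDXL's two text encoders.
# In diffusers, `prompt` goes to the CLIP ViT-L encoder and `prompt_2` to the
# OpenCLIP encoder. Model id, prompts, and guidance scale are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(
    prompt="an oil painting in the style of a Dutch master",       # style cues -> first encoder
    prompt_2="a still life of coins and puzzle boxes on a table",   # composition -> second encoder
    guidance_scale=7.0,
    num_inference_steps=30,
).images[0]
image.save("dual_prompt.png")
```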

The fact that only the model's U-Net can fit into a single RTX 3090 (seriously, 3090s are like gold bricks blessed by God, they're the Little Engine That Could; such a good GPU for the money), and only then just barely, presents a slight difficulty, but only slight. One of the ways you do all this on a budget, of course, is pre-calculation: if you aren't training the text encoders or the VAE, their outputs are fixed, so why should those models be loaded into GPU memory during training when you can just read the desired latents and embeddings from disk?
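
The pre-calculation pass itself is nothing exotic; a rough sketch of the VAE half, assuming diffusers and a folder of images (paths, resolution, and the caption side are placeholders):

```python
# Rough sketch of the pre-calculation idea: run the frozen VAE once over the
# dataset and cache the latents to disk, so only the U-Net needs to live in
# GPU memory during training. Paths and image handling are placeholders.
import glob
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae",
    torch_dtype=torch.float16,   # fp32 may be numerically safer for the SDXL VAE
).to(device).eval()

to_tensor = transforms.Compose([
    transforms.Resize(1024),
    transforms.CenterCrop(1024),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # scale pixels to [-1, 1]
])

with torch.no_grad():
    for path in glob.glob("dataset/images/*.jpg"):
        pixels = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0).half().to(device)
        latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
        torch.save(latents.cpu(), path + ".latent.pt")
# The caption embeddings from the two frozen text encoders get the same treatment.
```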

Yes, this means I currently have terabytes of pre-calculated VAE latents and Double-Stuf caption embeddings strewn about the floor. I took a trip to Bellingham while my machine calculated all this out; it was a 60-hour process. (It's actually a little rough working with terabytes on a SATA drive; I could and should have used an SSD.) This is what you have to do when you're running on a budget of like ten cents. I scatter captions to point out the way, like my atavistic toddler-self arranging road signs on the floor. I could train the text encoders, if I were willing to engage a second RTX 3090 (it stands ready, the loyal soldier), but no, not today. I am doing wonderful and sick (good and wild, 1980s sense of the word) things with the language models. Working with the samplers was a good idea, BTW -- it's shown me a few paths forward. But I get ahead of myself.

I've done good work on the text encoders in my past models -- that "what do Lisa Frank, Rudyard Kipling, and Bill Watterson have in common" prompt was a Codehappy special, even relatively small tencs can pull pretty impressive things from vapor -- and the fact that SD XL uses the same version of OpenCLIP that SD 2.1 did means I can just pop in my version and train with it in there. Maybe that shouldn't work as well as it does, but it does.

Anyway, I've had Puzzle Box XL in the crockpot for ~10 weeks now, and I've Frankensteined in the text encoder I trained for my SD 2.1-based models (underrated model, SD 2.x; it didn't do nudity out-of-the-box so all the waifu weenies ignored it, but my 2.x tenc is so, so solid!) but it's still not quite ready. It's a heavy model! But it's making good progress.

The loss curves on these models tend to be deceptive, seesawing up and down wildly for a long time before finding a new minimum. How do I know it's still improving? Qualitatively, and the best way I know to see this is to make spherical interpolations between prompts: improvements, or lack thereof, are much more apparent when the model is in motion. Also, they're fun!

(You also observe things about how the model organizes concepts in the semantic latent space; it gives you a better appreciation of the learned structure in these models.)
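
The interpolation machinery itself is tiny -- a minimal spherical lerp over the conditioning tensors, something like this sketch:

```python
# Spherical interpolation between two conditioning tensors (prompt embeddings,
# or initial noise latents). Sweep t from 0 to 1 and render a frame per step.
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    a, b = v0.flatten().float(), v1.flatten().float()
    dot = torch.clamp(torch.dot(a / a.norm(), b / b.norm()), -1 + eps, 1 - eps)
    theta = torch.acos(dot)
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# e.g. frames = [slerp(i / 119, emb_a, emb_b) for i in range(120)]
```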

Coin Collection (interpolation video).

The model fits closely to some genuine coin designs, at least ones that were repeated in the training data. But most frames in the interpolation don't actually show real coins; they only resemble or suggest real ones. It's maybe more wild if you collect coins and can identify the genuine designs. Stepping through frame-by-frame is not only fun, it's highly recommended. Watch out for the Hapsburg chins!

Why do I have duplicated designs in the training set? Because the coins are of the same type but in different grades; the model can be queried by grade, and it will render a coin in approximately that condition. You could turn the model around and make an automatic coin grader out of it.
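
Not literally the LDM run in reverse, but the cheap version of the same idea is zero-shot grade classification with an off-the-shelf CLIP text/image pair; a sketch (the grade captions are hypothetical and would want tuning against real graded photos):

```python
# Crude zero-shot "grader": score a coin photo against grade captions with an
# off-the-shelf CLIP. The captions are invented for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

grades = [
    "a photo of a heavily worn Morgan dollar, grade Good-4",
    "a photo of a lightly circulated Morgan dollar, grade XF-40",
    "a photo of an uncirculated Morgan dollar, grade MS-65",
]

def grade_coin(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=grades, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return grades[int(probs.argmax())]
```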

I could show you animations of all kinds of cartoon characters that would melt your eyes -- just gorgeous -- the model doesn't generalize everything, but what it does is uncommonly good! -- but I don't want to go to the Disney fan art gulag, that would be an ignoble end for me. It's legendary, that's all, and that's a little scary. Do you want to be spooked for Halloween, kids? Should there be a screening of the Forbidden (For Copyright, Not For Weird Reasons) Cartoons?

I am sorely tempted to train a Zoetrope-style video model from this and my best SD 2.x model, but I've been concentrating on better LLM inference the last few months and they'll have my attention for a few months more at least. But being able to peek in and watch the progress of what, I am serious, feels like the closest we'll ever get to computing Platonic forms, is something light and wholesome to brighten my day.

2. AIpocalypse!!!!!

If you have been paying attention to the open-weights large language models space -- I have been watching it closely and helping to drive it in a few small ways: compute, tips, and time; my momma didn't raise no fool -- you're aware there has been impressive progress. A few extremely valuable open-weights pretrained models are available for research, any personal use, and almost any commercial use: Llama 2-70B, Falcon-180B, and Mistral-7B are the most important at the moment. (Collectively, these models represent more than $20 million in compute, dropped on the internet for anyone to download and use. The initial pretrain is the difficult, expensive part; there are all kinds of ways to streamline finetuning and inference now.) There are many, many finetunes of various types (LoRA, qLoRA, all-layers or partial-layers full finetunes, etc.) and of greatly varying quality. The best finetunes currently available I would characterize roughly into five groups:

  • Small, specialized finetunes. These are often built on Mistral-7B or Llama 2-13B at the moment, although some (like phi 1.5b and Gorilla 3b) are built on even smaller models. These often give state-of-the-art performance, surpassing gpt-4, Claude 2, etc. within their specialized domain; in general, it's easy to outperform a larger general model with a small specialized model, as long as you have sufficiently high-quality training data. Since they are far smaller, far faster, and far cheaper to run, they're easily the efficient frontier for businesses with LLM applications. (There are far more businesses that think they have an LLM application than businesses with an actual, factual LLM application -- we're at like dot-com levels of hype -- but let's leave that for now.)
  • General finetunes on Llama 2-70b. The best of these that are publicly available are probably XWin 70b and Synthia 70b at the moment, but more come out all the time. These models are very strong; if you're sampling correctly (more on that later), they often blow gpt-3.5 away on just about every task except perhaps code generation.
  • Coding finetunes, usually Code Llama-34b or a derivative. The 34 billion parameter Llama 2 is currently only available in a form that was pretrained further on large amounts of source code. Since code generation is the one area where llamas are usually inferior to the alternatives, this is presumably a useful thing to have. Code generation is maybe the last application I have need for, but for those who want their own Copilot-style completion model, this is the place to look.
  • Systems of LMs based on Falcon-180b or Llama 2-70b experts. People sometimes ask me when I think there'll be an open-weights gpt-4-strength large language model out there. I always ask why they're asking in the future tense. No, there isn't one model you can download to your computer and get gpt-4 quality inference from with no additional work, but gpt-4 almost certainly isn't just one model, either. You can implement a mixture-of-experts based on different trains of the strongest open models out there, plus a switch transformer controller (Llama 2-70b and Falcon-180b are the centers of attention here, though people implement MoEs with smaller models too), and there are even a few such MoEs available for download on HuggingFace; a bare-bones routing sketch follows this list. RIP your hardware, but if you absolutely positively must live on the bleeding-edge frontier, and you absolutely positively must squeeze as much juice out of your oranges as you can, it's quite possible (do check whether you can get sufficient strength sampling a single model, though -- you should absolutely try that first).
  • Multimodal models. The hot one right now is LLaVA 1.5, which is Llama tied to a computer vision system so the model can describe or answer questions about images. These are super-sexy and I'm probably going to have to re-caption all of the images in my LDM pretrain dataset using this.
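
As promised above, the "switch" in such a system doesn't have to be a trained switch transformer to be useful; a much simpler stand-in just embeds the incoming query and routes it to whichever expert's description it most resembles (expert names, descriptions, and the run_expert hand-off are placeholders):

```python
# Simple embedding-similarity router, a stand-in for a learned switch
# controller: dispatch the query to whichever local expert's description it
# most resembles. Names, descriptions, and run_expert are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

EXPERTS = {
    "xwin-70b":      "general chat, reasoning, writing, analysis",
    "codellama-34b": "source code generation, debugging, refactoring",
    "synthia-70b":   "long-form instruction following and research summaries",
}
expert_vecs = {k: embedder.encode(v, normalize_embeddings=True) for k, v in EXPERTS.items()}

def route(query: str) -> str:
    q = embedder.encode(query, normalize_embeddings=True)
    return max(expert_vecs, key=lambda k: float(np.dot(q, expert_vecs[k])))

# run_expert(route("Write a quicksort in Rust"), ...)  # -> "codellama-34b"
```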

Ensuring that academia, small business, and regular folks like you and me have access to strong ML models and their benefits is one of the biggest fights going in tech right now. We are, in fact, drawing blood against the frontier model cartel; I am a firm believer that if you want to beat oligarchy in 'AI', the way that will actually work is to commoditize their toys. Fair competition, rather than closed-research closed-weights monopolies, is the force that will drive us forward safely. Centralized control of power is the real danger here -- there is zero chance of apocalypse via runaway AI, but a non-zero chance of apocalypse via regular ol' human stupidity and greed. (Look into your heart: you know it to be true.)

I guess I have to say something about 'apocalypse' here, because people have asked me repeatedly if I think "AI" will actually wreck the world, and there's lots of media fretting about it, partly because fear sells, partly because there's an expedient narrative to push. No, it absolutely will not, not directly anyway. If we do ourselves in, it will be entirely our fault. I'll explain why.

Lyapunov exponents are found at the intersection of information theory and chaos theory, and at the intersection of dynamical systems in physics and computer science. In astrophysics, for example, they tell you how far into the future (in the best case) you can calculate the orbit of a tumbling asteroid accurately. Tumbling asteroids, you understand, are chaotic; due to interactions with solar wind and diffuse gas clouds and particles of dust along their way, their course will subtly change in essentially unpredictable ways over time. Their shape is irregular and they aren't keeping the same face toward anything, so these effects will necessarily accumulate; there's no symmetry to cancel them out in the statistical long run. (Seriously: the differential in strength between the extremes of an asteroid in something as weak as the solar wind or interplanetary medium, over thousands or millions of years, can completely alter its orbit. This is one of the leading hypotheses on how to build a solar sail capable of relativistic interstellar travel -- but how would we aim it?)

So, for a large planet in a stable orbit, we might be able to confidently calculate its position a billion years from now. But for an especially chaotic asteroid? 100 years might be impossible. A single human lifespan can see something strange happen in astronomy, you understand; not everything unfurls over cosmic time. An asteroid can lose its own orbit like a child losing a library book inside a human lifetime. Minor, irrelevant in the grand scheme of things perhaps, but something different in the universe that you could plausibly see happen.
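
If you want to feel the horizon in your bones, the standard toy example is the logistic map; a quick sketch (nothing asteroid-specific here, just the textbook chaotic-map exercise) that estimates both the exponent and the prediction horizon it implies:

```python
# The logistic map at r = 4 is fully chaotic. Estimate its Lyapunov exponent
# (the average log-rate at which nearby trajectories diverge) and the
# prediction horizon that exponent implies.
import math

r = 4.0
f = lambda x: r * x * (1.0 - x)

x, lam, n = 0.3, 0.0, 200_000
for _ in range(n):
    lam += math.log(abs(r * (1.0 - 2.0 * x)))  # log |f'(x)| along the orbit
    x = f(x)
lam /= n  # ~ ln 2 = 0.693 for r = 4

# With initial uncertainty delta0 and error tolerance tol, the horizon is
# roughly (1/lambda) * ln(tol / delta0): logarithmic in measurement precision.
# A thousandfold better measurement buys you only ~ln(1000)/lambda more steps.
delta0, tol = 1e-12, 1e-2
print(f"lyapunov exponent ~ {lam:.3f}")
print(f"prediction horizon ~ {math.log(tol / delta0) / lam:.0f} steps")
```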

Asteroids are all well and good -- why do the analogies to other sciences always end in the cosmos? -- but what does this have to do with giant computational graphs?

Merely this: you can approximate a 'world model' in a computational graph to whatever accuracy you wish, you could feed it everything ever written by humans or chimpanzees or other computers, you could show it every art masterpiece or desperate danbooru darling drawing, (God help me, I've been sculpting with stardust and there are no brakes anymore!) you can scale it up to quadrillions of parameters and consume the entire Gross Galactic Product in the pretrain, feeding it stars and black holes and myths of creation, everything at the very foundation of the universe, and you will not get the TREACLES' imagined computer god out the other end, but still find this limit:

The model cannot reliably forecast outside the Lyapunov horizon. Also, the Lyapunov horizon is much closer than it appears.

You can print that on a Surgeon General's Warning sticker right now. I suggest that you do, and mass-produce them, and slap one on the forehead of every TREACLES tech bro and one on the side of every Talkie Toaster in existence. It might help. That's the sort of AI regulation I could get behind.

Space is very nearly empty -- there is very little gas or dust or solar wind to move anything -- and yet that little is powerful enough to divert the orbits of massive bodies, sometimes over extremely short (by cosmic scale) timeframes. If you are attempting to model the universe (and let's be very clear here, a "general artificial intelligence" is actually nothing more nor less than a model of the universe; they are synonymous concepts) that sort of thing should give you pause. The whole "general" in "general artificial intelligence" depends on the model being able to extrapolate to infinity and beyond, but there is always -- always -- a limit to how far extrapolation will take you, and hence a hard limit to what you can generalize from knowable or computable facts. That's a ceiling imposed by both the laws of physics and the laws of mathematics. No way around that one; we built that ceiling with the strongest materials we've got.

This means, fortunately, that the "foom scenario" -- where artificial intelligence becomes strong enough to improve itself far beyond the ability of anyone to keep up with it or control it -- is a complete fizzle; it will never happen. The real world, "everything everywhere all at once", is a huge chaotic dynamical system, and you don't have perfect knowledge of the starting conditions, so you cannot predict reliably enough to become an eater of worlds. There is similarly a ceiling to self-supervised learning (it's high enough that it doesn't matter much in practice), which does mean (and this could be fortunate or unfortunate, depending on how you look at it) that there will always be significant limitations to the technology.

This is the sort of thing that shouldn't even have to be said ("no, we aren't creating a computer god, get a hold of yourself") but there's a lot of people, including lots of people that should know better, going apoplectic with the thought that The Singularity is Nigh or the dread paperclip optimizer will choose the company of paperclips over human beings.

With enough compute, we can do anything computable... but it turns out that most of the hardest problems that face us aren't actually computable and there isn't one perfect solution, or even a solution that's provably optimal under constraints. Chaos theory demands this, even in circumstances of perfect knowledge of the initial conditions. Lyapunov smiles down on us from heaven, or wherever it is that old Soviet mathematicians go, and gently soothes us with the blessed knowledge of the certain unknowability of everything. Computation will be our servant, not our master, unless we set ourselves into servitude.

The TREACLES can fall down and worship their silicon calf if they like. I am a firm believer in the freedom of religion and in following the dictates of your own spirit. It's stupid, though. Artificial brains are only so impressive when you can cast them at home. (We live in a world where any schmuck with 256 GB of RAM, not even any gpus!!!, can qLoRA a Falcon-180b -- yes, it's glacial, but data quality >> data quantity for this purpose, LIMA you know, and switch transformer mixtures of experts are pretty easy to implement -- let him that hath ears to hear, hear, but what that gobbledygook means is that, regardless of regulation or future developments, strong AI proliferation is happening, like it or not. Sam Altman and Elon Musk and Peter Thiel and Mustafa Suleyman and Josh Hawley and Karla Ortiz and Susan Sarandon can complain about it all they like -- and, make no mistake, they are all playing for the same team -- but the strong AI of the future will be open and available for anybody to do with as they please, and it will be very difficult to actually make money from it. Again, you beat them by commoditizing their toys: it's as simple as that. People like me are ratchets and reminders, and I'm not the only one!)

I am, I suppose, committing here the sin of Prometheus and of Adam both: the sin of knowing things. I'm a heretic by the TREACLES' faith, but could it be any other way?

It is worth saying that, yes, people will do terrible things with strong ML models; I would first emphasize that governments and large companies around the world and billionaires with bad taste in everything are already doing terrible things with it. China's whole surveillance state runs on it already. Only the spookiest of spooks knows what the NSA is doing with it here at home in America (are you sufficiently spooked for Hallowe'en, folks?) Don't kid yourselves on this point. Even if we get lucky and the current government doesn't do anything too nasty with it, we're only gonna be lucky for so long. There are reasons authoritarians around the world have been moist in the trousers the past couple years.

Centralization of the strongest models just gives the authoritarians a super-easy target. They want them because, even though they can't take over the world with them, they can act like secret police everywhere, the state's own robot Stasi; they are excellent at essentially data-mining large amounts of unstructured text, so they'll make detecting and tracking political dissidents or other undesirables much simpler. They can also give a layer of unaccountability to whatever awful thing the state wants to do -- relegate decisions about dirty work to the model, and if it is a mistake, well, blame the computer, I suppose; it's an easy and convenient scapegoat. That's it, that's ballgame. Doesn't really matter that it's inconvenient to release strong model weights, that a few super-wealthy people aren't going to become even more superhumanly wealthy, that it might, in the asymptote, effectively raise the average IQ in the world by 10 points and that's troublesome for various sociopolitical reasons -- those things are skittles, small beer, compared to stable authoritarian capture. If you get a really sick fuck in charge, the atrocities he'll cook up with all our data could make Room 101 a reality (unlike Winston, I wouldn't get rats, which are goofy little babies, I remind you). You could have kinda-sorta enforceable laws around wrongthink: that kind of technodystopia. That is the only actual "existential risk" around AI that I can point to, that might actually happen; it's also the only "existential risk" we can actually prevent. (Climate change is still way more likely to wipe us out, sorry to say. None of the supposed dangers around "AI" even rate in comparison.)

And before the looming threat of the dictator, there are the mundane threats of the monopolists. Regulate AI as a utility, for sure, but never let it be a Ma Bell-style monopoly, nor even a cartel monopoly of a few big providers like the current telco/cable provider situation. That's bad enough when it's somebody jacking up the price of your data plan or gouging you for (HBO) Max, but imagine how bad it would be if they're managing a new trillion dollar industry, perhaps eventually the majority of economic activity.

Maybe you think that I am wrong, and Lyapunov is wrong, and that so-called "AI safety" is the most important thing, and Yudkowsky has a point, and for the good of everybody the strongest ML models must be locked away as closely guarded secrets, only available to a select few behind an opaque API. Even in that world, even allowing the "safety" arguments (which, I repeat, are all bunkum), the closed model is far more dangerous than the open model, for the simple reason that nobody can actually verify the safety of the closed model!

There is no way to do sound, replicable research with a blackbox API-access model. You need the weights. There is absolutely no other way. This is the one thing that a "frontier model" company will never give you; they will even limit your API access in various ways, to avoid their model escaping via distillation. And how do you know for sure that the model is the same version, or that the same version will remain available in the future for others to replicate your research? None of the "frontier model" companies will or can guarantee that in a verifiable way.

I would add that it is emphatically not the case that extremely large language models are "safer" or "more steerable" than small models; quite the opposite is true. Extremely large language models, in the range of 100 billion parameters and larger, tend to imitate even rarer, even more idiosyncratic behaviors in the training set. These are what the literature terms "negative scaling factors." Larger models are spontaneously toxic more often, for example. Larger models are more likely to memorize falsehoods in the training data (conspiracy theories, for instance). Larger models can imitate human cognitive biases, or even mental illnesses (after consuming schizophrenic posts, say).

There is no way to train an LLM that will make it "safe", in the sense that it will unfailingly avoid the space of these undesirable outputs. You might suppress 99% of undesired output with a lot of RLHF/RLAIF and a separate classification model as an additional check, although false positives will almost certainly outnumber true positives. The only known way to prevent it 100% is to scrub the training data to the point where the model never learns about, or is able to imply, the undesired thing; this is generally not practical, because language models can successfully extrapolate from training data, so you would have to limit the model's general usefulness to prevent so-called "unsafe" outputs entirely. (There is also research on 'machine unlearning', removing "unsafe" knowledge from the model, but the problem with that is much the same as the 'never learn the bad thing' solution: the unwanted knowledge is never a single cleanly extricable feature, weights in well-trained LLMs are highly polysemous and are responsible for different actions on different feature activations, and removing the "unsafe" content invariably harms the model as a whole.)

RLHF to prevent 'hallucination' is in fact, in an important sense, completely misguided and the wrong thing to do. This is because LLMs don't actually reason or plan; a single token inference in a non-recurrent computational graph necessarily runs in constant time. This means, in order to do a task successfully, they have to at some stage be able to take a wild guess and be right. This is the diametric opposite of being truthful, understand: LLMs require 'hallucination' in order to function -- every extrapolation outside of the training data is hallucinated, right or wrong. Internalize that. When you work on your LLM, you aren't trying to create HAL or 'AGI' or Asimov's harmless helpful assistant, you're building a better bullshitter, and that's all.

Maybe this sounds dismissive, but it isn't meant that way! When the model is a sufficiently good bullshitter, it's also very useful! Utility and bullshittery are strongly positively correlated here! What a world, huh?

So, how do you build your better bullshitter? You've read this far; you probably want to know. What secret can there possibly be that allows any bozo with a strong workstation to confidently wreck frontier models using off-the-shelf components on a budget of approximately nothing? It's actually really simple. You could answer in one word: 'asymmetry'.

3. Building a Better Bullshitter

You must remember that every single "model-as-a-service" (MaaS) company has a rather severe vulnerability: they have to scrimp on inference costs in order to scale, or else do heavy loss-leading. They're attempting to serve millions or billions of queries, as quickly and as cheaply as possible, and you are not.

Thus, everything that you can do in your LLM sampling code to improve output is worth doing. Everything that you can do there is pure gravy to you, and pure detriment to them. That which improves inference is your strength and your shield, but a millstone around their neck. An asymmetry, you could say, and where you find these sorts of asymmetries, they really have to be exploited. It's a crying shame otherwise.

This means the MaaS crowd, however strong their secret sauce, lose to somebody platinum-plating their local sampler; the fact is, nucleus/top-k sampling is largely irrelevant to predicting text successfully, much like a Gaussian model of noise isn't actually necessary to make a diffusion model converge. It just makes the math/code a little easier.
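
For reference, that stock sampler everyone ships is just a few lines of bookkeeping; this is the baseline the rest of this section is about improving on (a sketch in plain numpy):

```python
# Baseline temperature + nucleus (top-p) sampling over a next-token logit
# vector. Everything discussed below is what you can bolt on beyond this.
import numpy as np

def sample_top_p(logits, p=0.9, temperature=0.8, rng=None):
    rng = rng or np.random.default_rng()
    z = (logits - logits.max()) / max(temperature, 1e-6)
    probs = np.exp(z) / np.exp(z).sum()
    order = np.argsort(-probs)                                    # most to least likely
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1  # smallest set with mass >= p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()                        # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```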

Classifier-free guidance? Fitting generations to BNF grammars? Adaptive mirostat sampling? Running a mixture of experts? Adding an elaborate retrieval-based system or knowledge graph to increase factual accuracy? All of these things improve generation in almost any use case; all (and more) should be considered.

Let's take classifier-free guidance, for example. Basically, for CFG, you run a negative generation first -- something that will predict text with qualities opposite to or unlike what you want -- and then, when you generate for real, you penalize logits that are heavily favored by the negative generation. A well-crafted CFG can add poetry and spice to anything. It's like the mirror-universe, much sexier, inverted version of reinforcement learning from feedback. The blighted MaaS companies have to do RLHF instead, because CFG doubles your inference cost on every query, and doubling inference costs at scale would flat-out break them, while RLHF is trained into the model and is in theory something you only have to do once. (In practice, of course, they have to keep doing it forever, because RLHF/RLAIF isn't actually capable of productionizing the unproductionizable; unfortunately, the sand shifts under them as they build. It's a never-ending trial of Heracles to get it to work against adversarial users without damaging your model beyond repair. Sucks to be them, doesn't it?)
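
In logit terms the trick is one line of arithmetic per step; a bare-bones sketch, assuming a HuggingFace-style causal LM (the function and parameter names are mine, not any official API):

```python
# Bare-bones classifier-free guidance for a causal LM: one forward pass on the
# real context, one on the "negative" context, then push the combined logits
# away from whatever the negative pass favors.
import torch

@torch.no_grad()
def cfg_next_token_logits(model, cond_ids, neg_ids, gamma=1.5):
    logits_cond = model(input_ids=cond_ids).logits[:, -1, :]
    logits_neg = model(input_ids=neg_ids).logits[:, -1, :]
    # gamma = 1.0 recovers plain sampling; gamma > 1 amplifies the difference.
    return logits_neg + gamma * (logits_cond - logits_neg)

# Sample from these logits as usual (top-p, mirostat, ...), append the chosen
# token to *both* contexts, and repeat.
```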

It's amusing to think about, a little delicious in fact, but it's the cheap stuff (for anybody that isn't worried about serving at scale) that is good here, sturdy and strong and reliable, and the expensive stuff that doesn't actually work right.

There is an active community on Kaggle, of course, working on better LLM inference, as well as integration with external retrieval-based systems, including regular competitions (here's a recent one). I am not a big believer myself in chasing benchmarks for LLMs -- they don't capture enough -- but even so, what you learn along the way is important, and every one of the solutions presented there is worth reading if you're interested in practically improving your experience with your Robot Friend. Lots of interesting approaches to explore can be found in those writeups: different implementations of RAG systems, finetuned models based on synthetic data, etc. (I notice that just calculating embeddings of all of Wikipedia for retrieval and having good ol' XWin 70b do the work was good enough for third place... the cash money prizes for the top 5 are a little interesting. Hm.)
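
That third-place recipe really is about this simple; a minimal sketch of the retrieval half, with a stand-in embedder and a placeholder chunk file (the filenames and helpers are invented):

```python
# Minimal retrieval loop: embed the question, pull the nearest passages from a
# pre-embedded corpus, and stuff them into the prompt for your local model.
# The chunk file and the generation hand-off are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = open("wiki_chunks.txt", encoding="utf-8").read().split("\n\n")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # compute once, cache to disk

def retrieve(question, k=5):
    q = embedder.encode(question, normalize_embeddings=True)
    top = np.argsort(-(chunk_vecs @ q))[:k]
    return [chunks[i] for i in top]

def build_prompt(question):
    context = "\n\n".join(retrieve(question))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

# generate(build_prompt("Who designed the Saint-Gaudens double eagle?"))  # hand off to your local 70b
```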

The easiest way to beat gpt-4 or Gemini generally isn't to build a trillion-parameter LM, you understand, it's to sample your 180 billion parameter LM smarter; there is a helpful market asymmetry to exploit to your advantage. How else does David slay Goliath?

Working on the inference in ways akin to this has given me a fairly Big Idea that, while probably not new to the frontier labs, certainly hasn't been published before. If I had King Charles over for tea, the language model thing I'd want to show him wouldn't be GPT 4V or Gemini. It would be this.

You know how you can fake multimodality, to a degree, by training in special tokens that encode image data or audio streams or instructions to use a specific app/tool? What if you train special tokens that are control hints to the sampler, about (say) in-context memorization, or tone, or the proper negative rubric to use in classifier-free guidance? How about adjusting perplexity targets in mirostat sampling according to those control tokens?
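
Very roughly, and with token names invented purely for illustration (this is a sketch of the idea, not the actual implementation), the sampler side might look like:

```python
# Toy sketch: the model emits hints like <|ctl:precise|> or <|ctl:freewheel|>,
# and the sampler maps them onto its own knobs -- here a mirostat-style target
# surprise and the CFG strength. Token names and values are invented.
SAMPLER_PRESETS = {
    "<|ctl:precise|>":   {"mirostat_tau": 2.0, "cfg_gamma": 2.0},
    "<|ctl:neutral|>":   {"mirostat_tau": 4.0, "cfg_gamma": 1.5},
    "<|ctl:freewheel|>": {"mirostat_tau": 6.0, "cfg_gamma": 1.0},
}

def sampler_settings(context: str) -> dict:
    # The most recently emitted control token wins; default to neutral.
    best_pos, settings = -1, SAMPLER_PRESETS["<|ctl:neutral|>"]
    for token, preset in SAMPLER_PRESETS.items():
        pos = context.rfind(token)
        if pos > best_pos:
            best_pos, settings = pos, preset
    return settings
```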

I'm working it out on a toy model (~1B parameters) now locally, and assembling the full-size pretrain set, annotating mountains of text and code (some by hand, some automated: it's like the Sorcerer's Apprentice over here) with these control tokens. Finding the compute to train the big boy will be the hard part, but unless we do something spectacularly stupid on the regulation front (quite possible; the Powers That Be are entirely clueless about this stuff!) there will be plenty of compute for this kind of application and it should be possible in another year or so.

And about that: all the people talking now about "compute governance" and "model licensing" and "CERN for AI" are fighting last year's battles. That is the past. Everybody -- not literally everybody, but everybody that's paying attention -- knows the score; there isn't actually a secret here: the garden of machine learning is fertilized with data and watered by compute. Proliferation is inevitable now; none of this is going to stay contained. Have you seen NVIDIA's numbers? Do some quick coffee-table figuring on how much compute is out there: you could train gpt-4 hundreds of times on what's just lying around. The compute is abundant, and everybody that's anybody has their own personal rich, deep data lake of legitimate and dubious provenance. If you wanted 'containment', we should have stopped at transistors, never mind transformers; the real bitch is that general computation is, in fact, general. I've alluded to what you can do with Falcon-180b and zero gpus; even the relatively poor can participate; from there it isn't hard to see why this sort of approach to the thing is doomed. The laws of economics aren't quite as strong as the laws of physics, but they are far stronger than the laws of humans.

The only regulatory question around "AI" that's actually real is this one: will you be permitted to benefit from the advances in deep learning, or will only the rich or well-connected benefit? This is it; this is all there is. The technology isn't going anywhere, it will be used, and yes, many of the jobs we know today will disappear; the only thing that will actually be argued and potentially legislated is your access to it. And you should care about that; in terms of comparative advantage, given even close to equitable distribution of strong models, you, personally, benefit far more from this technology than the currently rich and powerful do; this seems to irk people when I point it out, I assume because it's transparently true. But it is: what does "generative AI" actually permit a large megacorp to do that it couldn't do before? What does it allow you to do? With even close to equal access, what can they do that you can't do? The only way the powerful win here, you understand, is if they're able to capture/gatekeep the technology for themselves; you having it makes you too powerful.

Yet I see a lot of people on social media -- often people claiming to be socialists or Marxists or leftists, no less! -- declaring in public, like fools or purchased men, that the means of production of the 21st century are too scary for the people to touch. I see where their heart really is, and you should, too.

I am on occasion mightily pleased that we, the little people, are a fly in the soup for the Big Guys. There isn't going to be an oligopoly run by the frontier model cartel, as long as the open-weights community is doing what we're doing. They have almost no pricing power as long as anybody with a decent computer can run their own gpt-3.5 or Midjourney at home. And they can't carry on with offering frontier model inference at loss-leader rates forever. It's their own fault, and they deserve all the anxiety that's coming to them. They'll continue to go up before Congress, and tell tall tales out of school about Skynet, maybe pee their pants and cry a little bit for emphasis, and beg the government to merk their competition, pretty-please? The laws of economics are on our side, you understand; they have to resort to evil stunts in order to win. Let's not fall for that, shall we?

The dream of a New Mainframe Age, where we all have to make offerings to the high priests of compute to have our work done by the mighty bullshit machines, and they (the barons of bullshit, our technofeudalist overlords) get to exact their levy on all economic activity, is growing dimmer and more distant in the minds of all these bastards, and it's all of our duty to destroy it utterly. It must be defeated. The problem isn't the technology, you understand: it's the value-destroying VC asswipes taking up all the space trying to bully you out of it.

Oh yes, the geopolitical implications are still super scary. There's an actual cold war on compute going on: we have all kinds of export bans on GPUs and are considering them for models, too (China will keep plugging along regardless; they have millions of times more resources than I do, they'll be fine). A year later I'm still losing a little sleep over what nation-states that think they're falling too far behind -- or think they're far enough ahead -- might do. Again, keeping research out in the open, rather than locking it up internally, limits this danger, but people have collectively decided they don't want that world. It's madness, really, but no, Your Majesty -- thanks for coming to tea! -- there is nothing I or even you can do about that. No regulation, no legislation, no royal decree or papal bull can change it. I can do practically anything with a computer -- I am some kind of a wizard! -- but I can't stop one country from wanting to blow up another country. That's just human nature, and until you can change that, we're all going to have to live with the risk. Gives you the warm fuzzies all over, doesn't it? I don't think Prometheus was really wrong though, do you?

Bah and bother. I may not be able to lead the world in the dance -- who would want the responsibility even if they could? -- but I can make unto myself the desideratum. It's a weird world after all, because you won't need series C funding or a bank of Grace Hoppers to effect it, either. I'm not just the nitwit neighbor from 1905 with the giant wireless aerial in the backyard, I'm laying submarine cable and hammering an aeroplane together out of balsa wood, too. All there is to do, is to build. I will know what can be known. And someday, very soon in the astrophysical scheme of things, very distant from now in the computer's, God willing, I will die in peace upon a local minimum of regret.