
What Is ChatGPT Doing, and Why Does It Work? (Part 1)

梵天Daozen 2023-04-14

Foreword


Stephen Wolfram, the creator of WolframAlpha, wrote a long article explaining how ChatGPT works (https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/#circle=on). It is accessible without much background knowledge, and it is very helpful for understanding the principles behind machine learning, neural networks, and NLP.

A good cook doesn't need to know how the cookware was made, and a seasoned driver doesn't need to know the whole car-manufacturing process. With AIGC booming lately, ordinary users likewise don't need to dig deeply into the underlying technology; learning how to put it to productive use pays off faster. But if, like me, you are curious about how it is actually implemented, want better insight into it, want to know these technologies' capability boundaries and where they are heading, and want to separate truth from hype amid the flood of noisy and exaggerated claims without getting fleeced, then I recommend reading this article closely.

Since the original is in English, and the article is very long, simply recommending it would likely mean it ends up gathering dust in most readers' bookmarks. The Chinese translations available online are mostly machine translations, and reading them directly invites misunderstanding. So I decided to do three things myself:

  • Provide the English original and a translation paragraph by paragraph. The translation builds on a translation tool and is proofread and revised as far as possible toward faithfulness, fluency, and elegance; where a translation falls short, you can quickly check the original;

  • Add notes where they help understanding;

  • Split the long article into several installments, to lower the odds of it gathering dust.


Aside: I tried asking ChatGPT and New Bing to do the translation. The former completely misunderstood the task, and the latter slyly handed the job off to Google Translate (I'm curious why it didn't use Bing's own translator).

I wonder how many people dig into what the "GPT" in ChatGPT actually means. Its full name is Generative Pre-trained Transformer, and this first part explains precisely what "Generative" means.

The main text follows (my notes appear in black italics).


It’s Just Adding One Word at a Time


That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what’s going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text. I should say at the outset that I’m going to focus on the big picture of what’s going on—and while I’ll mention some engineering details, I won’t get deeply into them. (And the essence of what I’ll say applies just as well to other current “large language models” [LLMs] as to ChatGPT.)


The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”


So let’s say we’ve got the text “The best thing about AI is its ability to”. Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time. ChatGPT effectively does something like this, except that (as I’ll explain) it doesn’t look at literal text; it looks for things that in a certain sense “match in meaning”. But the end result is that it produces a ranked list of words that might follow, together with “probabilities”:

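The counting procedure described above can be sketched in a few lines of Python. A toy corpus stands in for the billions of pages, and the function name is my own:

```python
from collections import Counter

def next_word_distribution(corpus, prefix):
    """Find every occurrence of `prefix` in `corpus`, count which word
    comes next each time, and turn the counts into a ranked list of
    (word, probability) pairs."""
    words = corpus.split()
    prefix_words = prefix.split()
    n = len(prefix_words)
    follow = Counter()
    for i in range(len(words) - n):
        if words[i:i + n] == prefix_words:
            follow[words[i + n]] += 1
    total = sum(follow.values())
    return [(w, c / total) for w, c in follow.most_common()]

# A tiny toy corpus; the real procedure would scan billions of pages.
corpus = ("the best thing about ai is its ability to learn "
          "the best thing about ai is its ability to adapt "
          "the best thing about ai is its ability to learn")
print(next_word_distribution(corpus, "its ability to"))
```

On this toy corpus the prefix is followed twice by "learn" and once by "adapt", so the ranked list comes out as learn (2/3), adapt (1/3).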

And the remarkable thing is that when ChatGPT does something like write an essay what it’s essentially doing is just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word. (More precisely, as I’ll explain, it’s adding a “token”, which could be just a part of a word, which is why it can sometimes “make up new words”.)


But, OK, at each step it gets a list of words with probabilities. But which one should it actually pick to add to the essay (or whatever) that it’s writing? One might think it should be the “highest-ranked” word (i.e. the one to which the highest “probability” was assigned). But this is where a bit of voodoo begins to creep in. Because for some reason—that maybe one day we’ll have a scientific-style understanding of—if we always pick the highest-ranked word, we’ll typically get a very “flat” essay, that never seems to “show any creativity” (and even sometimes repeats word for word). But if sometimes (at random) we pick lower-ranked words, we get a “more interesting” essay.

(Note: here "voodoo" means deliberately introducing a bit of disorder to trigger some unexpected chemistry.)

The fact that there’s randomness here means that if we use the same prompt multiple times, we’re likely to get different essays each time. And, in keeping with the idea of voodoo, there’s a particular so-called “temperature” parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a “temperature” of 0.8 seems best. (It’s worth emphasizing that there’s no “theory” being used here; it’s just a matter of what’s been found to work in practice. And for example the concept of “temperature” is there because exponential distributions familiar from statistical physics happen to be being used, but there’s no “physical” connection—at least so far as we know.)

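One common way to realize such a "temperature" is to raise each probability to the power 1/temperature and renormalize before sampling. The sketch below assumes that exponential re-weighting scheme; the function name and word list are mine, and this is not OpenAI's actual implementation:

```python
import math
import random

def sample_next_word(ranked, temperature=0.8, rng=random):
    """Sample from a ranked list of (word, probability) pairs after
    re-weighting by a 'temperature'. As temperature approaches 0 this
    approaches always picking the top word; higher temperatures pick
    lower-ranked words more often."""
    words = [w for w, _ in ranked]
    # Divide each log-probability by the temperature, i.e. raise each
    # probability to the power 1/temperature; choices() renormalizes.
    weights = [math.exp(math.log(p) / temperature) for _, p in ranked]
    return rng.choices(words, weights=weights)[0]

ranked = [("learn", 0.5), ("adapt", 0.3), ("create", 0.2)]
# Near-zero temperature behaves almost exactly like greedy selection.
print(sample_next_word(ranked, temperature=0.01))
# Temperature 0.8 keeps some chance of the lower-ranked words.
print(sample_next_word(ranked, temperature=0.8))
```

At temperature 0.01 the re-weighted probability mass on "learn" is overwhelming, so the output is effectively deterministic, while at 0.8 the lower-ranked words still appear a meaningful fraction of the time.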

Before we go on I should explain that for purposes of exposition I’m mostly not going to use the full system that’s in ChatGPT; instead I’ll usually work with a simpler GPT-2 system, which has the nice feature that it’s small enough to be able to run on a standard desktop computer. And so for essentially everything I show I’ll be able to include explicit Wolfram Language code that you can immediately run on your computer. (Click any picture here to copy the code behind it.)

(Note: in the original article, the code behind each picture can actually be run and copied.)

For example, here’s how to get the table of probabilities above. First, we have to retrieve the underlying “language model” neural net:


Later on, we’ll look inside this neural net, and talk about how it works. But for now we can just apply this “net model” as a black box to our text so far, and ask for the top 5 words by probability that the model says should follow:


This takes that result and makes it into an explicit formatted “dataset”:

(Note: i.e., it returns the five most likely words together with their corresponding probabilities.)

Here’s what happens if one repeatedly “applies the model”—at each step adding the word that has the top probability (specified in this code as the “decision” from the model):

(Note: i.e., at each step it simply continues the text with the single word the model judges most probable.)
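This repeated "apply the model, take the top word" loop can be sketched like so. The stand-in model below is a hypothetical lookup table with made-up words and probabilities, not the real GPT-2 net:

```python
# A toy stand-in for the "net model": it maps the most recent word to a
# ranked list of (next word, probability). Purely illustrative data.
toy_model = {
    "to":     [("do", 0.5), ("learn", 0.3), ("predict", 0.2)],
    "do":     [("things", 0.6), ("anything", 0.4)],
    "things": [("that", 0.7), ("which", 0.3)],
    "that":   [("are", 0.8), ("seem", 0.2)],
}

def continue_greedily(text, steps):
    """'Zero temperature' decoding: at every step append the word the
    model ranks highest (its 'decision')."""
    words = text.split()
    for _ in range(steps):
        ranked = toy_model.get(words[-1])
        if ranked is None:
            break
        words.append(ranked[0][0])
    return " ".join(words)

print(continue_greedily("The best thing about AI is its ability to", 4))
# The best thing about AI is its ability to do things that are
```

Because every step is deterministic, running this again always yields exactly the same continuation, which is precisely why zero-temperature text tends to repeat itself.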

What happens if one goes on longer? In this (“zero temperature”) case what comes out soon gets rather confused and repetitive:

(Note: the text below degenerates into the endlessly repetitive style of a clickbait content-farm editor.)

But what if instead of always picking the “top” word one sometimes randomly picks “non-top” words (with the “randomness” corresponding to “temperature” 0.8)? Again one can build up text:


And every time one does this, different random choices will be made, and the text will be different—as in these 5 examples:


It’s worth pointing out that even at the first step there are a lot of possible “next words” to choose from (at temperature 0.8), though their probabilities fall off quite quickly (and, yes, the straight line on this log-log plot corresponds to an n^(–1) “power-law” decay that’s very characteristic of the general statistics of language):


So what happens if one goes on longer? Here’s a random example. It’s better than the top-word (zero temperature) case, but still at best a bit weird:


This was done with the simplest GPT-2 model (from 2019). With the newer and bigger GPT-3 models the results are better. Here’s the top-word (zero temperature) text produced with the same “prompt”, but with the biggest GPT-3 model:


And here’s a random example at “temperature 0.8”:


Where Do the Probabilities Come From?


OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let’s start with a simpler problem. Let’s consider generating English text one letter (rather than word) at a time. How can we work out what the probability for each letter should be?


A very minimal thing we could do is just take a sample of English text, and calculate how often different letters occur in it. So, for example, this counts letters in the Wikipedia article on “cats”:

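The letter-counting step can be reproduced with a short sketch; a one-sentence sample stands in for the full Wikipedia article on cats:

```python
from collections import Counter

def letter_counts(text):
    """Count how often each letter occurs, ignoring case and anything
    that isn't a letter."""
    return Counter(c for c in text.lower() if c.isalpha())

# A stand-in snippet; the article counts letters in the whole page.
sample = "The cat sat on the mat."
print(letter_counts(sample).most_common(3))
```

Even this tiny sample shows the basic shape: a handful of letters ("t", "a") dominate, and the distribution would stabilize with a large enough text.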

And this does the same thing for “dogs”:


The results are similar, but not the same (“o” is no doubt more common in the “dogs” article because, after all, it occurs in the word “dog” itself). Still, if we take a large enough sample of English text we can expect to eventually get at least fairly consistent results:


Here’s a sample of what we get if we just generate a sequence of letters with these probabilities:


We can break this into “words” by adding in spaces as if they were letters with a certain probability:

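Treating the space as "just another letter" can be sketched like this; the character probabilities are estimated from a short stand-in sample rather than a large corpus:

```python
import random
from collections import Counter

def sample_characters(source_text, n, seed=1):
    """Estimate per-character probabilities (letters plus the space,
    treated as just another character) from source_text, then generate
    n characters, each drawn independently at random."""
    chars = [c for c in source_text.lower() if c.isalpha() or c == " "]
    counts = Counter(chars)
    pool, weights = zip(*counts.items())
    rng = random.Random(seed)
    return "".join(rng.choices(pool, weights=weights, k=n))

sample = "a good cook does not need to know how the cookware was made"
print(sample_characters(sample, 40))
```

Because each character is drawn independently, the spaces break the stream into "word"-like chunks, but the chunks themselves are gibberish, just as in the article's example.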

We can do a slightly better job of making “words” by forcing the distribution of “word lengths” to agree with what it is in English:


We didn’t happen to get any “actual words” here, but the results are looking slightly better. To go further, though, we need to do more than just pick each letter separately at random. And, for example, we know that if we have a “q”, the next letter basically has to be “u”.


Here’s a plot of the probabilities for letters on their own:


And here’s a plot that shows the probabilities of pairs of letters (“2-grams”) in typical English text. The possible first letters are shown across the page, the second letters down the page:

(Note: this is a 2D heatmap; the darker a point, the higher the probability of the letter pair formed by the corresponding letters on the two axes.)

And we see here, for example, that the “q” column is blank (zero probability) except on the “u” row. OK, so now instead of generating our “words” a single letter at a time, let’s generate them looking at two letters at a time, using these “2-gram” probabilities. Here’s a sample of the result—which happens to include a few “actual words”:

(Note: i.e., in real text, apart from “qu”, no other letter pair “q*” beginning with q occurs.)
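A minimal version of this 2-gram training and generation, assuming a tiny training text in place of "typical English text":

```python
import random
from collections import Counter, defaultdict

def train_letter_bigrams(text):
    """Estimate P(next letter | current letter) within words."""
    table = defaultdict(Counter)
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            table[a][b] += 1
    return table

def sample_word(table, length, seed=2):
    """Generate a 'word' one letter at a time from the bigram table."""
    rng = random.Random(seed)
    letters = [rng.choice(sorted(table))]
    for _ in range(length - 1):
        followers = table.get(letters[-1])
        if not followers:
            break
        pool, weights = zip(*followers.items())
        letters.append(rng.choices(pool, weights=weights)[0])
    return "".join(letters)

text = "the quick brown fox jumps over the lazy dog and quietly quits"
table = train_letter_bigrams(text)
print([sample_word(table, 6, seed=s) for s in range(5)])
```

Note that in this training text "q" is only ever followed by "u", so the generated words automatically respect that regularity, exactly the "q"-column effect visible in the heatmap.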

With sufficiently much English text we can get pretty good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters. And if we generate “random words” with progressively longer n-gram probabilities, we see that they get progressively “more realistic”:


But let’s now assume—more or less as ChatGPT does—that we’re dealing with whole words, not letters. There are about 40,000 reasonably commonly used words in English. And by looking at a large corpus of English text (say a few million books, with altogether a few hundred billion words), we can get an estimate of how common each word is. And using this we can start generating “sentences”, in which each word is independently picked at random, with the same probability that it appears in the corpus. Here’s a sample of what we get:

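Drawing each word independently with its corpus frequency can be sketched as follows; a toy corpus stands in for the few-hundred-billion-word one:

```python
import random
from collections import Counter

def sample_words_independently(corpus, n_words, seed=3):
    """Estimate each word's frequency in the corpus, then build a
    'sentence' by drawing every word independently at random with
    that frequency."""
    counts = Counter(corpus.lower().split())
    pool, weights = zip(*counts.items())
    rng = random.Random(seed)
    return " ".join(rng.choices(pool, weights=weights, k=n_words))

corpus = ("the cat sat on the mat and the cat saw the dog "
          "and the dog saw the cat through the door")
print(sample_words_independently(corpus, 10))
```

Every word here is a real word, but since each draw ignores its neighbors, the result is grammatical nonsense, which is the point the article makes next.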

Not surprisingly, this is nonsense. So how can we do better? Just like with letters, we can start taking into account not just probabilities for single words but probabilities for pairs or longer n-grams of words. Doing this for pairs, here are 5 examples of what we get, in all cases starting from the word “cat”:

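Extending this to pairs of words gives a sketch like the following, started from "cat" as in the article's examples (again on a toy corpus of my own):

```python
import random
from collections import Counter, defaultdict

def train_word_bigrams(corpus):
    """Estimate P(next word | current word) from a corpus."""
    words = corpus.lower().split()
    table = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        table[a][b] += 1
    return table

def generate(table, start, n_words, seed=0):
    """Extend `start` one word at a time using the bigram table."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < n_words:
        followers = table.get(out[-1])
        if not followers:
            break
        pool, weights = zip(*followers.items())
        out.append(rng.choices(pool, weights=weights)[0])
    return " ".join(out)

corpus = ("the cat sat on the mat and the cat saw the dog "
          "and the dog saw the cat through the door")
table = train_word_bigrams(corpus)
for seed in range(3):
    print(generate(table, "cat", 8, seed=seed))
```

Each consecutive pair now comes from the corpus, so the output looks locally plausible even though it still has no overall meaning.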

It’s getting slightly more “sensible looking”. And we might imagine that if we were able to use sufficiently long n-grams we’d basically “get a ChatGPT”—in the sense that we’d get something that would generate essay-length sequences of words with the “correct overall essay probabilities”. But here’s the problem: there just isn’t even close to enough English text that’s ever been written to be able to deduce those probabilities.


In a crawl of the web there might be a few hundred billion words; in books that have been digitized there might be another hundred billion words. But with 40,000 common words, even the number of possible 2-grams is already 1.6 billion—and the number of possible 3-grams is 60 trillion. So there’s no way we can estimate the probabilities even for all of these from text that’s out there. And by the time we get to “essay fragments” of 20 words, the number of possibilities is larger than the number of particles in the universe, so in a sense they could never all be written down.

(Note: i.e., 40,000 × 40,000.)
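The counting argument in this paragraph checks out directly:

```python
common_words = 40_000
print(common_words ** 2)  # 1,600,000,000 possible 2-grams (1.6 billion)
print(common_words ** 3)  # 64,000,000,000,000 possible 3-grams
                          # (on the order of 60 trillion)
```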

So what can we do? The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at. And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.


Coming Up Next


The next part starts to look at how the model works, i.e., how ChatGPT (and models like it) compute the answers it needs and the corresponding probabilities.


