What Is ChatGPT Doing … and Why Does It Work? (Part 7)

梵天Daozen 2023-04-14

This continues the series in which Stephen Wolfram explains how ChatGPT works. This installment begins the discussion of ChatGPT's internal structure and how it is implemented.


Inside ChatGPT 


OK, so we’re finally ready to discuss what’s inside ChatGPT. And, yes, ultimately, it’s a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights. In many ways this is a neural net very much like the other ones we’ve discussed. But it’s a neural net that’s particularly set up for dealing with language. And its most notable feature is a piece of neural net architecture called a “transformer”.

In the first neural nets we discussed above, every neuron at any given layer was basically connected (at least with some weight) to every neuron on the layer before. But this kind of fully connected network is (presumably) overkill if one’s working with data that has particular, known structure. And thus, for example, in the early stages of dealing with images, it’s typical to use so-called convolutional neural nets (“convnets”) in which neurons are effectively laid out on a grid analogous to the pixels in the image—and connected only to neurons nearby on the grid.

The idea of transformers is to do something at least somewhat similar for sequences of tokens that make up a piece of text. But instead of just defining a fixed region in the sequence over which there can be connections, transformers instead introduce the notion of “attention”—and the idea of “paying attention” more to some parts of the sequence than others. Maybe one day it’ll make sense to just start a generic neural net and do all customization through training. But at least as of now it seems to be critical in practice to “modularize” things—as transformers do, and probably as our brains also do.

OK, so what does ChatGPT (or, rather, the GPT-3 network on which it’s based) actually do? Recall that its overall goal is to continue text in a “reasonable” way, based on what it’s seen from the training it’s had (which consists in looking at billions of pages of text from the web, etc.) So at any given point, it’s got a certain amount of text—and its goal is to come up with an appropriate choice for the next token to add.

It operates in three basic stages. First, it takes the sequence of tokens that corresponds to the text so far, and finds an embedding (i.e. an array of numbers) that represents these. Then it operates on this embedding—in a “standard neural net way”, with values “rippling through” successive layers in a network—to produce a new embedding (i.e. a new array of numbers). It then takes the last part of this array and generates from it an array of about 50,000 values that turn into probabilities for different possible next tokens. (And, yes, it so happens that there are about the same number of tokens used as there are common words in English, though only about 3000 of the tokens are whole words, and the rest are fragments.)
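
To make these three stages a little more concrete, here is a minimal PyTorch sketch. It is not the actual GPT code: the class and layer names are placeholders, the dimensions are GPT-2's, and the causal attention mask is omitted for brevity:

import torch
import torch.nn as nn

# A schematic of the three stages: embed the tokens, "ripple" through the layers, decode the last position.
class ThreeStageSketch(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # stage 1: tokens -> embedding vectors
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]                              # stage 2: successive network layers
        )
        self.unembed = nn.Linear(d_model, vocab_size)               # stage 3: last embedding -> ~50,000 scores

    def forward(self, token_ids):                                   # token_ids: (batch, sequence_length)
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        return torch.softmax(self.unembed(x[:, -1, :]), dim=-1)     # probabilities for the next token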

A critical point is that every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network. In other words, in effect nothing except the overall architecture is “explicitly engineered”; everything is just “learned” from training data.

There are, however, plenty of details in the way the architecture is set up—reflecting all sorts of experience and neural net lore. And—even though this is definitely going into the weeds—I think it’s useful to talk about some of those details, not least to get a sense of just what goes into building something like ChatGPT.

First comes the embedding module. Here’s a schematic Wolfram Language representation for it for GPT-2:

The input is a vector of n tokens (represented as in the previous section by integers from 1 to about 50,000). Each of these tokens is converted (by a single-layer neural net) into an embedding vector (of length 768 for GPT-2 and 12,288 for ChatGPT’s GPT-3). Meanwhile, there’s a “secondary pathway” that takes the sequence of (integer) positions for the tokens, and from these integers creates another embedding vector. And finally the embedding vectors from the token value and the token position are added together—to produce the final sequence of embedding vectors from the embedding module.
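
A minimal sketch of this embedding module, assuming GPT-2's dimensions (a vocabulary of about 50,000 tokens, 768-dimensional embeddings, and a maximum of 1024 token positions):

import torch
import torch.nn as nn

class EmbeddingModule(nn.Module):
    """Sketch of the embedding module: token embedding plus position embedding, simply added."""
    def __init__(self, vocab_size=50257, max_positions=1024, d_model=768):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)        # single-layer net: token id -> 768 numbers
        self.position_embedding = nn.Embedding(max_positions, d_model)  # the "secondary pathway" for positions

    def forward(self, token_ids):                                       # token_ids: (batch, n)
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        return self.token_embedding(token_ids) + self.position_embedding(positions)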

Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work. And it’s part of the lore of neural nets that—in some sense—so long as the setup one has is “roughly right” it’s usually possible to home in on details just by doing sufficient training, without ever really needing to “understand at an engineering level” quite how the neural net has ended up configuring itself.

Here’s what the embedding module does, operating on the string hello hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye:

The elements of the embedding vector for each token are shown down the page, and across the page we see first a run of “hello” embeddings, followed by a run of “bye” ones. The second array above is the positional embedding—with its somewhat-random-looking structure being just what “happened to be learned” (in this case in GPT-2).
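
To poke at something like this picture yourself, one can pull the two embedding tables out of the Hugging Face implementation of GPT-2 (an assumption about tooling; the plots here were made in the Wolfram Language):

import torch
from transformers import GPT2TokenizerFast, GPT2Model

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

text = ("hello " * 10 + "bye " * 10).strip()
ids = tokenizer(text, return_tensors="pt")["input_ids"]      # the BPE may split a word, so the count can differ slightly from 20

token_embeddings = model.wte(ids[0])                          # one 768-element column per token value
position_embeddings = model.wpe(torch.arange(ids.shape[1]))   # the learned, somewhat-random-looking positional part
combined = token_embeddings + position_embeddings             # the embedding module's actual output
print(token_embeddings.shape, position_embeddings.shape, combined.shape)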

OK, so after the embedding module comes the “main event” of the transformer: a sequence of so-called “attention blocks” (12 for GPT-2, 96 for ChatGPT’s GPT-3). It’s all pretty complicated—and reminiscent of typical large hard-to-understand engineering systems, or, for that matter, biological systems. But anyway, here’s a schematic representation of a single “attention block” (for GPT-2):

Within each such attention block there are a collection of “attention heads” (12 for GPT-2, 96 for ChatGPT’s GPT-3)—each of which operates independently on different chunks of values in the embedding vector. (And, yes, we don’t know any particular reason why it’s a good idea to split up the embedding vector, or what the different parts of it “mean”; this is just one of those things that’s been “found to work”.)
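
With GPT-2's numbers (a 768-dimensional embedding and 12 heads) that splitting is just a reshape, roughly:

import torch

d_model, n_heads = 768, 12
head_dim = d_model // n_heads                  # 64 values per attention head

x = torch.randn(20, d_model)                   # embeddings for a 20-token sequence (stand-in data)
chunks = x.view(20, n_heads, head_dim)         # each head gets its own 64-value chunk of every embedding vector
per_head = chunks.transpose(0, 1)              # (12, 20, 64): 12 heads, each operating independently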

OK, so what do the attention heads do? Basically they’re a way of “looking back” in the sequence of tokens (i.e. in the text produced so far), and “packaging up the past” in a form that’s useful for finding the next token. In the first section above we talked about using 2-gram probabilities to pick words based on their immediate predecessors. What the “attention” mechanism in transformers does is to allow “attention to” even much earlier words—thus potentially capturing the way, say, verbs can refer to nouns that appear many words before them in a sentence.
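
Numerically, "paying attention" to earlier tokens comes down to (in the standard transformer formulation, which I am assuming here) a scaled dot product with a causal mask, roughly:

import math
import torch

def causal_attention_weights(queries, keys):
    """How much each position 'looks back' at itself and every earlier position (one head)."""
    n, head_dim = queries.shape
    scores = queries @ keys.T / math.sqrt(head_dim)               # similarity of each position to every other
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # no attending to positions that come later
    return torch.softmax(scores, dim=-1)                          # each row sums to 1

def attention_head(queries, keys, values):
    return causal_attention_weights(queries, keys) @ values       # "package up the past" as a weighted combination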

At a more detailed level, what an attention head does is to recombine chunks in the embedding vectors associated with different tokens, with certain weights. And so, for example, the 12 attention heads in the first attention block (in GPT-2) have the following (“look-back-all-the-way-to-the-beginning-of-the-sequence-of-tokens”) patterns of “recombination weights” for the “hello, bye” string above:
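
Patterns like these can be inspected directly: the Hugging Face GPT-2 model (again an assumed stand-in) will return the attention weights of every head in every block if asked:

import torch
from transformers import GPT2TokenizerFast, GPT2Model

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

ids = tokenizer(("hello " * 10 + "bye " * 10).strip(), return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(ids, output_attentions=True)

first_block = out.attentions[0][0]            # one tensor per block; shape (heads, tokens, tokens)
for head, pattern in enumerate(first_block):  # 12 "recombination weight" matrices for the first block
    print(f"head {head}: {tuple(pattern.shape)}")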

After being processed by the attention heads, the resulting “re-weighted embedding vector” (of length 768 for GPT-2 and length 12,288 for ChatGPT’s GPT-3) is passed through a standard “fully connected” neural net layer. It’s hard to get a handle on what this layer is doing. But here’s a plot of the 768×768 matrix of weights it’s using (here for GPT-2):

Taking 64×64 moving averages, some (random-walk-ish) structure begins to emerge:
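
Reproducing this kind of plot means picking out the weight matrix and smoothing it. My guess (an assumption, not something stated here) is that the 768×768 matrix shown is the attention block's output projection, which in the Hugging Face GPT-2 implementation is h[0].attn.c_proj:

import torch.nn.functional as F
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

w = model.h[0].attn.c_proj.weight.detach()     # assumed: the 768x768 "fully connected" matrix after the heads
smoothed = F.avg_pool2d(w[None, None], kernel_size=64, stride=1)[0, 0]   # 64x64 moving average
print(w.shape, smoothed.shape)                 # (768, 768) -> (705, 705)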

What determines this structure? Ultimately it’s presumably some “neural net encoding” of features of human language. But as of now, what those features might be is quite unknown. In effect, we’re “opening up the brain of ChatGPT” (or at least GPT-2) and discovering, yes, it’s complicated in there, and we don’t understand it—even though in the end it’s producing recognizable human language.

OK, so after going through one attention block, we’ve got a new embedding vector—which is then successively passed through additional attention blocks (a total of 12 for GPT-2; 96 for GPT-3). Each attention block has its own particular pattern of “attention” and “fully connected” weights. Here for GPT-2 are the sequence of attention weights for the “hello, bye” input, for the first attention head:

And here are the (moving-averaged) “matrices” for the fully connected layers:

Curiously, even though these “matrices of weights” in different attention blocks look quite similar, the distributions of the sizes of weights can be somewhat different (and are not always Gaussian):

So after going through all these attention blocks what is the net effect of the transformer? Essentially it’s to transform the original collection of embeddings for the sequence of tokens to a final collection. And the particular way ChatGPT works is then to pick up the last embedding in this collection, and “decode” it to produce a list of probabilities for what token should come next.
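
Concretely, that "decode" step is a final matrix that maps the last 768 (or 12,288) numbers to about 50,000 scores, followed by a softmax. With the Hugging Face GPT-2 model as a small stand-in:

import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("The best thing about AI is its ability to", return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(ids).logits                 # scores for every position; only the last one matters here

probs = torch.softmax(logits[0, -1], dim=-1)   # probabilities over the ~50,000 possible next tokens
top = torch.topk(probs, 5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  {p.item():.3f}")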

So that’s in outline what’s inside ChatGPT. It may seem complicated (not least because of its many inevitably somewhat arbitrary “engineering choices”), but actually the ultimate elements involved are remarkably simple. Because in the end what we’re dealing with is just a neural net made of “artificial neurons”, each doing the simple operation of taking a collection of numerical inputs, and then combining them with certain weights.

The original input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far), and what happens when ChatGPT “runs” to produce a new token is just that these numbers “ripple through” the layers of the neural net, with each neuron “doing its thing” and passing the result to neurons on the next layer. There’s no looping or “going back”. Everything just “feeds forward” through the network.

It’s a very different setup from a typical computational system—like a Turing machine—in which results are repeatedly “reprocessed” by the same computational elements. Here—at least in generating a given token of output—each computational element (i.e. neuron) is used only once.

But there is in a sense still an “outer loop” that reuses computational elements even in ChatGPT. Because when ChatGPT is going to generate a new token, it always “reads” (i.e. takes as input) the whole sequence of tokens that come before it, including tokens that ChatGPT itself has “written” previously. And we can think of this setup as meaning that ChatGPT does—at least at its outermost level—involve a “feedback loop”, albeit one in which every iteration is explicitly visible as a token that appears in the text that it generates.
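
That outer loop is easy to write down; a greedy version (always taking the most probable token, rather than sampling with a "temperature" as ChatGPT actually does) looks roughly like:

import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("The best thing about AI is its ability to", return_tensors="pt")["input_ids"]
for _ in range(20):                                    # the outer loop: one full pass through the net per new token
    with torch.no_grad():
        logits = model(ids).logits                     # the net "reads" the whole sequence so far, its own output included
    next_id = logits[0, -1].argmax()                   # greedy choice, for simplicity
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # the new token becomes part of the next input

print(tokenizer.decode(ids[0]))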

But let’s come back to the core of ChatGPT: the neural net that’s being repeatedly used to generate each token. At some level it’s very simple: a whole collection of identical artificial neurons. And some parts of the network just consist of (“fully connected”) layers of neurons in which every neuron on a given layer is connected (with some weight) to every neuron on the layer before. But particularly with its transformer architecture, ChatGPT has parts with more structure, in which only specific neurons on different layers are connected. (Of course, one could still say that “all neurons are connected”—but some just have zero weight.)

In addition, there are aspects of the neural net in ChatGPT that aren’t most naturally thought of as just consisting of “homogeneous” layers. And for example—as the iconic summary above indicates—inside an attention block there are places where “multiple copies are made” of incoming data, each then going through a different “processing path”, potentially involving a different number of layers, and only later recombining. But while this may be a convenient representation of what’s going on, it’s always at least in principle possible to think of “densely filling in” layers, but just having some weights be zero.

If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights. Implementationally these calculations can be somewhat organized “by layer” into highly parallel array operations that can conveniently be done on GPUs. But for each token that’s produced, there still have to be 175 billion calculations done (and in the end a bit more)—so that, yes, it’s not surprising that it can take a while to generate a long piece of text with ChatGPT.
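
As a rough back-of-the-envelope version of that claim (just the arithmetic; the "bit more" from attention over the sequence is ignored):

weights = 175e9                               # GPT-3's weight count
tokens_to_generate = 1000                     # say, a longish piece of text

per_token = weights                           # every weight takes part in roughly one calculation per token
total = per_token * tokens_to_generate
print(f"{per_token:.2e} calculations per token, {total:.2e} for the whole text")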

But in the end, the remarkable thing is that all these operations—individually as simple as they are—can somehow together manage to do such a good “human-like” job of generating text. It has to be emphasized again that (at least so far as we know) there’s no “ultimate theoretical reason” why anything like this should work. And in fact, as we’ll discuss, I think we have to view this as a—potentially surprising—scientific discovery: that somehow in a neural net like ChatGPT’s it’s possible to capture the essence of what human brains manage to do in generating language.

