ChatGPT是干什么的?为什么它有效?(第九部分)

梵天Daozen 2023-04-17

本文继续大神 Stephen Wolfram 介绍 ChatGPT 原理的系列。这一部分从语言学和生成的角度,讨论了 ChatGPT 的一些特点和缺点;而对于 ChatGPT 究竟是怎样生成既具体又有意义的句子,我们目前仍缺乏非常明确的认知。


What Really Lets ChatGPT Work?

真正让 ChatGPT 发挥作用的是什么?


Human language—and the processes of thinking involved in generating it—have always seemed to represent a kind of pinnacle of complexity. And indeed it’s seemed somewhat remarkable that human brains—with their network of a “mere” 100 billion or so neurons (and maybe 100 trillion connections) could be responsible for it. Perhaps, one might have imagined, there’s something more to brains than their networks of neurons—like some new layer of undiscovered physics. But now with ChatGPT we’ve got an important new piece of information: we know that a pure, artificial neural network with about as many connections as brains have neurons is capable of doing a surprisingly good job of generating human language.

人类语言,以及产生它的思维过程,似乎总是代表着一种复杂性的顶峰。事实上,人类的大脑——“仅仅”由 1000 亿左右的神经元(可能还有 100 万亿个连接)构成的网络——居然能够负责语言,这似乎有些不同寻常。也许,人们可能会想,除了神经元网络之外,大脑还有更多的东西——比如某种尚未发现的新的物理层面。但现在通过 ChatGPT,我们获得了一个重要的新信息:我们知道,一个纯粹的人工神经网络,当其连接数量与大脑的神经元数量相当时,就能够出色地生成人类语言。

And, yes, that’s still a big and complicated system—with about as many neural net weights as there are words of text currently available out there in the world. But at some level it still seems difficult to believe that all the richness of language and the things it can talk about can be encapsulated in such a finite system. Part of what’s going on is no doubt a reflection of the ubiquitous phenomenon (that first became evident in the example of rule 30) that computational processes can in effect greatly amplify the apparent complexity of systems even when their underlying rules are simple. But, actually, as we discussed above, neural nets of the kind used in ChatGPT tend to be specifically constructed to restrict the effect of this phenomenon—and the computational irreducibility associated with it—in the interest of making their training more accessible.

而且,是的,这依然是一个庞大而复杂的系统——其神经网络权重的数量,与世界上目前可获得的文本单词数量差不多。但在某种程度上,似乎仍然很难相信,语言的所有丰富性以及它所能谈论的事物,都可以封装在这样一个有限的系统中。这其中的部分原因,无疑反映了一个普遍存在的现象(它最早在规则 30 的例子中变得明显):即使底层规则很简单,计算过程实际上也可以大大放大系统表面上的复杂性。但是,实际上,正如我们上面所讨论的,ChatGPT 中使用的那种神经网络,往往是被专门构建来限制这种现象——以及与之相关的计算不可约性——的影响的,以便让它们的训练更容易进行。

So how is it, then, that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow simpler than it seems. And this means that ChatGPT—even with its ultimately straightforward neural net structure—is successfully able to “capture the essence” of human language and the thinking behind it. And moreover, in its training, ChatGPT has somehow “implicitly discovered” whatever regularities in language (and thinking) make this possible.

那么,像 ChatGPT 这样的东西是如何在语言方面取得如此成就的呢?我认为,基本的答案是:语言在根本层面上,多少比它看起来要简单。这意味着 ChatGPT——即使其神经网络结构说到底相当简单——也能够成功地“捕捉到”人类语言的本质及其背后的思维。此外,在训练中,ChatGPT 以某种方式“隐含地发现”了语言(和思维)中使这一切成为可能的那些规律性。

The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.

我认为,ChatGPT 的成功为我们提供了一项基础而重要的科学证据:它表明我们可以期待发现重大的新“语言法则”——实际上也是“思想法则”。在作为神经网络构建的 ChatGPT 中,这些法则充其量是隐含的。但是,如果我们能以某种方式把这些法则明确化,就有可能以更直接、更高效、更透明的方式做 ChatGPT 所做的事情。

But, OK, so what might these laws be like? Ultimately they must give us some kind of prescription for how language—and the things we say with it—are put together. Later we’ll discuss how “looking inside ChatGPT” may be able to give us some hints about this, and how what we know from building computational language suggests a path forward. But first let’s discuss two long-known examples of what amount to “laws of language”—and how they relate to the operation of ChatGPT.

但是,好吧,那么这些法则可能是什么样的呢?最终,它们必须为我们提供某种关于语言——以及我们用语言表达的东西——是如何组合在一起的规定。稍后我们将讨论“深入 ChatGPT 内部看一看”如何能为此提供一些提示,以及我们从构建计算语言中了解到的知识如何为我们指明前进的道路。但首先让我们讨论两个早已为人所知、堪称“语言法则”的例子,以及它们与 ChatGPT 的运作有什么关系。

The first is the syntax of language. Language is not just a random jumble of words. Instead, there are (fairly) definite grammatical rules for how words of different kinds can be put together: in English, for example, nouns can be preceded by adjectives and followed by verbs, but typically two nouns can’t be right next to each other. Such grammatical structure can (at least approximately) be captured by a set of rules that define how what amount to “parse trees” can be put together:

首先是语言的句法。语言不仅仅是随机的单词组合。相反,对于不同种类的词如何组合在一起,存在(相当)明确的语法规则:例如,在英语中,名词前面可以加形容词、后面可以跟动词,但通常两个名词不能紧挨在一起。这种语法结构可以(至少近似地)通过一组规则来捕获,这些规则定义了“解析树”可以怎样拼装起来:
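
注:为了直观说明“由规则拼装出解析树/句子”是什么意思,下面给出一段极简的 Python 示意代码。其中的玩具文法和单词都是为演示随意假设的,并非原文插图,也不代表任何真实的英语文法:

```python
import random

# 一个极小的玩具文法:每个非终结符对应若干条可选的展开规则
grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Adj", "N"], ["N"]],
    "VP":  [["V", "NP"]],
    "Adj": [["blue"], ["inquisitive"]],
    "N":   [["fish"], ["electron"], ["theory"]],
    "V":   [["eats"], ["sees"]],
}

def expand(symbol):
    """递归展开一个符号;不在文法里的符号视为终结符(即单词)。"""
    if symbol not in grammar:
        return [symbol]
    rule = random.choice(grammar[symbol])
    return [word for part in rule for word in expand(part)]

print(" ".join(expand("S")))   # 例如:inquisitive electron eats blue theory
```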

ChatGPT doesn’t have any explicit “knowledge” of such rules. But somehow in its training it implicitly “discovers” them—and then seems to be good at following them. So how does this work? At a “big picture” level it’s not clear. But to get some insight it’s perhaps instructive to look at a much simpler example.

ChatGPT 对此类规则没有任何明确的“知识”。但在训练过程中,它以某种方式隐含地“发现”了它们——然后似乎很擅长遵循它们。那么这是如何做到的呢?在“大图景”层面上,这一点还不清楚。但为了获得一些洞察,看一个简单得多的例子也许会有所启发。

Consider a “language” formed from sequences of (’s and )’s, with a grammar that specifies that parentheses should always be balanced, as represented by a parse tree like:

考虑一种由 ( 和 ) (注:即左右括号)序列形成的“语言”,其语法指定括号应始终平衡,如解析树所示:
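注:下面是一段示意性的 Python 草稿(函数名与参数均为假设),按“括号必须始终平衡”这条文法递归生成合法序列;可以想象,后文实验里的训练样本大致就是这样得到的:

```python
import random

# 平衡括号语言的文法大致是:S -> "" | "(" S ")" S
# 按这一文法做随机递归展开,得到的序列必然是"语法正确"的
def generate_balanced(max_depth=4):
    if max_depth == 0 or random.random() < 0.3:
        return ""
    return "(" + generate_balanced(max_depth - 1) + ")" + generate_balanced(max_depth - 1)

for _ in range(3):
    print(repr(generate_balanced()))   # 例如 '()(())'、'(()())';空串也是平衡的
```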

Can we train a neural net to produce “grammatically correct” parenthesis sequences? There are various ways to handle sequences in neural nets, but let’s use transformer nets, as ChatGPT does. And given a simple transformer net, we can start feeding it grammatically correct parenthesis sequences as training examples. A subtlety (which actually also appears in ChatGPT’s generation of human language) is that in addition to our “content tokens” (here “(” and “)”) we have to include an “End” token, that’s generated to indicate that the output shouldn’t continue any further (i.e. for ChatGPT, that one’s reached the “end of the story”).

我们能否训练神经网络生成“语法正确”的括号序列?在神经网络中有多种处理序列的方法,但让我们像 ChatGPT 一样使用 transformer 网络。给定一个简单的 transformer 网络,我们可以开始把语法正确的括号序列作为训练示例喂给它。一个微妙之处(实际上在 ChatGPT 生成人类语言时也会出现)是:除了“内容标记”(此处为“(”和“)”)之外,我们还必须包含一个“结束(End)”标记,生成它就表示输出不应再继续下去(对 ChatGPT 而言,就是已经到了“故事的结尾”)。
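
注:下面用几行 Python 说明“在内容标记之外加一个 End 标记”大致是什么意思(词表和编号纯属示意,并非 ChatGPT 实际使用的分词方案):

```python
# 这个小语言的词表只有三个标记:两个"内容标记"和一个"结束"标记
vocab = {"(": 0, ")": 1, "End": 2}

def encode(seq):
    """把括号序列转成整数 id 序列,并在末尾追加 End 标记。"""
    return [vocab[ch] for ch in seq] + [vocab["End"]]

print(encode("(())"))   # [0, 0, 1, 1, 2]
```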

If we set up a transformer net with just one attention block with 8 heads and feature vectors of length 128 (ChatGPT also uses feature vectors of length 128, but has 96 attention blocks, each with 96 heads) then it doesn’t seem possible to get it to learn much about parenthesis language. But with 2 attention heads, the learning process seems to converge—at least after 10 million or so examples have been given (and, as is common with transformer nets, showing yet more examples just seems to degrade its performance).

如果我们建立一个只有一个注意力块的 transformer 网络,它有 8 个注意力头、长度为 128 的特征向量(ChatGPT 也使用长度为 128 的特征向量,但有 96 个注意力块,每个块有 96 个头),那么它似乎学不到多少关于括号语言的东西。但用 2 个注意力头,学习过程似乎就收敛了——至少在给出了 1000 万左右的例子之后(而且,正如 transformer 网络中常见的那样,继续展示更多的例子似乎只会降低它的性能)。
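
注:作为参考,下面用 PyTorch 勾勒“单个注意力块、8 个头、特征向量长度 128”的小型 transformer 大概长什么样。这只是按上述描述拼出来的草稿(类名、超参数等均为假设;原文的实验用的是 Wolfram 语言,并非这段代码):

```python
import torch
import torch.nn as nn

class TinyParenTransformer(nn.Module):
    """极小的 transformer:嵌入维度 128、1 个注意力块、8 个注意力头,
    词表只有 "("、")"、"End" 三个标记,输出每个位置上下一个标记的 logits。"""
    def __init__(self, vocab_size=3, d_model=128, n_heads=8, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                                batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                       # ids: (batch, seq_len)
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.tok(ids) + self.pos(positions)
        # 因果掩码:每个位置只能注意到它前面的标记
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.out(self.block(x, src_mask=mask))

model = TinyParenTransformer()
logits = model(torch.tensor([[0, 0, 1, 1]]))      # 输入编码后的 "(())"
print(logits.shape)                               # torch.Size([1, 4, 3])
```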

So with this network, we can do the analog of what ChatGPT does, and ask for probabilities for what the next token should be—in a parenthesis sequence:

因此,通过这个网络,我们可以模拟 ChatGPT 所做的事情,并在括号序列中询问下一个标记应该是什么的概率:

And in the first case, the network is “pretty sure” that the sequence can’t end here—which is good, because if it did, the parentheses would be left unbalanced. In the second case, however, it “correctly recognizes” that the sequence can end here, though it also “points out” that it’s possible to “start again”, putting down a “(”, presumably with a “)” to follow. But, oops, even with its 400,000 or so laboriously trained weights, it says there’s a 15% probability to have “)” as the next token—which isn’t right, because that would necessarily lead to an unbalanced parenthesis.

在第一种情况下,网络“非常确定”序列不能在这里结束——这很好,因为如果在这里结束,括号就会不平衡。然而,在第二种情况下,它“正确地识别出”序列可以在这里结束,尽管它也“指出”可以“重新开始”——再放下一个“(”,之后大概还会跟一个“)”。但糟糕的是,即使有 400,000 个左右经过艰苦训练的权重,它仍然说有 15% 的概率把“)”作为下一个标记——这是不对的,因为那必然会导致括号不平衡。
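
注:这里所说的“下一个标记的概率”,本质上就是对网络输出的 logits 做 softmax。下面的数值纯属编造,只用来示意这种从 logits 到概率分布的换算:

```python
import math

# 假设网络对三个标记给出了这些 logits(数值仅为示意)
logits = {"(": 2.1, ")": 0.3, "End": -1.5}

# softmax:先取指数再归一化,得到下一个标记的概率分布
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}
print({tok: round(p, 2) for tok, p in probs.items()})
# 大约是 {'(': 0.84, ')': 0.14, 'End': 0.02}
```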

Here’s what we get if we ask the network for the highest-probability completions for progressively longer sequences of (’s:

对于逐渐变长的“(”序列,如果我们要求网络给出最高概率的补全,会得到以下结果:

And, yes, up to a certain length the network does just fine. But then it starts failing. It’s a pretty typical kind of thing to see in a “precise” situation like this with a neural net (or with machine learning in general). Cases that a human “can solve in a glance” the neural net can solve too. But cases that require doing something “more algorithmic” (e.g. explicitly counting parentheses to see if they’re closed) the neural net tends to somehow be “too computationally shallow” to reliably do. (By the way, even the full current ChatGPT has a hard time correctly matching parentheses in long sequences.)

而且,是的,在一定长度内,神经网络都工作得很好。但随后它就开始失败了。对神经网络(或一般的机器学习)来说,在这类要求“精确”的场景中,这是非常典型的现象。人类“一眼就能解决”的情况,神经网络也能解决。但是,需要做一些“更算法化”的事情(例如显式地数括号,看它们是否闭合)的情况,神经网络往往在计算上“太浅”,无法可靠地完成。(顺便说一句,即使是当前完整的 ChatGPT,也很难正确匹配长序列中的括号。)
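
注:原文所说“更算法化”的做法,指的就是类似下面这样的显式计数——它在任意长度上都能可靠判断括号是否闭合,而这正是“计算上较浅”的神经网络难以稳定做到的(代码为示意性草稿):

```python
def is_balanced(seq):
    """显式地数括号:深度不能变为负数,最后必须回到零。"""
    depth = 0
    for ch in seq:
        depth += 1 if ch == "(" else -1
        if depth < 0:            # 出现了多余的右括号
            return False
    return depth == 0            # 所有左括号都被闭合

print(is_balanced("(" * 50 + ")" * 50))   # True:长度再大也不会出错
print(is_balanced("(" * 50 + ")" * 49))   # False
```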

So what does this mean for things like ChatGPT and the syntax of a language like English? The parenthesis language is “austere”—and much more of an “algorithmic story”. But in English it’s much more realistic to be able to “guess” what’s grammatically going to fit on the basis of local choices of words and other hints. And, yes, the neural net is much better at this—even though perhaps it might miss some “formally correct” case that, well, humans might miss as well. But the main point is that the fact that there’s an overall syntactic structure to the language—with all the regularity that implies—in a sense limits “how much” the neural net has to learn. And a key “natural-science-like” observation is that the transformer architecture of neural nets like the one in ChatGPT seems to successfully be able to learn the kind of nested-tree-like syntactic structure that seems to exist (at least in some approximation) in all human languages.

那么,这对 ChatGPT 以及英语这类语言的句法意味着什么呢?括号语言是“简朴的”——而且更像是一个“算法故事”。但在英语中,根据局部的用词选择和其他提示来“猜测”语法上什么合适,要现实得多。而且,是的,神经网络在这方面要好得多——尽管它也许会漏掉某些“形式上正确”、但人类同样可能漏掉的情况。但要点是:语言存在整体句法结构这一事实——连同它所隐含的全部规律性——在某种意义上限制了神经网络必须学习“多少”东西。一个关键的“类自然科学”的观察是:像 ChatGPT 中这样的神经网络的 transformer 架构,似乎能够成功学会所有人类语言中似乎都(至少在某种近似意义上)存在的那种嵌套树状句法结构。

Syntax provides one kind of constraint on language. But there are clearly more. A sentence like “Inquisitive electrons eat blue theories for fish” is grammatically correct but isn’t something one would normally expect to say, and wouldn’t be considered a success if ChatGPT generated it—because, well, with the normal meanings for the words in it, it’s basically meaningless.

句法为语言提供了一种约束。但显然还有更多约束。像“好奇的电子为鱼而吃蓝色理论(Inquisitive electrons eat blue theories for fish)”这样的句子在语法上是正确的,但不是人们通常会说的话;如果 ChatGPT 生成了它,也不会被认为是成功的——因为,按这些词的正常含义来理解,这句话基本上没有意义。

But is there a general way to tell if a sentence is meaningful? There’s no traditional overall theory for that. But it’s something that one can think of ChatGPT as having implicitly “developed a theory for” after being trained with billions of (presumably meaningful) sentences from the web, etc.

但是,有没有一种通用的方法来判断一个句子是否有意义?对此并没有传统的整体理论。但我们可以认为,ChatGPT 在用来自网络等处的数十亿个(想必有意义的)句子训练之后,已经隐含地“发展出了一套理论”。

What might this theory be like? Well, there’s one tiny corner that’s basically been known for two millennia, and that’s logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not. Thus, for example, it’s reasonable to say “All X are Y. This is not Y, so it’s not an X” (as in “All fishes are blue. This is not blue, so it’s not a fish.”). And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going (“machine-learning-style”) through lots of examples of rhetoric, so too one can imagine that in the training of ChatGPT it will have been able to “discover syllogistic logic” by looking at lots of text on the web, etc. (And, yes, while one can therefore expect ChatGPT to produce text that contains “correct inferences” based on things like syllogistic logic, it’s a quite different story when it comes to more sophisticated formal logic—and I think one can expect it to fail here for the same kind of reasons it fails in parenthesis matching.)

这个理论可能是什么样的?好吧,有一个小小的角落基本上已经为人所知两千年了,那就是逻辑。当然,就亚里士多德发现它时的三段论形式而言,逻辑基本上是一种表述方式:遵循某些模式的句子是合理的,而其他的则不合理。因此,例如,说“所有的 X 都是 Y。这不是 Y,所以它不是 X”是合理的(如“所有的鱼都是蓝色的。这不是蓝色的,所以它不是鱼。”)。正如人们可以略带异想地设想,亚里士多德是(以“机器学习风格”)从大量修辞实例中发现三段论逻辑的,我们也可以设想,ChatGPT 在训练中通过查看网络上的大量文本等方式,“发现了三段论逻辑”。(而且,是的,虽然我们因此可以期望 ChatGPT 生成包含基于三段论逻辑之类的“正确推论”的文本,但涉及更复杂的形式逻辑时,情况就完全不同了——我认为可以预期它会在这里失败,原因与它在括号匹配上失败的原因相同。)
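
注:下面用几行 Python 把“所有 X 都是 Y;这不是 Y;所以它不是 X”这种三段论模式写成机械的规则,以示意“遵循某种模式的句子是合理的”这一说法(集合的内容是随意假设的例子):

```python
# 用集合的包含关系表示"所有 X 都是 Y"
blue_things = {"sky", "sapphire", "blue whale"}
fishes = {"blue whale"}                      # 假设前提:所有的鱼都是蓝色的

def conclude(item, xs, ys):
    """三段论:已知所有 X 都是 Y;若 item 不是 Y,则可推出 item 不是 X。"""
    assert xs <= ys, "前提不成立:并非所有 X 都是 Y"
    if item not in ys:
        return "不是 X"        # 逆否推理成立
    return "无法断定"          # item 是 Y 时,这条规则给不出结论

print(conclude("rock", fishes, blue_things))         # 不是 X:它不是蓝色的,所以不是鱼
print(conclude("blue whale", fishes, blue_things))   # 无法断定
```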

But beyond the narrow example of logic, what can be said about how to systematically construct (or recognize) even plausibly meaningful text? Yes, there are things like Mad Libs that use very specific “phrasal templates”. But somehow ChatGPT implicitly has a much more general way to do it. And perhaps there’s nothing to be said about how it can be done beyond “somehow it happens when you have 175 billion neural net weights”. But I strongly suspect that there’s a much simpler and stronger story.

但是,除了逻辑这个狭窄的例子之外,关于如何系统地构建(或识别)哪怕只是貌似有意义的文本,还能说些什么呢?是的,有像 Mad Libs 这样使用非常具体的“短语模板”的东西。但不知何故,ChatGPT 隐含地拥有一种更通用的方法来做到这一点。也许除了“当你有 1750 亿个神经网络权重时,它就会以某种方式发生”之外,对于它是如何做到的没有什么可说的。但我强烈猜测,背后还有一个简单得多、也强得多的故事。

Meaning Space and Semantic Laws of Motion

意义空间和语义运动定律


We discussed above that inside ChatGPT any piece of text is effectively represented by an array of numbers that we can think of as coordinates of a point in some kind of “linguistic feature space”. So when ChatGPT continues a piece of text this corresponds to tracing out a trajectory in linguistic feature space. But now we can ask what makes this trajectory correspond to text we consider meaningful. And might there perhaps be some kind of “semantic laws of motion” that define—or at least constrain—how points in linguistic feature space can move around while preserving “meaningfulness”?

我们在上面讨论过,在 ChatGPT 中,任何一段文本都有效地由一组数字表示,我们可以将其视为某种“语言特征空间”中的一个点的坐标。因此,当 ChatGPT 继续一段文本时,这对应于在语言特征空间中描绘出一条轨迹。但是现在我们可以问是什么让这个轨迹对应于我们认为有意义的文本。是否可能存在某种“语义运动定律”来定义(或至少限制)语言特征空间中的点如何在四处移动的同时保持“意义”?
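
注:“一段文本对应特征空间中的一条轨迹”这句话,可以用下面的 Python 草稿来直观理解。这里的嵌入向量是随机生成的占位数据,维度也被压到 3 维,仅作示意(ChatGPT 内部真实表示的维度要高得多):

```python
import numpy as np

rng = np.random.default_rng(0)
# 纯占位:给每个词随机指定一个 3 维"嵌入向量"
embedding = {w: rng.normal(size=3)
             for w in ["The", "best", "thing", "about", "AI", "is", "its", "ability"]}

text = ["The", "best", "thing", "about", "AI", "is"]
trajectory = np.array([embedding[w] for w in text])   # 每一行是轨迹上的一个点
print(trajectory.shape)                               # (6, 3)
print(np.linalg.norm(np.diff(trajectory, axis=0), axis=1))  # 相邻两点之间的"步长"
```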

So what is this linguistic feature space like? Here’s an example of how single words (here, common nouns) might get laid out if we project such a feature space down to 2D:

那么这个语言特征空间是什么样的呢?下面是一个例子,说明如果我们将这样的特征空间投影到 2D,单个词(这里是普通名词)可能会这样布局:
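
注:把高维特征空间“投影到 2D”通常可以用类似 PCA 的降维来做。下面是一个示意草稿,其中的词向量是随机占位数据,实际应替换为训练好的词嵌入:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
words = ["alligator", "crocodile", "apple", "pear", "dog", "cat"]
vectors = rng.normal(size=(len(words), 8))        # 占位:8 维"词向量"

xy = PCA(n_components=2).fit_transform(vectors)   # 投影到 2 维平面
for w, (x, y) in zip(words, xy):
    print(f"{w:10s} -> ({x:+.2f}, {y:+.2f})")
```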

We saw another example above based on words representing plants and animals. But the point in both cases is that “semantically similar words” are placed nearby.

我们在上面看到了另一个基于代表植物和动物的词的例子。但这两种情况的要点是“语义相似的词”被放置在附近。

As another example, here’s how words corresponding to different parts of speech get laid out:

再举一个例子,下面是对应于不同词性的词是怎么排列的:

Of course, a given word doesn’t in general just have “one meaning” (or necessarily correspond to just one part of speech). And by looking at how sentences containing a word lay out in feature space, one can often “tease apart” different meanings—as in the example here for the word “crane” (bird or machine?):

当然,一个给定的词通常并不只有“一个意思”(或者必然只对应一个词性)。通过查看包含一个词的句子如何在特征空间中布局,人们通常可以“梳理”不同的含义——就像这里的例子中的单词“crane”(鸟还是机器?):

OK, so it’s at least plausible that we can think of this feature space as placing “words nearby in meaning” close in this space. But what kind of additional structure can we identify in this space? Is there for example some kind of notion of “parallel transport” that would reflect “flatness” in the space? One way to get a handle on that is to look at analogies:

好的,那么至少可以合理地认为,这个特征空间把“含义相近的词”放在空间中相近的位置。但我们还能在这个空间中识别出什么样的附加结构呢?例如,是否存在某种“平行移动(parallel transport)”的概念,可以反映这个空间的“平坦性”?着手研究这一点的一种方法是看类比:
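
注:“看类比”最常见的形式是词向量的差值运算(例如 king − man + woman ≈ queen)。下面用手工编造的 2 维玩具向量示意这种“近似平行”的结构;在真实嵌入中它只是近似成立,并非处处可见:

```python
import numpy as np

# 手工编造的玩具向量,仅用来说明"类比对应于近似平行的向量差"
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.1]),
    "queen": np.array([3.0, 1.1]),
}

candidate = emb["king"] - emb["man"] + emb["woman"]   # 把 man→woman 的差"平移"到 king 上
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - candidate))
print(nearest)   # queen
```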

And, yes, even when we project down to 2D, there’s often at least a “hint of flatness”, though it’s certainly not universally seen.

而且,即使我们向下投影到 2D,通常至少会有“平坦的迹象”,尽管它肯定不是普遍存在的。

So what about trajectories? We can look at the trajectory that a prompt for ChatGPT follows in feature space—and then we can see how ChatGPT continues that:

那么轨迹呢?我们可以查看给 ChatGPT 的提示(prompt)在特征空间中走过的轨迹——然后看看 ChatGPT 是如何继续这条轨迹的:

There’s certainly no “geometrically obvious” law of motion here. And that’s not at all surprising; we fully expect this to be a considerably more complicated story. And, for example, it’s far from obvious that even if there is a “semantic law of motion” to be found, what kind of embedding (or, in effect, what “variables”) it’ll most naturally be stated in.

这里肯定看不出“几何上显而易见”的运动定律。这一点也不奇怪;我们完全预计这会是一个复杂得多的故事。而且,举例来说,即使真的存在一条可以找到的“语义运动定律”,它最自然地应该用哪种嵌入(或者实际上是哪些“变量”)来表述,也远非显而易见。

In the picture above, we’re showing several steps in the “trajectory”—where at each step we’re picking the word that ChatGPT considers the most probable (the “zero temperature” case). But we can also ask what words can “come next” with what probabilities at a given point:

在上图中,我们展示了“轨迹”中的几个步骤——在每个步骤中,我们都选择了 ChatGPT 认为最有可能的词(“零温度”情况)。但我们也可以问,在给定的某个点,“下一个”出现的词可以是哪些,概率又是多少:
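
注:“零温度”与按概率采样的区别,可以用下面几行 Python 来说明(词和概率都是编造的示例数据;对 logits 除以温度再做 softmax,等价于对概率取 1/T 次幂后重新归一化):

```python
import random

probs = {"learn": 0.35, "predict": 0.25, "make": 0.20, "understand": 0.20}

def pick_next(probs, temperature=0.0):
    """温度为 0 时总是取概率最高的词;温度越高,采样越随机。"""
    if temperature == 0.0:
        return max(probs, key=probs.get)
    weights = {w: p ** (1.0 / temperature) for w, p in probs.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for w, wt in weights.items():
        acc += wt
        if r <= acc:
            return w
    return w          # 数值误差兜底

print(pick_next(probs))                    # 'learn'(零温度)
print(pick_next(probs, temperature=0.8))   # 随机,可能是其他词
```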

And what we see in this case is that there’s a “fan” of high-probability words that seems to go in a more or less definite direction in feature space. What happens if we go further? Here are the successive “fans” that appear as we “move along” the trajectory:

我们在这种情况下看到的是,有一个高概率词的“扇形”似乎在特征空间中或多或少地朝着确定的方向移动。如果我们走得更远会发生什么?以下是我们“沿着”轨迹“移动”时出现的连续“扇形”:

Here’s a 3D representation, going for a total of 40 steps:

这是一个 3D 表示,总共走了 40 步:

And, yes, this seems like a mess—and doesn’t do anything to particularly encourage the idea that one can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside”. But perhaps we’re just looking at the “wrong variables” (or wrong coordinate system) and if only we looked at the right one, we’d immediately see that ChatGPT is doing something “mathematical-physics-simple” like following geodesics. But as of now, we’re not ready to “empirically decode” from its “internal behavior” what ChatGPT has “discovered” about how human language is “put together”.

而且,是的,这看起来确实一团糟——它也并不特别支持这样一种期待:通过实证研究“ChatGPT 内部在做什么”,就能识别出“类似数学物理”的“语义运动定律”。但也许我们只是在看“错误的变量”(或者错误的坐标系),只要换成正确的那个,我们就会立刻看到 ChatGPT 正在做某种“数学物理意义上简单”的事情,比如沿测地线运动。但到目前为止,我们还没有准备好从它的“内部行为”中“经验性地解码”出 ChatGPT 关于人类语言是如何“组合起来”的“发现”。


