Large Language Models (LLMs) are quickly entering all aspects of life, from financial news to global policy. The most famous of these, OpenAI’s ChatGPT, is stirring conversations across social networks, ranging from awe at how it is already substituting for journalists to fears that it poses a threat to news and politics.
Taking a step back, one might ask whether ChatGPT is really so intelligent or only appears to be. We tend to think we can tell when a politician is promising us what we want to hear, regardless of whether they are actually capable of delivering it. We are suspicious when a salesperson talks like an engineer, knowing that they probably do not understand in depth what they are talking about. But many of us seem to fall for the AI sweet talk. At least that is what researchers Ivanova and Mahowald have recently found. They explain that the distinction lies in the difference between formal and functional linguistic competence. Formal linguistic competence – which ChatGPT masters well – is about “knowledge of linguistic rules and patterns,” in other words, about knowing how to produce sentences that appear to be effective communication. Functional linguistic competence, by contrast, is about actually communicating effectively, with language merely the instrument being used.
So is there a more objective measure we can use? Can we know whether ChatGPT really exhibits such amazing skills and intelligence, or whether it just makes us think so? One place to look for an answer could be where we have traditionally looked for similar answers when it comes to people – in assessment as we know it from formal education. For centuries professors have specialised in examining whether a student merely pretends to understand a topic or really knows the matter. They even have their own terminology for these two types of behaviour. In their terms, when students understand a topic only superficially, they have engaged merely in “surface learning”. When students are able to reason in depth about a topic, they are considered to have engaged in “deep learning”. This last term is not to be confused with the AI notion of deep learning, where “deep” stands for the many layers of nodes (an abstract simplification imitating brain neurons).
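To make the AI sense of “deep” a bit more concrete, here is a minimal sketch in plain Python/NumPy – with made-up layer sizes and random weights, purely for illustration – of an input being passed through a stack of layers; the “depth” of the network is simply how many such layers are stacked.

```python
import numpy as np

def relu(x):
    # A simple non-linearity applied between layers.
    return np.maximum(0, x)

def deep_forward(x, weight_matrices):
    # "Deep" just means the input flows through many stacked layers,
    # each one a crude abstraction of a sheet of interconnected neurons.
    for W in weight_matrices:
        x = relu(W @ x)
    return x

# Toy example: a five-layer "deep" network with random weights.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(5)]
print(deep_forward(rng.standard_normal(8), layers))
```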
In educational research, these types of learning are related to the types of knowledge being learned. In particular, David Perkins in 2008 summarised a framework comprising three types of knowledge, shown in the columns of the table below. The first two relate to Ivanova and Mahowald’s formal and functional competence respectively. When one engages with the first – which Perkins calls possessive knowledge (more popularly known as know-what) – surface learning in the form of memorisation is a sufficient learning strategy. The second – performative knowledge in Perkins’ terms (know-how) – is more demanding, however, and requires at least some degree of deep learning. The third type of knowledge goes beyond language, and that is why it wasn’t captured by Ivanova and Mahowald – it is proactive knowledge (know-where-and-when). Whereas we all have an intuition about know-what and know-how, this third type of knowledge is less commonly known. It concerns our ability to make our own decisions about where and when to exercise our skills and competences. Examples of applying this knowledge in a textual context include choosing what information is relevant, judging when it is timely to share it, and even being able to reason about counterfactual thought experiments. Researchers Li, Yu and Ettinger provide an example of such an experiment: “If cats had liked vegetables, they would be cheaper to keep. Families would feed their cats with…” In the real world the sentence would be completed with (possibly a type of) meat, whereas in the context of the hypothetical (counterfactual) setup it would be vegetables.
| | Possessive | Performative | Proactive |
| --- | --- | --- | --- |
| Conception of knowledge | Absolute & multiple perspectives | Provisional & evidence and reasoning | Personal reasoned perspective |
| Conception of learning | Facts, memorising, applying | Understanding | Seeing things in a different way |
| Approaches to learning | Surface | Strategic, deep | Deep |
| Symptom | Delivery on demand | Performance on demand | Opportunistic deployment |
| Spirit | Utilitarian | Sense-making | Inquiry and creativity |
| Challenge | Hard to retain and apply | Hard to understand | Hard to keep alive and use actively |
Recently, the researcher Ali Borji collected a variety of ChatGPT failure cases. Among his examples are failures of reasoning, logic and mathematics, factual errors, bias, and language errors. He also considers self-awareness and humour. Let’s try to reflect on these from the perspective of Perkins’ three categories of knowledge.
Starting with possessive knowledge, this is where ChatGPT’s strength lies. Despite the mistakes Borji notes under factual errors, we tend to see ChatGPT as a mechanism that is very aware of the information available on the internet. This is why Microsoft has recently chosen it as the backend of their Bing search engine.
Moving on to performative knowledge, this is where issues of reasoning, logical thinking and the generation of software code come in. ChatGPT has been shown to handle some basic examples of these. Yet, when confronted with classical riddles, it falls for them like a little child. Even worse, unlike a child, it also fails to pick up on the typical hints we give children to help them out, as exemplified in the image below.
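For readers who want to try this kind of probe themselves, here is a minimal sketch using the OpenAI Python client. The model name, the riddle (a classic trick question standing in for the one in the image) and the wording of the hint are illustrative assumptions, not the exact setup behind the screenshot.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# A stand-in riddle; the one from the image is not reproduced here.
history = [
    {"role": "user",
     "content": "A bat and a ball cost 1.10 in total. The bat costs "
                "1.00 more than the ball. How much does the ball cost?"},
]
first = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
print(first.choices[0].message.content)

# The kind of hint we would offer a child, added as a follow-up turn.
history += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user",
     "content": "Careful: if the ball cost 0.10, the bat would cost 1.10, "
                "and together they would cost 1.20."},
]
second = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
print(second.choices[0].message.content)
```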
Arriving at a consideration of proactive knowledge, the examples from Borji that fall here are self-awareness, wit and humour. One needs to be aware of the backgrounds of the participants in a conversation – including one’s own – to be able to decide when a joke would be appreciated, or accepted at all. Probably acknowledging this, even the creators of ChatGPT have trained it to shun these topics with responses such as “As an AI language model, I do not have the ability to experience emotions or subjective experiences such as a sense of humor”, or simply “No, I am not self-aware.” But we don’t need to get into matters as subjective as humour or self-awareness. The fact that ChatGPT’s answers are so encyclopedic and devoid of personality is already strange in the context of a chat.
Another type of proactive knowledge is counterfactual reasoning. In the previously mentioned work of Li and colleagues, they observe that GPT-3 (the model behind ChatGPT) does perform better on the tested counterfactual tasks, yet they indicate that a possible reason “is that [chatbot] explanations… involve volume of exposure.” In other words, it is very probable that the models have been shown so many similar tasks that they have learned how to handle sentence combinations containing, for example, “if/had” or “because.” As a consequence, the lexical patterns alone allow them to memorise the counterfactual response as well.
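As a rough illustration of what such a probe looks like (not the authors’ actual setup), one can compare a model’s continuation of a plain factual prompt with its continuation of the counterfactual prompt from the cat example above. The sketch below uses the small, openly available GPT-2 model via the Hugging Face transformers library as a stand-in for GPT-3, which is only accessible through a paid API.

```python
from transformers import pipeline

# GPT-2 is only a small, freely downloadable stand-in for GPT-3;
# the structure of the probe is what matters here.
generator = pipeline("text-generation", model="gpt2")

factual = "Cats like meat. Families feed their cats with"
counterfactual = ("If cats had liked vegetables, they would be cheaper to keep. "
                  "Families would feed their cats with")

for prompt in (factual, counterfactual):
    result = generator(prompt, max_new_tokens=5, do_sample=False)
    print(result[0]["generated_text"])
```

A model that leans purely on lexical cues such as “if/had” may still continue the second prompt with “vegetables” without doing any genuine counterfactual reasoning – which is exactly the caveat raised by Li and colleagues.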
All this goes to show that if we want to identify whether the voice responding to us on the phone is a Large Language Model, there are relatively simple ways of finding out. However, when it comes to a written text that could have been vetted by a human editor, such validation remains an unresolved challenge. This last problem is a risk many scientists have been warning us about.