I recently gave a talk for a general audience on the current state of large language models such as ChatGPT. These models are substantial projects that are already greatly disrupting many jobs and industries. I was honestly impressed by the answers to the programming and language questions I asked: the system could state what Python code was doing, and could write correct code.
However, hallucination remains a huge problem. Hallucination, in this case, is the common name for the system generating very confident and convincing wrong answers. Here is a fun example I found recently, from a fresh session with no prior instructions or interactions. It was the first of several prompts I tried, all of which gave fundamentally incorrect answers (and most of those answers were longer and even worse than this example).
This result is from the very demonstration (or “demo”) system that is establishing OpenAI’s reputation. If the system can not reliably answer this style of question, it would be well served by refusing to write on these topics.
Superficially, the text looks like a step-by-step answer. However, the document is actually a bunch of innocuous, but unused, steps that seem to be buying time before sneaking in a quick unsupported claim of the result. There is no claim in the text implying a finite number of primes (which would be needed to drive the claimed contradiction), and the series is not finding primes (it is built up of their reciprocals). This non sequitur is not a proof, and the rest of the steps do nothing.
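For reference (this statement is not part of the ChatGPT transcript), the question at issue is the classical result, going back to Euler, that the series of prime reciprocals diverges:

$$\sum_{p\ \mathrm{prime}} \frac{1}{p} \;=\; \frac{1}{2} + \frac{1}{3} + \frac{1}{5} + \frac{1}{7} + \frac{1}{11} + \cdots \;=\; \infty ,$$

with the partial sums over primes $p \le x$ growing roughly like $\ln \ln x$ (Mertens’ second theorem). A correct argument has to establish this divergence, or derive a contradiction from assuming convergence (or, in the Euclid-style variant, from assuming finitely many primes); as noted above, the generated text does neither.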
It is physically painful to work through this example (and some later examples), as they are full of malformations of familiar proof methods. One has to take extra care to read what is actually on the page and not what one expects to be on the page. Then one has to check whether some small variation of the argument would make sense, and it does not. In fact, the writing is eerily similar to the crank writing collected by Augustus De Morgan in A Budget of Paradoxes or by Underwood Dudley in A Budget of Trisections. The “not even wrong” pattern really feels like the result of a cut-up technique, which is already known to produce interesting texts.
By contrast, there is an entire excellent Wikipedia page dedicated to exactly this question: “Divergence of the sum of the reciprocals of the primes”. This is a case where “mere” information retrieval, a mature field, outperforms large language model generation (including, possibly, Retrieval Augmented Generation (RAG), if ChatGPT 3.5 is using related techniques).
It may seem unfair to judge based on “a few wrong answers.” However, we do need to down-weight “the great number of correct answers” by what fraction of them are cribbed and copied from other sources (Wikipedia, books, blogs, human taggers, and many more). That said, I’ll commit the injustice of moving from a single specific example to some wild general claims and complaints.
Current industrial AI (artificial intelligence) is moving very fast on both public and private “proof by demo.” However, the demonstrations may not mean what the providers claim, as their systems include huge captive information retrieval systems and human taggers.
The “takeover by demo” scheme ruthlessly exploits a fallacious interpretation of the computer science idea that “checking is thought to be easier than generation”, i.e. the highly technical statement that NP (non-deterministic computation, or generation) may be harder than P (polynomial time computation, or checking). The fallacy is: if the system is generating convincing texts it must be right, since editing or checking is easier than generation and so can be added to the generation process or service itself.
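As a toy illustration of the checking-versus-generation asymmetry this framing leans on (my own sketch, not anything from the ChatGPT transcript), consider verifying versus producing an integer factorization in Python: the check is a single multiplication, while naive generation is a long search.

```python
# Toy illustration (my own sketch): checking a claimed answer is cheap,
# generating the answer can be expensive.

def check_factorization(n, factors):
    """Cheap check: multiply the claimed factors back together."""
    product = 1
    for f in factors:
        product *= f
    return product == n

def find_factorization(n):
    """Expensive generation (for large n): naive trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

# Product of two large primes (2**31 - 1 and 2**61 - 1, both Mersenne primes).
n = (2**31 - 1) * (2**61 - 1)
print(check_factorization(n, [2**31 - 1, 2**61 - 1]))  # fast: True
# find_factorization(n) would need on the order of 2**31 trial divisions.
# The point: the cheapness of the check says nothing about whether anyone
# actually ran such a check on the generated text.
```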
However, in my opinion, the relevant analogy is that prevention is better than cure. Or: it is more palatable to keep the flies out of the soup than to try to remove them later. People accept authoritative text as correct, as they are trained to do so. Large Language Models (LLMs) excel at generating authoritative-looking text; but, as we saw, this is an aspect now independent of correctness.
This “clear plastic report cover” AI strategy is taking over the world and drowning it in high-quality spam. The scheme is auto-catalytic, or positive-feedback: money looks authoritative, and looking authoritative attracts more money. This starves out actual creators, as it is more expensive to create than to copy or transcribe. It is no great surprise that the next step being discussed is regulatory capture, or locking out those who are behind the current leaders (see Yann LeCun’s comments on regulatory capture).
A screencast of a recent lecture I prepared on LLMs in industry is available here:
Some of the frustration is: LLM AIs are not just benefiting from a Gell-Mann Amnesia effect, their purveyors are not above adding that effect to their toolset.
If these AIs could do all that is claimed, they could then also filter out their own wrong answers (which they do not). And many outsiders assume the AIs have such a filter (which, again, they do not).
I’ve worked through one of the great proofs that the sum of the reciprocals of primes diverges here: https://github.com/WinVector/Examples/blob/main/sum_of_primes/sum_of_reciprocals_of_primes.md .
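For readers who do not want to click through, a compressed version of the standard Euler-style argument (a sketch in my own words, not necessarily the exact write-up at that link) runs as follows. Every integer $n \le N$ factors into primes $p \le N$, so expanding each geometric series gives

$$\sum_{n=1}^{N} \frac{1}{n} \;\le\; \prod_{p \le N} \left(1 - \frac{1}{p}\right)^{-1}.$$

Taking logarithms and using $-\log(1 - x) \le x + x^2$ for $0 \le x \le \tfrac{1}{2}$ gives

$$\log\!\left(\sum_{n=1}^{N} \frac{1}{n}\right) \;\le\; \sum_{p \le N} \frac{1}{p} \;+\; \sum_{p \le N} \frac{1}{p^2}.$$

Since $\sum_{n \le N} 1/n \ge \log N$ and $\sum_p 1/p^2$ converges, the left side grows like $\log \log N$ while the second sum on the right stays bounded, so $\sum_{p \le N} 1/p$ must diverge.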