This reminds me a lot of AI-based self-driving cars, which have gained quite a lot of trust already but still have a long way to go. AI also makes decisions in many other places (e.g. fraud detection). I'd say ChatGPT is far behind that in terms of trust in anything resembling a regulated environment, but it might show a possible path for where such models could go in future, if trust increases.
Justifiable trust in a method or source of truth seems to come from 2 places:
Reliable results
If following the same method consistently produces results we've verified as true, this would increase trust.
ChatGPT says a lot that's right, but it has frequently been shown to just make stuff up, so it's not particularly reliable.
You could also question the type of reliability. If someone comes in with a common cold, you don't really want to diagnose them with cancer, or vice versa. You wouldn't expect a competent doctor to make such mistakes, but rather to make mistakes where there's more ambiguity and uncertainty: where multiple conditions match the symptoms (which arguably wouldn't be a mistake), where they miss something, or where it's some rare disease. ChatGPT's mistakes might be more arbitrary.
Understanding
Understanding how a method works and why and how individual outputs are generated could also increase trust.
As far as ChatGPT is concerned, we understand the low-level maths, and we understand what it does on a high level, but we can't explain why individual outputs are generated (beyond "sticking this input into these equations gives this output").
Unless something significant changes here, we'd probably need to rely primarily on reliable results, rather than understanding.
If anything, the understanding we have might decrease our trust, as we know that it just chains words together based on what it's seen in the past, with little concern for, or ability to evaluate, whether the result makes sense.
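To make "chains words together" a bit more concrete, here's a toy sketch of my own (nowhere near how ChatGPT actually works internally, which is a far larger transformer network): a tiny bigram model that picks each next word purely from which words followed it in its training text. It has no notion of whether the output is true, only of what tended to come next.

    import random
    from collections import defaultdict

    # Tiny "training corpus"; a real model is trained on a huge chunk of the internet.
    corpus = (
        "the patient has a cough and a mild fever . "
        "the patient has a rash and a fever . "
        "the doctor ordered a test ."
    ).split()

    # Record which words followed which in the training text.
    followers = defaultdict(list)
    for current_word, next_word in zip(corpus, corpus[1:]):
        followers[current_word].append(next_word)

    def generate(start, length=10):
        words = [start]
        for _ in range(length):
            options = followers.get(words[-1])
            if not options:
                break
            # Pick the next word from "what tended to come next";
            # no check for truth or coherence anywhere.
            words.append(random.choice(options))
        return " ".join(words)

    print(generate("the"))

Running it a few times gives fluent-looking but sometimes nonsensical strings (e.g. "the doctor ordered a cough and a test"), which is roughly the failure mode at issue, just on a vastly smaller scale. The real model is much more sophisticated at the prediction step, but the generation loop (predict the next word, append it, repeat) is conceptually similar, and truthfulness never enters into it.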
Side note: explainability is a big topic in AI, although the more complex models tend to be the hardest to explain.
Also, the medical domain has a particularly high threshold for trust, as wrong decisions can very directly lead to people's deaths, and it doesn't have the benefit that, say, engineering has, where every individual design can be reviewed and extensively tested before anyone's life depends on it working (never mind failsafes).
Other domains may very well meet the threshold for trust before the medical domain would.
I could also see something like ChatGPT being used by, or under the supervision of, a doctor, to aid diagnosis rather than to diagnose by itself. It might also be used (mostly by private individuals, probably) in cases where doctors aren't available (although something manually curated, like WebMD, is probably better for that, even if it has its own problems).