September 5, 2024

Do LLMs store what they’re trained on? A reason to think not.

If you ask chatGPT to give you the first paragraph of Moby-Dick, it gets it exactly right. This and other such examples are sometimes taken as evidence that it has kept a copy of that book in memory, a potential copyright violation. The makers of these Large Language Models say they don’t, or at least I haven’t come across any who cop to it.

My understanding is that they analyze the texts they’re trained on looking for statistical relationships, keep the stats and ditch the texts. (In this, they’re similar to how search engines compile their indexes.)
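Here’s a toy sketch of that idea (my own illustration, nothing like a real transformer): a model that keeps only bigram counts, i.e., which word tends to follow which, can still regurgitate a passage verbatim if the passage is quoted often enough in the training data, even though the texts themselves were thrown away. The corpus below is entirely made up.

```python
from collections import defaultdict, Counter

def train_bigrams(corpus_texts):
    # Count which word follows which across all texts, then keep
    # only the counts; the texts themselves are discarded.
    counts = defaultdict(Counter)
    for text in corpus_texts:
        words = text.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start, n_words):
    # Greedily emit the most frequent continuation at each step.
    out = [start]
    for _ in range(n_words - 1):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# A famous opening that appears verbatim in many "training documents";
# the documents and counts here are invented for the sketch.
famous = "Call me Ishmael Some years ago never mind how long precisely"
corpus = [famous] * 50 + ["Call me later", "Some years pass quickly"]

stats = train_bigrams(corpus)
print(generate(stats, "Call", 11))
# prints the famous line verbatim, reconstructed from counts alone
```

The oft-quoted line wins every statistical tie, so it comes back word-perfect with no copy of any text retained.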

But my understanding is quite limited. So, I poked around.

PROMPT: What is the first paragraph of Moby-Dick, verbatim?

CHATGPT:

Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can

Perfect.

PROMPT: What is the first paragraph of David Weinberger’s book Everything Is Miscellaneous?

CHATGPT:

The first paragraph of Everything is Miscellaneous by David Weinberger reads:

“If you keep your music in iTunes or your photos in Flickr, you’re using an application that lets you put a piece of information in more than one category at a time. That’s weird. In the real world—where our species evolved—everything has its place. Our books are on one shelf at a time, our plates are in one cupboard, and our shoes are by the door or in the closet. That’s the way it’s supposed to be because that’s the way the world is. Except, of course, it’s not.”

This opening sets the stage for Weinberger’s exploration of how digital organization challenges traditional, physical-world categorization.

That’s not a bad intro to the book’s themes: it picks up some of the examples in the book, and it is certainly reminiscent of my style of writing. But it has nothing to do with the first paragraph of the first chapter:

Before the Web, browsing was usually a polite way of telling a salesperson to buzz off. “May I help you?”, a salesperson asks. “I’m just browsing,” you reply with a little smile. With that word, a customer declares a lack of commitment. With that smile, she asserts that she’s within her rights: “Just try and stop me, salesboy!”

I’m slightly embarrassed by that, but the point is that the real first paragraph doesn’t talk about anything mentioned in chatGPT’s attempt to quote it.

Further, while chatGPT can quote the opening of Alice’s Adventures in Wonderland, if you ask it to quote the opening of Carroll’s lesser-known “Sylvie and Bruno”, it responds:

One winter night, at a little town called Elveston, a telegram was received by the Sub-Warden, who sat sipping his coffee, and wondering lazily whether it was worth while to finish his cigar, or whether he would take a stroll down to the Club, and see if Captain Muriel had returned yet.

Here’s the actual opening:

—and then all the people cheered again, and one man, who was more excited than the rest, flung his hat high into the air, and shouted (as well as I could make out) “Who roar for the Sub-Warden?” Everybody roared, but whether it was for the Sub-Warden, or not, did not clearly appear: some were shouting “Bread!” and some “Taxes!”, but no one seemed to know what it was they really wanted.

The phrase “one winter” doesn’t appear anywhere in the original. Nor does “a little town called Elveston”. 

Finally, if you repeat the prompt, it gives substantially different responses, including this one, which reads as unfortunately smutty these days:

‘Less bread!’ cried the poor. I looked again, and saw it was not bread they had needed, but more head.

It seems to unqualified me that this bolsters the claims of the LLM makers that they don’t keep copies of their training materials (or at least don’t make them accessible to the model in operation), and that the training-time statistical analysis of texts that quote other texts, plus the magic of transformer technology, is sufficient to explain how they can provide verbatim quotes of oft-cited works.

Am I wrong? It’s highly likely.


Categories: ai, copyright, machine learning, tech Tagged with: ai • chatgpt • copyright • llm • ml • scale Date: September 5th, 2024 dw


March 27, 2024

chatGPT knows its letters

A friend on a mailing list pointed to the massive palindromic story “2002” by Nick Montfort and William Gillespie, posted on Feb. 20, 2002, AKA 20-02-2002. It’s 2002 words long. And while it’s not exactly a great read, it is an amazing accomplishment.

So, I asked chatGPT to confirm that “2002” is an accurate palindrome. It confirmed it.

Then I asked it to write a palindromic story, saying it could be quite short. It came up with:

"Eva, can I see bees in a cave?"

When I asked it to write the next palindromic sentence, it came up with:

“Eva, can I see bees in a cave? Eva, can I see bees in a cave?”

which it acknowledged is maybe too simple a way out:

This repetition isn’t the most creative continuation, but it maintains the palindromic structure. Creating a palindromic story with varied content while preserving the integrity of the palindrome is an intricate endeavor, usually resulting in very simple and repetitive narratives. The key challenge lies in the reversal of letters and sounds, which typically leads to mirrored sentences rather than continued narrative development.

This was interesting to me, because the Large Language Models that produce chat AI applications like chatGPT turn the words they’re being trained on into tokens, which are unique numeric labels for each word, but also for strings of words, parts of words, punctuation, etc. So, when chatGPT is processing a prompt that contains the word “bees” it recognizes that as a particular token number. But you need to know about the actual letters, not just the token number, to construct a palindrome. So what’s going on?
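To make the puzzle concrete, here’s a toy word-level tokenizer (my own sketch; real models use subword schemes like BPE, with far larger vocabularies, and these token ids are invented for the illustration):

```python
# A toy word-level tokenizer: text becomes opaque numeric ids.
# The vocabulary and ids below are made up for this sketch.
vocab = {"Eva,": 101, "can": 102, "I": 103, "see": 104,
         "bees": 105, "in": 106, "a": 107, "cave?": 108}

def tokenize(text):
    return [vocab[w] for w in text.split()]

ids = tokenize("Eva, can I see bees in a cave?")
print(ids)  # [101, 102, 103, 104, 105, 106, 107, 108]
# Nothing about the number 105 says that "bees" is spelled b-e-e-s;
# any spelling knowledge has to come from patterns in the training data.
```

Once the text is a list of numbers like these, the letters are out of the picture, which is why palindrome-checking seems like it should be beyond a token-based model.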

In this case, there’s a way around this seeming limitation (which we’ll see in a minute isn’t actually a limitation): “Eva, can I see bees in a cave?” exists on the Web as an example of a palindrome. So perhaps chatGPT didn’t construct it out of nothing.

But when given:

“mom bird rib taco cat taco cat bird rib mom”

it identifies it as a palindrome even though that phrase is a sort of googleWhack: it gets zero hits on Google.

When given the entire “2002” text and asked if it’s a palindrome, chatGPT confidently says that it is.

It also correctly says:

“mom bird rib tackle cat taco cat bird rib mom”

is not a palindrome, and points to the exact error.
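Whatever chatGPT is actually doing internally, the check it appears to be performing can be sketched in a few lines (my own illustration, not its mechanism): keep only the letters, lowercase them, and compare from both ends, reporting where the first mismatch occurs.

```python
def palindrome_check(text):
    # Keep letters and digits only, lowercased; compare from both ends.
    # Returns (True, None), or (False, index of first mismatched letter).
    letters = [c.lower() for c in text if c.isalnum()]
    n = len(letters)
    for i in range(n // 2):
        if letters[i] != letters[n - 1 - i]:
            return False, i
    return True, None

print(palindrome_check("mom bird rib taco cat taco cat bird rib mom"))
# (True, None)
print(palindrome_check("mom bird rib tackle cat taco cat bird rib mom"))
# (False, 13) -- the "k" in "tackle" is the first letter that breaks it
```

That the model can do something equivalent over token ids is the surprising part.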

So, when I asked chatGPT how it recognizes patterns in the letters that compose a word, it replied:

Even though my processing of text is based on tokens, these tokens include representations for individual letters, words, and punctuation marks, which allows me to analyze and generate text based on specific linguistic rules and patterns, including identifying palindromic structures.

A good answer. Or an excellent hallucination.

And since it was my friend Drew’s grandma who pointed out “2002” to her, I can conclude:

An answer, Drew’s Nana!

PS: No, my friend isn’t really named Drew. So sue me!

PPS: In case you’re wondering why the formatting of chatGPT’s responses is so inconsistent in this post, it’s because WordPress’ “Gutenberg” block editor is an enshittification of their plain old editor. But chatGPT guided me to the “classic” block in the Block editor, and I’ll be making a more thorough switch back to classic. I’m just too annoyed to fix the formatting on this particular post.


Categories: ai, blogs, everyday chaos, machine learning, tech Tagged with: ai • llm • ml Date: March 27th, 2024 dw



Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.

Joho the Blog uses WordPress blogging software.
Thank you, WordPress!