<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://www.rssboard.org/rss-0.91.dtd">
<rss version="0.91">
<channel>
<title>Internet and e-mail policy and practice</title>
<link>https://jl.ly</link>
<description>John Levine's web log</description>
<language>en</language>
<item>
<title>Do you need a license to look for spam?</title>
<link>https://jl.ly/2024/01/22#spamfink</link>
<description><p>Jay Fink had an interesting little business.
If you lived in California, you could give him access to your email account,
he'd look through the spam folder for spam that appeared to violate the
state anti-spam law, and give you a spreadsheet and a file of PDFs.
You could then sue the spammers, and if you won,
you'd give Fink part of the money as his fee.</p>
<p>While the federal CAN-SPAM law largely preempts state laws, it lets states
add their own penalties for fraudulent or misleading spam.
California is one of the few states with a usable law, and one of the
few that lets spam recipients sue in small claims court.
The spammers tend to pay to settle rather than going to court (because
they are pretty sure they'd lose) so this was a way to make life
more difficult for the spammers, paid for by the spammers.</p>
<p>Last July, the state of California shut him down, saying that the stuff
he was doing needed a Private Investigator (PI) license.
The license is quite expensive and requires 6,000 hours of training in
a field like arson investigation or insurance adjustment.
Fink thought this was ridiculous, since none of the training would have anything
to do with looking for spam, and the requirements were grossly excessive for what he did.
He sued the state, supported by the Institute for Justice, a libertarian public interest law firm.</p>
<p>Last week the parties filed the first substantive exchange, in which
the state moved to dismiss the case, and Fink's lawyers said not
so fast.
<hr class="seemore"></p>
<p>Fink argued that what he was doing, reading and organizing email, was
speech, and so is protected by the First Amendment.
He also made some due process claims under the Fourteenth Amendment, but
the First Amendment claims are more interesting.</p>
<p>In <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.420952/gov.uscourts.cand.420952.18.0.pdf">its brief</a>,
the state argues that no, it's not speech, and even if it is speech, it's commercial speech.
In a case called <i>Central Hudson</i>, courts held that commercial speech is less protected than other speech.</p>
<p>They also argued that the training he'd get would be useful, e.g., in preparing evidence for a trial
or keeping people's email confidential.
And finally they argued that every kind of work involves speech, so if you believe Fink's claim that
what he's doing is speech, they couldn't require licenses of anyone.</p>
<p>The Institute for Justice filed
<a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.420952/gov.uscourts.cand.420952.20.0.pdf">Fink's reply</a>.
They refute each of the state's arguments in detail.</p>
<p>They start by arguing that the state has misinterpreted many of the cases they depend on.
For example, <i>Central Hudson</i> is essentially about advertising, so while it might allow
them to regulate his ads, which are not at issue here, it's irrelevant to the work he does.
They cite cases showing that what matters for the First Amendment is what you're doing, not if you're paid for it.
They note that many people allow others to read their mail, e.g., assistants to executives, and the only difference
here is that the result might be used in a lawsuit, not a difference that they say the law recognizes.</p>
<p>They dismiss the claim that a license would make him do this job better.
(In a footnote they point out that if the state is worried about confidentiality,
they could require him to have an NDA with his clients, which he probably does anyway.)
And they rather briskly dismiss the argument that every licensed profession involves speech,
noting that a doctor or lawyer does other stuff than speech, e.g., treating patients or representing
people in court, while Fink's work is <em>only</em> speech.</p>
<p>There's more, and they also reiterate how utterly unrelated the training for a PI license is to what
Fink is doing. </p>
<p>I think Fink's argument is more persuasive here.
But since this is a motion to dismiss, all that can happen is that either the judge rules for the
state and the case is over, or for Fink and the case proceeds to the next stage.
Stay tuned.</p>
</description>
</item>
<item>
<title>Three weak copyright suits against OpenAI and one stronger one</title>
<link>https://jl.ly/2024/01/08#openai3</link>
<description><p>In the past few months there have been four similar suits filed in New York against OpenAI and Microsoft.
All four look superficially similar, and all are likely to be heard by the same judge, but one of them
is a lot stronger than the other three.
<hr class="seemore"></p>
<p>The first was filed in September by the
<a href="https://ia804703.us.archive.org/10/items/gov.uscourts.nysd.606655/gov.uscourts.nysd.606655.1.0.pdf">Authors Guild</a> on behalf of several well-known fiction writers, purporting
to be a class action on behalf of every author of a work of fiction that has sold over
5,000 copies.
It lays out in great detail the many books their authors have written, and then
describes in snarky detail how LLMs and GPT work.
Then for each author, they say that one can prompt GPT and get accurate summaries of
the books, and in a few cases it wrote plot outlines of sequels.
While this certainly shows that GPT was trained on their books, the proper response is so what?</p>
<p>There is a separate complaint that the copies of the books they used to train came from
online pirate text archives.
That may well be true, but again, so what?
If OpenAI had bought an ebook copy of each book they used for training, the Guild would still
have the identical complaint.
If they object to the pirated books (which they have every right to do), they need to go
after the pirates.</p>
<p>The Guild, to which I belonged long ago, before they sued Google and lost about book scanning,
has persuaded itself that every word its members write is a priceless jewel, any
use whatsoever needs to be licensed and paid, and fair use basically doesn't exist.</p>
<p>This is, to put it mildly, not what copyright law says.
If you couldn't write summaries, book reviews would be illegal.
If someone used the plot outlines to write and publish sequels, that would be a copyright problem,
but a one-off response to a query, again, so what?
In the recent <i>Andy Warhol</i> case, the Supreme Court ruled that Warhol's art prints
based on a copyrighted photograph weren't a problem, but licensing those prints for
magazine covers in competition with the photograph was.
Two important parts of fair use are whether the use is "transformative", doing something
different from the original, and what effect it has on the market for the original.
In this case, a summary or possible sequel plot is not the same thing as the original book,
and the effect on the market is nonexistent.
I expect this suit will be disposed of quickly in OpenAI's favor.</p>
<p>In November nonfiction writer Julian Sancton filed a very similar suit,
amended the following month to
<a href="https://ia600500.us.archive.org/14/items/gov.uscourts.nysd.610699/gov.uscourts.nysd.610699.26.0.pdf">include 11 other authors</a>, this time purporting to be a class action
on behalf of everyone who's ever written a nonfiction book.
It makes nearly identical claims that GPT was trained on their books, and complains in
slightly more detail that the copies they were trained on were pirated.
This case has been assigned to the same judge and I expect it to be equally unsuccessful.</p>
<p>In early January journalists Nicholas Basbanes and Nicholas Gage filed
<a href="https://ia801205.us.archive.org/18/items/gov.uscourts.nysd.613126/gov.uscourts.nysd.613126.1.0.pdf">yet another copycat suit</a>, again complaining that GPT was trained
on pirated copies of their books, with the class this time purporting to be every
author whose works have been used to train the LLMs.
I presume this will be consolidated with the other two cases since the classes
overlap, and will meet the same fate.</p>
<p>The one case that is somewhat stronger was filed at the end of December by
<a href="https://ia801205.us.archive.org/1/items/gov.uscourts.nysd.612697/gov.uscourts.nysd.612697.1.0.pdf">The New York Times</a>.
It makes all of the same complaints about GPT using their material without
permission, but unlike the other three cases, they make an argument that is
at least somewhat plausible that it's not fair use.</p>
<p>One of the attachments to the complaint shows
<a href="https://ia801205.us.archive.org/1/items/gov.uscourts.nysd.612697/gov.uscourts.nysd.612697.1.68.pdf">a hundred examples</a> where they prompted GPT-4 with the first part of an
article, and it responded with the rest of the article or a close paraphrase.
Given the way LLMs work, one response would be to say, well, if you give it the first
half of the article, what else would you expect?
But I think this also provides some support for the argument that the way OpenAI has
used the Times' articles is too close to a substitute for the Times itself.</p>
<p>In the US, whether something is fair use is very case specific and judges have
to look at four factors listed in the law, the fourth being the effect
on the market for the work.
If the Times can make a credible argument that people use GPT to evade their paywall,
or to get their Wirecutter column's product advice without looking at the column,
that would be a strong fourth factor argument against fair use.</p>
<p>Finally, remember that "how are newspapers supposed to make money?" is an interesting question,
but not one that is particularly relevant to this case.
In the U.S. the point of copyright law is to give authors an incentive to write
stuff, but not to make any sort of promise that they'll be financially successful.
When Craigslist destroyed the classified ad business, that was great for all
of the people who can now place ads for free, financially unfortunate for
the newspapers that depended on classified ads, but it was not up to Craigslist
to replace the lost income.
In the same vein, while there are open questions of what is allowable under
fair use and what is not, "the newspapers need the money", even if true, is
not part of the discussion.</p>
<p>All four cases name Microsoft as a co-defendant, and it is obvious that the reason
they do is that Microsoft has much deeper pockets than OpenAI.
Unless OpenAI and the Times settle quickly (not out of the question since they
were negotiating before the suit was filed), this case looks like a long slog
with a great deal of discovery about exactly what training material was used,
how they used it, and dueling expert reports on what that means.</p>
</description>
</item>
<item>
<title>The Internet Archive defends Controlled Digital Lending again</title>
<link>https://jl.ly/2023/12/18#pubappeal</link>
<description><p>The Internet Archive has a program they call Controlled Digital Lending (CDL).
They have scanned a whole lot of physical books, put the books in storage, and
then lend out the scans, ensuring that each scan is lent to one person at a
time.
Publishers don't like this, sued several years ago, and the Archive lost thoroughly in April.
The judge ruled on a motion for summary judgment without a trial, which means
the judge believed there was no significant dispute about the facts.
He found that CDL was not fair use, the scans were a substitute for the paper books,
and the Archive lost.</p>
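<p>The one-copy-one-loan rule at the heart of CDL is easy to model. Here is a toy sketch in Python, purely illustrative and not anything from the Archive's actual software:</p>

```python
class ControlledLending:
    """Toy model of Controlled Digital Lending: at most one loan per owned physical copy."""

    def __init__(self, owned_copies: int):
        self.owned = owned_copies      # physical copies held in storage
        self.checked_out = 0           # scans currently on loan

    def borrow(self) -> bool:
        # A loan succeeds only while fewer scans are out than copies owned.
        if self.checked_out < self.owned:
            self.checked_out += 1
            return True
        return False

    def return_scan(self) -> None:
        if self.checked_out > 0:
            self.checked_out -= 1
```

<p>With one physical copy in storage, a second borrower is turned away until the first returns the scan, which is the "controlled" part the Archive relies on.</p>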
<p>Unsurprisingly, the Archive has appealed the ruling.
This looked to me to be a long shot.
The appeal is to the Second Circuit which decided the <i>Google Books</i> case.
Their decision said that Google's scanning is OK because they don't provide the
full contents of the books, but do other stuff that makes it transformative.
Since the Archive <i>does</i> provide the full contents of the books, they're
out of luck.</p>
<p>The Archive appealed in September but until recently the only activity has been
routine stuff like which lawyers will be representing whom.
On Friday they filed
<a href="https://ia801400.us.archive.org/13/items/gov.uscourts.ca2.60988/gov.uscourts.ca2.60988.60.0.pdf">their brief</a>
laying out the legal theory of the appeal, and I have to say it's surprisingly strong.</p>
<p>They say that the judge misunderstood what CDL is, that he got all four prongs of the fair use
analysis wrong, and that there are significant disagreements about facts that prevent summary judgment.
<hr class="seemore"></p>
<p>They start by noting that the point of fair use is to advance the goals of copyright,
and in the U.S. that goal is to get material into the hands of people who read it.
There's no question that CDL does that, but the question is whether its benefits
are outweighed by the injury to the publishers.
Needless to say, IA says that the benefit is huge and the injury is minimal.
While this is a reasonable argument, it's also one that the judge completely
rejected earlier this year.</p>
<p>They also say that CDL provides limited access to scanned
books, which is unlike the unlimited access that Napster provided to music, or
that the hypothetical publicly available Google Books would.
They also say that CDL of purchased books is a different product than
the publisher's ebook rentals through Overdrive.
Again, this is not unreasonable, but again, the judge rejected it.</p>
<p>There is a long discussion about the unlimited lending they did during
the Covid shutdown, claiming that the number of books they lent was far
less than the number locked up in closed school libraries. (The most
borrowed book was <i>The Lion, the Witch, and the Wardrobe</i>, borrowed
888 times.)</p>
<p>What I think is their strongest argument is the one they make about the
fourth fair use factor, the financial damage to the publishers.
Both sides had experts who of course came to different conclusions.
But IA says that regardless of the analysis, that difference is
a disagreement about facts, which makes summary judgment premature.
Again, it seems reasonable, but since I haven't seen the expert reports
and I can't because they contain confidential business information and
so were filed under seal, I'll have to take their word for it.</p>
<p>There's no way the Second Circuit is going to reverse and tell IA
that they win, but I think there is a reasonable chance that they
will tell the judge that there are fact questions so the case has
to go to trial.
That would be great for IA, because at a trial, a jury makes the
decision and I would expect a jury of normal people to think
that CDL is great, the publishers are greedy, and of course
CDL is legal.</p>
<p>Next up is the publishers' reply and a lot of <i>amicus</i> briefs.
Hathitrust, the library consortium that contains the Google Books
scans, has already said they'll file one and we'll surely see many more.</p>
</description>
</item>
<item>
<title>The Internet Archive hops out of the copyright frying pan into a new and different fire</title>
<link>https://jl.ly/2023/08/14#fryingfire</link>
<description><p>In 2020 a group of book publishers sued the Internet Archive over their Controlled Digital
Lending program, which made PDF scans of books and lent them out from the Archive's
web site.
For books still in copyright, the Archive usually limited the number of copies of a book
lent to the number of physical copies of the book they had in storage.
Several publishers sued with an argument that can be summarized as
"that's not how it works".
In late March the judge made a ruling that can be summarized
as "of course that's not how it works."
(More background <a href="/Copyright_Law/nocdl.html">here</a>.)</p>
<p>After several months of quiet negotiations, on Friday the two
parties filed
<a href="https://www.courtlistener.com/docket/17211300/214/1/hachette-book-group-inc-v-internet-archive/">a
proposed consent agreement</a> in which the Archive promised to stop it, and pay the plaintiffs
an undisclosed but presumably not huge amount of money.
The only disagreement was exactly what they promise to stop, with letters from each
to the judge explaining their positions.
<hr class="seemore"></p>
<p>The <a href="https://www.courtlistener.com/docket/17211300/214/2/hachette-book-group-inc-v-internet-archive/">publishers</a>
say that they will stop distributing scans of any of the publishers' books,
arguing that's consistent with the judge's decision, and it's up to the copyright
owner whether to publish an e-book, citing Art Spiegelman's graphic novel <i>Maus</i>
which the author decided would not work well as an e-book.
The <a href="https://www.courtlistener.com/docket/17211300/214/3/hachette-book-group-inc-v-internet-archive/">Archive</a>
argues that it should not apply to books that are not otherwise available as e-books,
making a fairly legalistic argument that the judge's decision was based on CDL's
competition with e-books, and citing another case that said the unavailability
of an e-book version favored a fair use finding.</p>
<p>Neither argument is ridiculous, but considering how dismissive the March
decision was, I was somewhat surprised
that the judge adopted the Archive's position on the same narrow legal grounds, saying that since the publishers
had only complained about books with electronic editions,
"the parties did not brief, and the Court did not decide, whether the unavailability of digital library licensing
would affect the fair-use analysis."</p>
<p>This agreement should provide a great deal of not totally unexpected relief for
the Archive.
US copyright law provides statutory damages for copyright infringement, which
in some cases lets a judge assess up to $150,000 per infringement, so if the publishers
had wanted, they could have asked for damages that would cripple or
destroy the Archive.
While it was always evident that for business and political reasons they
were unlikely to do so, it was a powerful lever to get the Archive to
do what they wanted.
The consent agreement is as mild as one could
imagine, just stop scanning our books, OK, done.</p>
<p>But on the very same day that consent agreement was filed with the court,
the Archive was served with another potentially much more damaging suit
from the music industry.
(I have no idea whether the timing was deliberate or a coincidence.
It's in the same court in New York but is assigned to a different judge.)
The Archive's <a href="https://great78.archive.org/">Great 78 Project</a> is making digital copies of 78 RPM
records from the late 1800s through the 1950s.
Until 1972 there was no Federal copyright on sound recordings and
they were only covered by a confusing tangle of state laws.
In 1972, new recordings were copyrighted, but not older ones.
In 2018 the <a href="https://en.wikipedia.org/wiki/Music_Modernization_Act">Music Modernization Act</a> (MMA)
retroactively gave Federal protection to pre-1972 recordings,
giving them extraordinarily long protection of 100 years or more.
As a result, starting in 2019 when the act came into effect,
every recording made after 1923 became copyrighted, with recordings
only coming into the public domain after a century.</p>
<p>While there was some merit in creating a consistent national
law for recording copyright, a 100 year term (110 for works between 1947 and 1956)
was a ridiculous gift to the music industry.
The law does have a process to make use of abandoned orphan works, about which more later.</p>
<p>The <a href="https://www.courtlistener.com/docket/67687248/1/umg-recordings-inc-v-internet-archive/">publisher's suit</a>
makes a straightforward complaint that the Archive is distributing copies of their records
without permission, and includes
<a href="https://www.courtlistener.com/docket/67687248/1/1/umg-recordings-inc-v-internet-archive/">a list</a>
of 2749 of them from the 1930s through 1950s.
A little spot checking of the Archive's web site confirms this: they really are doing that.</p>
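<p>For a sense of the stakes, here is a back-of-the-envelope ceiling, assuming purely for illustration that the $150,000 statutory maximum mentioned above applied to every recording in the complaint's list:</p>

```python
# Illustrative upper bound on statutory exposure, not a prediction of any actual award.
statutory_max_per_work = 150_000   # US statutory maximum for willful infringement
listed_recordings = 2_749          # recordings listed in the complaint's exhibit
exposure = statutory_max_per_work * listed_recordings
print(f"${exposure:,}")            # → $412,350,000
```

<p>Real awards are rarely anywhere near the maximum, but even a small fraction of that figure would be existential for a nonprofit.</p>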
<p>Another part of the Act created an onerous process for identifying orphan works,
to allow non-commercial use of them.
First you have to search to see if it's still in print, checking the Copyright Office's
database of rights owners, then use a search engine like Google, then search YouTube,
then search SoundExchange, then search Amazon for physical products, then search a
smaller specialized store, and maybe a few other places, being sure to use a variety
of search terms.
(The details are <a href="https://www.copyright.gov/music-modernization/pre1972-soundrecordings/NNUfiling-instructions.html">here</a>.)</p>
<p>If you do that and don't find the record, you go on to the next step, file a notice of
noncommercial use with the Copyright Office, using their approved PDF cover sheet
and Excel template for the works.
The Copyright Office puts them into a
<a href="https://www.copyright.gov/music-modernization/pre1972-soundrecordings/search-soundrecordings.html">public searchable database</a>.
The rights owner then has 90 days to opt out, after which noncommercial use is allowed.</p>
<p>The Archive has a page called
<a href="https://archive.org/details/unlockedrecordings">Unlocked Recordings</a> that says
"A reasonable search has been conducted to determine that these items are not commercially available"
but the complaint notes, apparently correctly, that none of them have been submitted to the
Copyright Office's database, so none of them get the orphan works safe harbor.</p>
<p>Overall, this is bad news for the Archive.
As in the CDL case, they appear to believe that the law is what they want it to be, rather than
what it actually is.
You don't have to think that the MMA's near-infinite copyright term is fair or reasonable
or good public policy to understand that nonetheless it is the law, and courts will enforce it.
The Archive is poking the music industry in the eye.
Anyone who remembers the history of Napster and P2P music sharing should realize that
the music industry has no sense of humor or proportion and when they win their suit,
which they will, they are unlikely to settle on terms as favorable as the publishers did.</p>
<p>I expect that a lot of the records in the Archive's collection really are orphaned.
They could figure out how to automate searches for many of them (their staff, some of
whom I know, are plenty smart), save the search non-results as the Copyright Office suggests,
send giant spreadsheets to the Copyright Office, and then wait.
Some of the rights holders might opt out, but most probably wouldn't, and then they'd have
a collection of legal recordings, along with documentation so other people can use them too.
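<p>A minimal sketch of what that automation might look like, with the individual searches left as placeholder callbacks (nothing here is a real API; the field names are hypothetical):</p>

```python
import csv

def is_commercially_available(recording: dict, searchers) -> bool:
    """Run the MMA-style search checklist; any hit means the record is still in commerce."""
    return any(search(recording) for search in searchers)

def record_non_results(recordings, searchers, path: str) -> None:
    """Log the recordings no search turned up, as raw material for a noncommercial-use filing."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["artist", "title", "label", "searches_run"])
        for rec in recordings:
            if not is_commercially_available(rec, searchers):
                writer.writerow([rec["artist"], rec["title"], rec["label"], len(searchers)])
```

<p>Each searcher would wrap one of the required lookups (the Copyright Office database, a general search engine, YouTube, SoundExchange, Amazon, a specialty store), and the non-results file feeds the Copyright Office's cover-sheet-and-spreadsheet filing.</p>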
I hope that when the smoke clears there's enough of the Archive left for them to do so.</p>
</description>
</item>
<item>
<title>Can large language models use the contents of your web site?</title>
<link>https://jl.ly/2023/05/07#llmcopy</link>
<description><p>Large Language Models (LLM) like GPT-4 and its front end ChatGPT work by ingesting
gigantic amounts of text from the Internet to train the model, and then responding to prompts
with text generated from those models.
Depending on who you ask, this is either one step (or maybe no steps) from Artificial General
Intelligence, or as Ted Chiang wrote in the New Yorker,
<a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web"><i>ChatGPT Is a Blurry JPEG of the Web</i></a>.
While I have my opinions about that, at this point I'm considering what the relationship
is under copyright law between the input text and the output text.
Keeping in mind that I am not a lawyer, and no court has yet decided an LLM case, let's take a look.
<hr class="seemore"></p>
<p>Copyright law is about who is allowed to make copies of what, with the precise definitions
of the terms surprisingly complicated.
In this case the what is all of the web material that LLMs use
to train.</p>
<p>The next question is what's a copy. The LLM output is rarely an
exact copy of the training material (although some of the examples in
the
<a href="https://copyrightlately.com/pdfviewer/getty-images-v-stability-ai-complaint/">Getty Images vs
Stability AI case</a> have recognizable Getty watermarks), but copies are not
just literal copies. The law says:</p>
<blockquote> A "derivative work" is a work based upon one or more preexisting
works, such as [long list of examples], or any other form in which a
work may be recast, transformed, or adapted.</blockquote>
<p>It seems to me that the output of the LLM is a work based on preexisting works,
namely, the web sites it trained on. What else could it be based on?
While there is likely some manual tweaking, that doesn't change the result. If
I take a document and translate it into another language, or take a story and turn
it into a play, it's still a derivative work. My manual work makes me a coauthor but it
doesn't wipe out the original.</p>
<p>One might make a <i>de minimis</i> argument that there is so much training data that
the amount of any particular input document in any output is too small to matter.
I find that unpersuasive.
Depending on the question, the results might be based on a million sources, or for
particularly obscure questions, a single source.
Since LLMs are notoriously unable to show their work, and as often as not make up
sources that don't exist, the burden would be on the operator of the LLM to show
how its outputs depend on its input, which at this point, they can't do.
We know that the outputs can sometimes obviously depend on identifiable inputs, e.g.,
the Getty logo example, or computer code written with distinctive variable names.</p>
<p>If the LLM output is a derivative of the training data, the next question is whether
that's OK anyway.
Under US law, "fair use" allows some kinds of copying.
The law does not define fair use, but gives judges
four factors to use to evaluate fair
use claims: the purpose and character of the use, the nature of the
work, the amount copied, and the market or value effect on the work.
In practice, the first factor is the most important, and
courts look for use that is <i>transformative</i>, that is, a use
different from that of the original.
For example, in the Google Books case, the court found that Google's
book scanning was transformative, because it created an index of words and
snippets for people to search, which is quite different from the
purpose of the books which is for people to read them.
On the other hand, in the recent Internet Archive case, the court
found that the purpose of the Archive's book scans was the same
as for the paper books, for people to read, so that's not transformative.</p>
<p>Defining the purpose of LLM output seems like a minefield. Maybe it's
the same as the source, maybe not, depending on the prompt.
If you ask it how many people live in China, that's a simple fact with
a vast list of possible sources, and the purpose isn't very interesting.
But what if you ask it to write python code to collect statistics from a
particular social network's web site, or to compare the political views of two
obscure French statesmen?
There aren't likely to be many sources for questions like that, the
sources are going to look a lot like what the LLM generates, and the
purpose of the sources is likely the same as of the result.</p>
<p>The other criteria for fair use are a mixed bag.
The nature of the work and amount of material used will vary
depending on the specific prompt and response.
For the market or value effect, often source pages have
ads or links to upsell to a paid service.
If the LLM output is a replacement for the source,
the user doesn't see the source so it lost the ad or potential upsell revenue, which weighs against fair use.</p>
<p>All this analysis involves a fair amount of hand waving, with a lot of the answer
depending on the details of the source material and what the LLM does
with it.
Nonetheless it is easy to imagine more situations like Getty Images
where the facts support claims that LLM output is a derivative work
and is not fair use.</p>
<p>LLM developers have not done themselves any favors by dodging this situation.
In <i>Field vs. Google</i>, the court held that it was OK for Google
to copy Field's web site into its web cache, both because it served a
transformative purpose and because it is easy to opt out using
the robots exclusion protocol (ROBOTS.TXT files and the like) to tell
some or all web spiders to go away.
I have no idea how I would tell the various LLM developers not to use my web
sites, a problem they could easily have solved by also following ROBOTS.TXT.
But OpenAI says:</p>
<blockquote> The Codex model was trained on tens of millions of public repositories, which were used as training data for research purposes in the design of Codex. We believe that is an instance of transformative fair use. </blockquote>
<p>Well, maybe.</p>
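<p>For comparison, opting a site out of a well-behaved crawler takes two lines of ROBOTS.TXT; if LLM trainers honored the same protocol, it could be as simple as this (the user-agent token here is hypothetical):</p>

```
User-agent: SomeLLMTrainingBot
Disallow: /
```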
<p>One other legal approach,
<a href="https://hasbrouck.org/blog/archives/002685.html">suggested by Ed Hasbrouck</a>, is to
assert the author's moral rights described in the Berne Convention on copyright, and demand
credit for use of one's material.
I don't think that's a useful approach for two reasons.
One is practical, the United States has made it quite clear in
<a href="https://www.law.cornell.edu/uscode/text/17/106A">its law</a>
that moral rights only apply to works of visual art.
Beyond that, I'd think that even in places that apply moral rights to written work,
the same derivative work analysis would apply and if an LLM used enough source material for moral
rights to apply, it'd also be enough to infringe.</p>
</description>
</item>
</channel>
</rss>