Unfair Collection: Reclaiming Control of Publicly Available Personal Information from Data Scrapers

Andrew M. Parks* March, 2022

Rising enthusiasm for consumer data protection in the United States has resulted in several states advancing legislation to protect the privacy of their residents’ personal information. But even the newly enacted California Privacy Rights Act (CPRA)—the most comprehensive data privacy law in the country—leaves a wide-open gap for internet data scrapers to extract, share, and monetize consumers’ personal information while circumventing regulation. Allowing scrapers to evade privacy regulations comes with potentially disastrous consequences for individuals and society at large.

This Note argues that even publicly available personal information should be protected from bulk collection and misappropriation by data scrapers. California should reform its privacy legislation to align with the European Union’s General Data Protection Regulation (GDPR), which requires data scrapers to provide notice to data subjects upon the collection of their personal information regardless of its public availability. This reform could lay the groundwork for future legislation at the federal level.

Introduction

In January 2021, a software engineer in New York City scoured dozens of city and state websites attempting to schedule a COVID-19 vaccination for his mother.¹ At that time, there was no uniform system for scheduling vaccination appointments. The city and state appointment systems were completely different, each with its own sign-up protocol.² Frustrated with this convoluted system, the engineer decided to develop a solution. In less than two weeks, he launched TurboVax, “a free website that compiles availability from the three main city and state New York vaccine systems and sends the information in real time to Twitter.”³ Because vaccine appointment information was publicly available on the internet, TurboVax could access this information using a computer program called a “bot.” This bot automatically checked, copied, and republished appointment data in bulk, avoiding the need to manually check government websites for available slots.⁴ The process that TurboVax used to extract vast amounts of data from the internet is called “scraping.”⁵

It’s one thing to scrape the internet for publicly available information when the content extracted is not associated with an individual’s personal information, but quite another when it is. When a LinkedIn user creates a public profile to search for employment, she may well include her phone number, email address, and a photo of her face. Although this information is technically “public,” she might reasonably expect this information to remain personal to her and within her control. She may, for instance, list her LinkedIn profile publicly while searching for a job but later set it to “private” after securing employment. Yet all her personal data—her name, phone number, email address, and photo—were, at least for some time, made public and therefore susceptible to extraction and reappropriation by scrapers.⁶ And this bell cannot be unrung.⁷

Over the past few years, consumer data privacy legislation has surfaced across the United States. The California Consumer Privacy Act (CCPA)⁸ and the California Privacy Rights Act (CPRA),⁹ for instance, now regulate the collection of consumer personal data and the sharing of such data with third parties. But no currently proposed or enacted privacy statute adequately protects publicly available personal information.¹⁰ All of it is exempted, making it fair game to be scraped, used, shared, or sold. Many scholars have written about data scraping and its legality under the Computer Fraud and Abuse Act.¹¹ Others have discussed various consumer data privacy statutes and proposals across the United States and Europe.¹² But few have addressed the privacy implications of scraping publicly available personal information,¹³ and no one has proposed a reform to regulate such activity in the United States. This Note does just that.

Part I of this Note defines data scraping, explains its purposes, and summarizes its current legality. Part II argues that publicly available personal information should be protected from data scrapers, analyzes the current landscape of state and federal consumer data privacy legislation, and explains why existing and proposed solutions are inadequate to address this issue. It also describes how publicly available personal information is handled by the European Union’s General Data Protection Regulation (GDPR). Part III argues that while passing legislation at the federal level could be desirable, California ought to amend its privacy laws to incorporate GDPR-style protections for publicly available personal information. Specifically, California should regulate the collection of publicly available personal information based on whether the information collected can be anonymized, whether the information is collected in bulk, and whether the information is collected for commercial purposes.

I. Data Scraping and Its Current Legality

To understand the privacy implications of data scraping, it is necessary to explain its function and legality. Scraping has many useful applications, and it is often employed by individuals serving the public interest. Unfortunately, scraping can also be used for malicious purposes, and businesses frequently attempt to block or deter parties from scraping their websites. As such, Part I concludes by examining the most common legal claims available to address scraping.

A. Scraping: Definition, Usage, and Purposes

Data scraping is the process of scanning and extracting large amounts of data from one or more websites using a software program often referred to as a “bot,” “robot,” or “scraper.”¹⁴ Scraping is different from “hacking,” which involves breaking into another person’s “computer, network, servers, or database,”¹⁵ typically by cracking a password or exploiting a vulnerability in the website’s code.¹⁶ Scrapers, by contrast, extract publicly available data¹⁷ and thus have no need to break into private servers.
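To make the distinction between scraping and hacking concrete, the following is a minimal sketch of what a scraper does once it has a copy of a public page. It is an illustration only, not a description of any particular company's tool: the sample page, the class names "profile," "name," and "email," and the helper scrape_profiles are all hypothetical. Note that nothing in the sketch cracks a password or exploits a vulnerability; it simply reads markup that any visitor's browser would receive.

```python
from html.parser import HTMLParser

# Hypothetical public page. The markup and class names are invented for this
# sketch; a real site would use its own structure.
SAMPLE_PAGE = """
<html><body>
  <div class="profile"><span class="name">Jane Doe</span><span class="email">jane@example.com</span></div>
  <div class="profile"><span class="name">John Roe</span><span class="email">john@example.com</span></div>
</body></html>
"""


class ProfileScraper(HTMLParser):
    """Collects the name and email fields found inside each profile block."""

    def __init__(self):
        super().__init__()
        self.records = []   # one dict per profile block encountered
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "div" and css_class == "profile":
            self.records.append({})
        elif tag == "span" and css_class in ("name", "email") and self.records:
            self._field = css_class

    def handle_data(self, data):
        # Only text inside a recognized <span> is stored.
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_profiles(html):
    """Return a list of {'name': ..., 'email': ...} dicts found in the page."""
    scraper = ProfileScraper()
    scraper.feed(html)
    return scraper.records


print(scrape_profiles(SAMPLE_PAGE))
```

Run in a loop over many pages, the same few lines would copy personal details in bulk, which is why the public availability of a page says little about how much control its subjects retain.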

Scraping has many beneficial purposes. It can be used to preserve websites, conduct research, compare product and price information from various sources, gather contact and social media data for outreach campaigns, track company reputation, and aggregate news and other content on curated websites.¹⁸ Journalists use scraping technology to gather and analyze massive chunks of statistical data.¹⁹ Scholars employ scraping technology to aid their academic research.²⁰ Advertisers use scraping technology to collect contact details and public posts on social media websites to better market their products to consumers.²¹

Although scraping has beneficial applications, scraping technology can also be used for malicious purposes, such as spamming email accounts, causing website crashes,²² or conducting scams.²³ Exemplifying morally questionable use of data scraping technology is the company Clearview AI.²⁴ Clearview scrapes billions of personal images posted on Facebook and other websites for use in its facial recognition software.²⁵ It then sells its software to law enforcement agencies, allowing police departments to “compare a face captured on a security camera against [Clearview’s] database to reveal possible matches.”²⁶ No user consents to Clearview’s collection, and even if the image is later removed from the public site, Clearview keeps a copy.²⁷ Significantly, cease-and-desist letters from Google, YouTube, Venmo, and LinkedIn have failed to stop Clearview from scraping.²⁸ Clearview has ignored the letters and maintains that it has a First Amendment right to access publicly available information.²⁹ Clearview’s facial recognition software has been used by thousands of law enforcement agencies, companies, and individuals around the world.³⁰

Scraping technology is also deployed problematically in the “mugshot industry.”³¹ In this industry, private companies use bots to scrape booking photos of arrested persons from publicly accessible law enforcement websites. The companies then display the photos in “mugshot galleries” on their websites.³² Scraping enables the companies to monetize the mugshots in various ways, such as hosting advertisements on their websites, charging visitors a fee to search their mugshot database, and—most controversially—charging subjects large fees to have their mugshots removed.³³ Even if an arrested person’s criminal record is expunged, their scraped mugshot can appear in Google search results and be dispersed across dozens of websites.³⁴

To prevent scraping, website owners often prohibit the practice in their website’s terms of service³⁵ or implement technological barriers. One such barrier is a “robots.txt” file placed in the website’s root directory, a widely used protocol that instructs specified bots to ignore certain files when crawling or scraping the site.³⁶ However, these technological barriers do not always effectively deter scraping. And as Section I.B will explain, the most common legal barriers to scraping do little to deter scraping publicly available personal information.
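The operation of a robots.txt file can be sketched with Python's standard urllib.robotparser module. The file contents, the bot name "ExampleScraperBot," the example.com URLs, and the helper functions below are hypothetical and chosen only for illustration. The sketch shows how a well-behaved bot would consult the file before fetching a page; it also suggests why the barrier is weak, since the check is voluntary and a scraper that never performs it is not technically stopped by the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The paths and the bot name are invented
# for this sketch and do not come from any real website.
ROBOTS_TXT = """\
User-agent: ExampleScraperBot
Disallow: /profiles/

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Load robots.txt rules from a string rather than over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the rules permit this bot to request the URL."""
    return parser.can_fetch(user_agent, url)


rules = build_parser(ROBOTS_TXT)

# The named bot is told to stay out of /profiles/ but may visit other pages.
print(may_fetch(rules, "ExampleScraperBot", "https://example.com/profiles/jane"))  # False
print(may_fetch(rules, "ExampleScraperBot", "https://example.com/jobs"))           # True

# Any other bot falls under the wildcard rule, which only covers /private/.
print(may_fetch(rules, "OtherBot", "https://example.com/profiles/jane"))           # True
```

Because the rules name specific bots, a scraper that identifies itself under a different name, or that simply skips the check, falls outside the instruction entirely.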

B. The Current Legal Landscape of Data Scraping

In the United States, litigation that responds to data scraping typically involves the following claims: (1) Computer Fraud and Abuse Act (CFAA) claims for scraping data “without authorization” or “exceed[ing] authorized access”; (2) state and federal copyright-infringement claims; and (3) common law trespass-to-chattels and breach-of-contract claims.³⁷ While scholars have written extensively about whether these causes of action effectively deter or prevent scraping in general,³⁸ this Section instead focuses specifically on the failure of these causes of action to protect publicly available personal information.

1. Claims Under the Computer Fraud and Abuse Act

The Computer Fraud and Abuse Act (CFAA) imposes liability on anyone who “intentionally accesses a computer without authorization or exceeds authorized access[] and thereby obtains . . . information from any protected computer.”³⁹ In hiQ Labs, Inc. v. LinkedIn Corp., a data company used bots to scrape information that LinkedIn users included on their public profiles, such as their name, job title, work history, and skills.⁴⁰ The Ninth Circuit found that this scraping did not violate the CFAA even though LinkedIn prohibits users from scraping its website in its terms of service and employs technological barriers to block scraping.⁴¹ Instead, the court held that scraping only triggers liability under the CFAA when a website is private or password protected and a user circumvents this barrier to scrape data anyway.⁴² LinkedIn then filed a petition for a writ of certiorari to the Supreme Court.⁴³

While LinkedIn’s petition was pending, the Supreme Court decided Van Buren v. United States, its first case interpreting the CFAA.⁴⁴ In Van Buren, the Court considered whether a police officer who accessed a computer for an improper purpose “exceed[ed] authorized access” in violation of the CFAA.⁴⁵ Holding that accessing a computer for an improper purpose does not violate the CFAA, the Court adopted a “gates-up-or-down” approach: a person violates the CFAA by bypassing a “gate” that is down that the person is not supposed to bypass.⁴⁶ In other words, a person needs to enter “particular areas of the computer—such as files, folders, or databases—that are off limits to him” for liability to follow.⁴⁷

After issuing this ruling, the Supreme Court granted LinkedIn’s petition for writ of certiorari in hiQ Labs.⁴⁸ Upon review, the Court vacated the Ninth Circuit’s opinion and remanded the case for further consideration in light of the Court’s ruling in Van Buren.⁴⁹ But applying Van Buren’s “gates-up-or-down” inquiry to hiQ Labs will probably not change its outcome. The data scraped on LinkedIn’s website were publicly accessible and not protected by a password. The “gates,” therefore, were not down. As such, a person who scrapes data from a publicly accessible website likely does not violate the CFAA because that person has not bypassed a “gate” barring access to publicly available data.

In Sandvig v. Sessions, the court considered whether researchers’ use of data-scraping tools constituted access “without authorization” in violation of the CFAA.⁵⁰ Because the data sought were publicly available, the court stated that “[e]mploying a bot to crawl a website . . . may run afoul of a website’s [terms of service], but it does not constitute an access violation when the human who creates the bot is otherwise allowed to read and interact with that site.”⁵¹ Given these rulings, it is unlikely that the CFAA presents any meaningful barrier to scraping publicly available personal data.

2. Copyright Infringement, Trespass to Chattels, and Breach of Contract Claims

Like claims brought under the CFAA, claims of copyright infringement, trespass to chattels, and breach of contract are unlikely to protect individuals’ publicly available personal information from scrapers. First, when the data include personal information (for example, an individual’s name, address, email address, phone number, geolocation data, or internet browsing history), courts tend to find that the scraping does not constitute copyright infringement because facts are not copyrightable.⁵² Copyright law distinguishes noncopyrightable facts from copyrightable works of authorship that are independently created by the author and possess at least a minimal degree of creativity.⁵³ One district court has held that scraping data from Southwest Airlines’ website did not constitute copyright infringement because “[f]are, route and scheduling information are all facts and thus not copyrightable.”⁵⁴ Personal data similarly are facts, not works of authorship, suggesting that copyright law cannot serve as a remedy for this kind of data scraping.

Second, scraping could constitute trespass to chattels—intentional interference with another’s personal property⁵⁵—if the bots used for scraping impede the website owner’s ability to use portions of its servers.⁵⁶ These claims may provide an effective method for website owners to deter scrapers from impermissibly collecting data from their websites.⁵⁷ But because most individuals are not website owners and do not host their own data, they have no trespass to chattels claim to bring against scrapers who trespass upon or impede access to web servers.

Finally, data scraping may constitute breach of contract when a website’s terms of service expressly prohibit scraping and users scrape data anyway.⁵⁸ The enforceability of an antiscraping provision in a website’s terms of service often depends on whether the agreement required the scraper to affirmatively manifest assent to its terms.⁵⁹ Even if it did, the terms ordinarily bind only the parties to the agreement—the website owner and the scraper. Thus, such agreements would not necessarily create any cause of action for individuals whose personal information is scraped from a website.⁶⁰

II. The Data Scraping Loophole

Part II of this Note argues that the scraping of even publicly available personal information should be regulated. Although state and federal consumer data privacy laws have been enacted or proposed in the United States, data scraping circumvents them, rendering them inadequate to address this issue. In contrast, the European Union’s General Data Protection Regulation (GDPR) provides a more robust model for amending the American legal framework on data privacy.

A. Publicly Available Personal Information Should Be Protected

Even when the information is publicly available, scraping personal information is problematic. In the absence of statutory and other legal protections for personal information, courts have held that scraping personal information is permissible so long as the information is publicly accessible.⁶¹ But personal information may be made public for various reasons, often without the knowledge or consent of the subject.⁶² And even when a subject voluntarily makes her information public, she likely does so without meaningful consent and without considering the potentially damaging implications of such a decision, both for herself and society at large.

1. Information Made Public Without the Subject’s Knowledge or Consent

In many cases, an individual never knows that their personal information has been made public, making it impossible for them to consent to its publication. Some personal information is made public through lawful government public records. For example, the Federal Election Commission (FEC) website publicly displays federal political campaign contributions. These data include each contributor’s full name, mailing address, occupation, employer, and contribution amount.⁶³ Yet individuals probably do not realize they are publicly disclosing all of this personally identifiable information when they donate to a campaign.

An individual’s personal information might also be made public when a third party publishes it online without their consent. This sometimes takes the form of “doxing”—a kind of cyber harassment involving “the public release of personal information that can be used to identify or locate an individual.”⁶⁴

Finally, personal information might also be made public as a result of a data breach.⁶⁵ Hackers frequently sell databases of stolen data records from businesses on the dark web for large sums of money.⁶⁶ If businesses delay or choose not to disclose the cyber breach to consumers, the consumers may never know their information was hacked and potentially made public.⁶⁷

Some have argued that personal data should be treated like property, owned and controlled by the individual.⁶⁸ Although current U.S. law does not recognize any definitive right of ownership to data,⁶⁹ users nevertheless might naively believe that they control theirs. After all, they can control whether to set their social profiles to “public” or “private,” and they decide whether to hide or archive content previously posted publicly. That an individual’s personal information has been published publicly on the internet should not automatically grant internet data scrapers carte blanche authority to extract, reappropriate, or monetize it. Personal information is just that: personal.

2. Information Made Public Voluntarily Should Still Be Protected

Even when an individual voluntarily makes her information public, she still retains a privacy interest in controlling it. A dissent penned by Justice Gorsuch in a different legal context—government collection of personal information from third parties for criminal investigations—provides helpful insights:

[T]he fact that a third party has access to or possession of your papers and effects does not necessarily eliminate your interest in them. Ever hand a private document to a friend to be returned? Toss your keys to a valet at a restaurant? Ask your neighbor to look after your dog while you travel? You would not expect the friend to share the document with others; the valet to lend your car to his buddy; or the neighbor to put Fido up for adoption.⁷⁰

This reasoning can be extended. For example, just because a user posts her home address on a publicly available website does not eliminate her interest in later preserving the privacy of that information. She may have made the post public only temporarily. She may have accidentally posted it publicly when she intended it to be private. Or she may have posted it to her private profile—specifically electing to make the information viewable only by a select group of friends on her account—and yet one of those friends with access may have reposted or redistributed her information publicly. In each of these scenarios, her interest in preserving the privacy of her personal information should not be completely eliminated merely because it wound up publicly accessible at least for some time.

Unless a user affirmatively changes her privacy settings on the websites and social media platforms to which she gives her data, third parties can probably access her information. Most social networking platforms make users’ content publicly accessible by default.⁷¹ But even if a prudent person were to set her profile to “private” to hide her personal information from public view, data scrapers might still be able to access it.⁷² And on non-social media websites that limit access to those with login credentials, a scraper would only need to sign up for an account to gain access.⁷³ LinkedIn’s privacy policy warns:

Please do not post or add personal data to your profile that you would not want to be publicly available. . . . Your profile is fully visible to all Members and customers of our Services. Subject to your settings, it can also be visible to others on or off of our Services (e.g., Visitors to our Services or users of third-party search engines).⁷⁴

Still, even if a website includes a warning about data scraping, studies suggest users are unlikely to take heed. A 2017 survey of two thousand U.S. consumers found that 91 percent of people consent to terms of service without reading them.⁷⁵ For those aged 18 to 34, the rate was 97 percent.⁷⁶ In light of these statistics, it would be imprudent to conclude that the average user has knowingly consented to scraping by bots if she accidentally posts a photo of herself publicly on Instagram.⁷⁷

3. The Dangers of Allowing the Scraping of Personal Information in Bulk

What flows from scrapers’ ability to extract individuals’ publicly available personal data is alarming. At its most innocuous, data scraping permits third parties to monetize our personal information without our knowledge or consent. At its most dangerous, it has the potential to vastly restrict liberty, undermine democracy, and even put people in physical danger.

The following examples highlight the very different but equally dangerous consequences of data scraping. In February 2019, an African American man named Nijeer Parks was falsely accused of shoplifting and attempting to hit a police officer with a car outside a motel in Woodbridge, New Jersey.⁷⁸ He spent ten days in jail and paid around $5,000 to defend his case.⁷⁹ Parks’s arrest stemmed from facial recognition software misidentifying him.⁸⁰ Parks later sued the city’s police department, alleging that it used facial recognition software from Clearview AI.⁸¹ While it is still unclear whether Clearview AI was used in his apprehension,⁸² the mass scraping of publicly available personal data can lead to misidentification resulting in false arrests, jail time, and thousands of dollars in attorney fees.⁸³

Mass scraping of personal information creates dangers that go beyond mere loss of privacy; it also enables cybercrime. In April 2021, Insider reported that personal data, ranging from phone numbers to locations, of over 533 million Facebook users were scraped and leaked in hacking forums.⁸⁴ Facebook confirmed that “malicious actors obtained this data not through hacking [Facebook’s] systems but by scraping it.”⁸⁵ Alon Gal, the chief technology officer of the cybercrime intelligence firm Hudson Rock, noted that the leaked data “could prove valuable to cybercriminals who use people’s personal information to impersonate them or scam them into handing over login credentials.”⁸⁶ Other researchers posit that the data could be used to gain access to individuals’ Facebook accounts, email accounts, and other social networking accounts because, once a hacker has a victim’s email address, they might be able to log into their other accounts by pairing the email address with simple passwords.⁸⁷ Phone numbers, in particular, have “taken on new significance and potential value to attackers” as they are “ubiquitous identifiers, linking you to different parts of your digital life” and “play[ing] a role in sensitive authentication.”⁸⁸ Shockingly, just days after its Facebook story, Insider reported that the personal data of over 500 million LinkedIn users were also scraped and published for sale online.⁸⁹

Scraping also has the potential to influence elections by extracting personally identifiable information in order to target individual voters. Aggregate IQ, a Canadian digital advertising and software development company, infamously influenced the United Kingdom’s 2016 EU referendum by scraping individuals’ profile information on LinkedIn and Facebook and serving them targeted ads supporting the “Vote Leave” campaign.⁹⁰ In 2016, Donald Trump’s campaign hired the political data firm Cambridge Analytica, which scraped the private information of more than fifty million Facebook users.⁹¹ The firm used these data to “identify the personalities of American voters and influence their behavior”⁹² and “orchestrate[] emotionally charged political campaigns that advanced demeaning, racialized, nationalistic propaganda.”⁹³

Finally, data scraping can place people in physical danger by easing access to individuals’ whereabouts. The story of Judge Esther Salas of the District of New Jersey illustrates the perils of publicizing personal information. In July 2020, an angered attorney sought revenge against Judge Salas for her handling of a lawsuit he filed in her court.⁹⁴ On a Sunday afternoon, the attorney showed up to Judge Salas’s home and rang her doorbell.⁹⁵ Her only son, a college student named Daniel, opened the door. The attorney fired multiple gunshots, shooting and killing Daniel. He then shot Judge Salas’s husband three times, seriously wounding him.⁹⁶

Easy access to Judge Salas’s personal information, including her home address, enabled the gunman to hunt down her family. In a New York Times op-ed, Judge Salas wrote that FBI agents informed her of how easy it is to find and purchase personal information about judges on the internet, including photos of their homes and the license plates on their vehicles.⁹⁷ In Judge Salas’s case, the gunman “was able to create a complete dossier of her life: he stalked her neighborhood, mapped her routes to work, and even learned the names of her best friend and the church she attended.”⁹⁸ This access to Judge Salas’s personal information was completely legal, and it enabled the shooter to kill her only child.⁹⁹ Although it is not clear whether data scraping may have contributed to this specific incident, there is no question that data scraping could facilitate harm through collecting and publicizing personal information of the kind that allowed the gunman to arrive at Judge Salas’s door.

Using bots to scrape data in bulk from various publicly available sources makes it easier to collect and compile an abundance of personal information for potentially malicious purposes. Scraping enables its practitioners to more easily create a “complete dossier” of an individual’s life. It can increase false arrests and influence elections. It’s a useful tool for scammers, stalkers, and scoundrels. And what’s worse, as the next Section explains, is that no existing or proposed legislation restricts scraping publicly available personal information.

B. Scraping Personal Information Circumvents Current and Proposed Privacy Laws

Existing and proposed consumer privacy laws fail to adequately protect individuals’ personal information from data scrapers. Indeed, there is currently no comprehensive data privacy legislation enacted at the federal level.¹⁰⁰ However, responding to rising enthusiasm for consumer data privacy protection, several states have enacted or introduced legislation to protect the privacy of their residents’ personal information, including California,¹⁰¹ New York,¹⁰² Virginia,¹⁰³ Nevada,¹⁰⁴ Florida,¹⁰⁵ Colorado,¹⁰⁶ New Hampshire,¹⁰⁷ Washington,¹⁰⁸ and Illinois.¹⁰⁹ But even the strictest regulations contain gaping regulatory holes allowing scrapers to run wild with individuals’ data.

California has enacted the most comprehensive data privacy laws to date in the United States.¹¹⁰ The California Consumer Privacy Act (CCPA) went into effect in 2020 and provides Californians with certain rights regarding businesses’ collection and sale of their personal information.¹¹¹ The California Privacy Rights Act (CPRA) was then enacted in November 2020 and will take effect in January 2023.¹¹² It expands upon the CCPA, creating a new agency called the California Privacy Protection Agency dedicated to enforcing the new privacy law.¹¹³ California’s legislation, however, does not prevent companies from using bots to scrape personal information from publicly available websites. Scraped data fall outside its scope and remain unregulated.

To illustrate, the CPRA gives California consumers the ability to opt out from companies sharing, selling, or even retaining their data.¹¹⁴ But what if those third parties simply scrape their data instead? In that case, the information was neither shared nor sold. The scrapers just took it, leaving the subjects unable to opt out. The CPRA also allows consumers to request disclosure of the “categories of personal information that the business collected about the consumer” and the “categories of personal information that the business sold or shared about the consumer and the categories of third parties to whom the personal information was sold or shared.”¹¹⁵ But consumers cannot be made aware of this same information if some unknown party simply scrapes their data.

While section 1798.100 of the CPRA provides an expansive notice-at-collection provision, requiring certain businesses to inform their consumers about aspects of personal data collection,¹¹⁶ publicly available information remains unprotected by that same statute’s definition of “personal information.” The definition of “personal information” in section 1798.140(v)(2) expressly excludes publicly available information.¹¹⁷ Thus, the statute permits businesses to scrape personal information from publicly available websites without providing any notice.

Moreover, regulations issued by California’s attorney general pursuant to the CPRA provide that “[a] business that does not collect personal information directly from the consumer does not need to provide a notice at collection to the consumer if it does not sell the consumer’s personal information.”¹¹⁸ Thus, a business that collects a consumer’s personal information by scraping it from an intermediate source only needs to provide notice to the consumer if it intends to sell it.¹¹⁹ The language in these provisions reveals a gaping hole in personal data privacy regulations.

Other states have followed California’s lead, but similarly fail to address privacy concerns for publicly available data. Virginia, for example, became the second state to enact a comprehensive data privacy statute in March 2021.¹²⁰ Virginia’s law, the Consumer Data Protection Act, imposes data processing obligations for businesses processing consumers’ personal information, and it gives consumers various privacy rights similar to those granted by California law.¹²¹ The legislation contains no private right of action and exempts several entities and types of data.¹²² Like the California legislation, it excludes publicly available information from its definition of “personal data,”¹²³ and it defines “publicly available information” broadly, encompassing information that “a business has a reasonable basis to believe is lawfully made available to the general public through widely distributed media, by the consumer, or by a person to whom the consumer has disclosed the information, unless the consumer has restricted the information to a specific audience.”¹²⁴ The law also limits the obligations imposed on data processors where those obligations “adversely affect[] the rights or freedoms of any persons, such as exercising the right of free speech under the First Amendment to the United States Constitution.”¹²⁵

Similar defects are present in Colorado’s recently enacted privacy law, the third comprehensive data privacy statute adopted in the United States.¹²⁶ Like its California and Virginia kin, the Colorado Privacy Act excludes publicly available information from the scope of its regulations, and its definition of “publicly available information” covers any “information that a controller has a reasonable basis to believe the consumer has lawfully made available to the general public.”¹²⁷ In sum, businesses scraping publicly available personal data remain unregulated even by the most expansive state data privacy laws.

At the federal level, privacy legislation has been similarly inadequate. For example, the proposed Consumer Data Privacy and Security Act of 2020 exempts publicly available information from its scope.¹²⁸ It also contains a broad definition of publicly available information,¹²⁹ permitting data scrapers to extract whatever personal information is posted on publicly available websites so long as there is a reasonable basis to believe that the individual volunteered it.

Another proposal, the SAFE DATA Act, similarly excludes publicly available information from its protection.¹³⁰ It broadly defines “publicly available information” to include any information that the entity reasonably believes has been made widely available to the general public, including information from a public website.¹³¹ The Online Privacy Act¹³² and the Privacy Bill of Rights Act¹³³ are comparably deficient because they also exclude publicly available information.¹³⁴ Simply put: no currently enacted or proposed legislation in the United States satisfactorily shields individuals’ publicly available personal information from the claws of data scrapers.

C. An Alternative Framework: The European Union’s General Data Protection Regulation

In sharp contrast with the United States, the European Union considers data privacy a fundamental right.¹³⁵ Even the scope of what is considered “personal data” or “personally identifiable information” differs substantially. In the United States, these terms apply to specific categories of information, with restrictions placed on the use of those categories of information applying only to certain industries.¹³⁶ Conversely, the EU’s definition of personal data is deliberately broad and aimed at protecting an individual’s right to privacy.¹³⁷ Comparing these legal frameworks reveals that the United States’ laws are underdeveloped with respect to ensuring data privacy for its people. If the United States wishes to make progress in this field, it should follow Europe’s lead.

In 2016, the EU passed the General Data Protection Regulation (GDPR).¹³⁸ It is described as “the toughest privacy and security law in the world,” imposing obligations on any organization—regardless of location—that targets or collects data related to people in the EU.¹³⁹ Unlike the CPRA, the GDPR’s definition of “personal data” contains no exception for publicly available information.¹⁴⁰ The regulation provides EU citizens with rights, including the right to be notified when their personal data are collected, the right to access any of their collected personal data, the right to rectify inaccurate personal data, and the right to erasure of their personal data.¹⁴¹

Most relevant to the issue of data scraping is article 14 of the GDPR. It obligates data controllers¹⁴² to inform those whose personal data they intend to process when the information in question has not been directly obtained from them—for instance, when their personal data have been scraped off the public internet.¹⁴³ Pursuant to article 14, data scrapers—when they scrape publicly available personal information concerning persons in the EU—must provide extensive notice to every data subject¹⁴⁴ within one month of scraping their data.¹⁴⁵

However, even when data scrapers are able to provide such notice to all of the data subjects, the scraping must still meet certain criteria for it to be lawful. Article 5 provides that personal data shall be collected for “specified, explicit and legitimate purposes.”¹⁴⁶ Moreover, scrapers may only collect information necessary for the purposes for which the data are processed¹⁴⁷ and may not retain personal data longer than is necessary for those purposes.¹⁴⁸

Finally, the GDPR requires a lawful basis for data collection. There are six lawful bases available under the GDPR: (1) consent, (2) contract with the data subject, (3) compliance with a legal obligation, (4) vital interest, (5) public interest, and (6) legitimate interest.¹⁴⁹ Of these, the only fitting lawful ground for scraping is legitimate interest.¹⁵⁰ For scraping to satisfy the legitimate-interest lawful-basis requirement, the data scraper’s legitimate interest must outweigh the data subject’s “interests or fundamental rights and freedoms.”¹⁵¹ Together, these requirements dramatically restrict lawful data-scraping activity.

In March 2019, Poland’s Data Protection Agency (DPA), acting pursuant to the GDPR, issued its first fine involving data scraping.¹⁵² The agency held that Bisnode—a digital marketing company that scraped six million people’s personal data—failed to respect data subject rights set out in article 14 of the GDPR because it did not notify the data subjects.¹⁵³ Bisnode was slapped with a €220,000 fine and given three months to comply with article 14’s information-notification requirements.¹⁵⁴ Bisnode attempted to meet its notification obligation through a website posting, but the Polish DPA “rejected the argument that placing a privacy notice on the data scraping business’s website was enough to notify individuals, particularly where individuals were not aware that their data had been scraped and was being processed.”¹⁵⁵ Similarly, in April 2021, Spain’s data protection authority ordered Equifax to delete personal data it collected and pay a fine of about $1.1 million for including in credit reports publicly available data it scraped from government sources about individuals’ outstanding debts.¹⁵⁶

As Section II.B analyzed, the current and proposed data privacy statutes in the United States contain loopholes that allow data scraping of personal information to go unregulated. However, the GDPR’s application to U.S. companies does not fill those loopholes. Instead, the GDPR should serve as a model for U.S. legislation with respect to preventing and deterring the scraping of individuals’ personal information.

Even though U.S. companies are not exempt from the GDPR’s territorial scope,¹⁵⁷ domestic legislation is required to similarly protect U.S. persons’ data from data scrapers. The GDPR’s regulations apply to companies established in the EU and companies (including those in the United States) that process personal data of subjects who are in the EU.¹⁵⁸ Notably, the GDPR’s application is not limited to the collection and processing of EU citizens’ and residents’ data. For example, it includes U.S. persons located within EU borders when their data is processed.¹⁵⁹

But the fact that a U.S. company complies with the GDPR does not necessarily mean that domestic U.S. persons’ data is protected. First, companies often have different versions of their websites based on the various territories in which they do business, each version providing different data privacy rights, policies, and procedures.¹⁶⁰ Companies that comply with the GDPR may do so only on a territorial basis, and their scraping activity may similarly follow territorial bounds.

Second, even if U.S. companies chose to follow GDPR standards for all data subjects (including domestic U.S. persons), this would not confer upon U.S. persons the full breadth of the GDPR’s rights and protections, requiring domestic legislation to fill the gaps. For example, article 82 of the GDPR provides the right to compensation for “[a]ny person who has suffered material or non-material damage as a result of an infringement” of the GDPR.¹⁶¹ Article 77 permits such persons to “lodge a complaint with a supervisory authority” to enforce their rights.¹⁶² Conversely, domestic U.S. persons, whom the GDPR does not protect, would not have any such remedy or method to enforce their rights without U.S.-specific legislation. Thus, a U.S. company that scrapes personal data without providing notice at collection pursuant to article 14 would be subject to liability only where the personal data collected are those of persons in the EU, but the company would not be subject to liability for scraping data of persons in the United States. If no notice is provided (or if the collection violates other provisions of the GDPR), persons in the EU can lodge a complaint to enforce their rights; persons in the United States cannot.

For these reasons, the United States must enact domestic privacy legislation to ensure similar data protection for its people, and it should look to the GDPR as a model for such legislation. With respect to data scraping, a domestic statute aligned with the GDPR should contain a definition of personal information that doesn’t exclude publicly available information¹⁶³ and a provision similar to article 14, which requires a business to give notice to data subjects whose information it scrapes from the internet.¹⁶⁴

III. A Proposal for California: “Fair Collection”

While passing legislation at the federal level could be desirable, this Part asserts that California should reform its data privacy legislation to conform with the protections afforded by the GDPR. Doing so would deter impermissible scraping by providing a remedy to individuals whose personal information has been scraped without notice. Finally, this Part addresses potential counterarguments, including concerns regarding the First Amendment implications of attempting to regulate the collection of publicly available data.

A. California Should Adopt GDPR-Style Regulations to Shield Publicly Available Personal Information from Data Scrapers

In the United States, many have called for preemptive legislation at the federal level to fill the domestic consumer data privacy void.¹⁶⁵ Several reports indicate that both Democrats and Republicans want to “take on Big Tech” with laws and regulations addressing several issues, including data privacy.¹⁶⁶ To be clear, data privacy legislation at the federal level could be beneficial. But to date, Congress’s federal privacy legislation has been limited to sector-specific laws.¹⁶⁷

Gridlock in Washington might diminish any potential for an all-encompassing data privacy law,¹⁶⁸ especially one that addresses this Note’s narrow topic of scraping publicly available personal information. Consumer privacy advocates have raised the concern that federal legislation would not embrace the comprehensiveness and strictness of the CPRA or GDPR.¹⁶⁹ The fear is that federal preemptive legislation would “wipe[] out more stringent state rules” like those in California.¹⁷⁰ And despite the benefits that might flow from uniformity throughout the nation, consumers could be left with “the lowest common denominator” of privacy legislation.¹⁷¹ Instead, some argue that any federal standard should serve as a minimum level of compliance, allowing states to pass their own stronger laws.¹⁷² Because this Note is narrowly focused on reforming the way privacy law treats publicly available personal information for purposes of data scraping, the most straightforward approach is to amend the California legislation. Doing so could serve as a model for future federal legislation.

To address data scraping of publicly available personal information, California should amend its privacy laws enacted through the CCPA and CPRA. First, it should remove section 1798.140(v)(2), which, as previously noted, excludes publicly available information from its definition of personal information.¹⁷³ Removing this provision would keep publicly available personal information within the scope of California’s privacy protections, as it remains within the scope of the GDPR’s protections.¹⁷⁴

Second, California should not exempt businesses from notice-at-collection requirements when they do not collect personal information directly from the consumer.¹⁷⁵ Thus, as in article 14 of the GDPR, businesses would also have to provide notice when collecting personal information indirectly or from a source other than the data subject.¹⁷⁶

Third, California should expand the private right of action provided by section 1798.150. Currently, that provision only permits consumers to bring civil actions if their information is subject to a data breach.¹⁷⁷ It should be expanded to allow consumers to bring civil actions when businesses fail to notify individuals that their personal data have been collected.

In place of the removed provisions, California’s legislature ought to adopt a more nuanced approach that would prohibit most forms of data scraping while permitting innocuous collections of personal information. This would be similar to permitting scraping where there is a lawful basis under the GDPR.¹⁷⁸ Here, this Note envisions permitting data scraping when the information collected is more likely to be anonymized, is not collected in bulk, and is collected for journalistic or academic purposes. Just as the “fair use” doctrine in the Copyright Act allows certain permissible uses of a copyrighted work to avoid copyright infringement liability,¹⁷⁹ California’s privacy regulations should exempt certain collections and uses of personal information that it deems permissible when the personal information is not collected directly from the subject. Let’s call it “fair collection.” This Note proposes the following language:

Notwithstanding § 1798.100, the “fair collection” of personal information—such as when the personal information is collected in small quantities for academic, educational, or journalistic purposes—is exempt from the notice-at-collection requirements when the personal information is not collected directly from the consumer. In determining whether the collection of personal information in any particular case constitutes “fair collection,” the factors to be considered shall include:
(a) the personal nature of the information collected, such as whether the information is anonymized or is capable of individually identifying a person;
(b) the volume of the information collected; and
(c) the purpose and character of the collection, including whether the collection is done for academic or legitimate news reporting purposes or to address matters of public concern, or instead is collected for commercial purposes.

Taking a “fair collection” approach would allow California to regulate data scrapers that collect massive amounts of personal data for commercial purposes while permitting small amounts of data collection when it is unlikely to be harmful.

The three proposed factors close the present gaps in the CCPA and CPRA. Factor (a) considers whether the information collected is capable of personally identifying an individual, which is the reason for regulating this activity in the first place. Some information, like health data or internet browsing history, is capable of anonymization and thus could be permissibly collected. But collecting an email address, IP address, phone number, or an image of someone’s face should be regulated because such information is inextricably linked to a particular individual and cannot be anonymized unless outright redacted.

Factor (b) considers the volume of the data collected. Data scraping is a particular method of gathering information. What makes scraping different from an individual user manually gathering information from the internet is the ability for the data scraper to collect information automatically and in bulk.¹⁸⁰ If a business downloads a handful of publicly available email addresses, that could be exempted from regulation under this Note’s proposal. But if a business uses bots to collect thousands of email addresses from some publicly available source, it would be regulated. The absence of a bright-line rule for what volume of collection is permissible is a feature, not a bug. If scrapers aren’t sure how much collection is too much, that uncertainty functions to deter scraping.¹⁸¹ Conversely, if businesses knew that scraping under a certain volume of personal data would likely be permissible, they might confidently continue to do so.

Factor (c) considers the purpose and character of the collection. Personal data collected for commercial purposes would be subject to greater scrutiny than data collected for academic or journalistic purposes, or to address matters of public concern. Taken together, collecting publicly available personal information would be permissible when the information is less likely to identify a specific individual, the collection only concerns a small number of data subjects, and the collection furthers a beneficial public purpose. Collecting publicly available personal information would be impermissible when the information identifies individual subjects, is collected in large quantities, and is collected for commercial purposes. This proposed reform would finally address data scraping and protect individuals’ personal information regardless of whether it is publicly available. Further, it would align the protections of the CPRA more closely to those of the GDPR.

While this Note cites many instances of scraping activity conducted by businesses, individual malicious actors also partake in data scraping. Recall the massive leaks of over 533 million Facebook users’¹⁸² and 500 million LinkedIn users’ personal information obtained by data scrapers.¹⁸³ As of this writing, there is no indication that any corporate data firm was responsible for this bulk scraping.¹⁸⁴ The reporting suggests that both leaks were the result of coordinated scraping efforts conducted by individuals, not businesses. Shouldn’t California’s privacy legislation regulate this activity as well?

Presently, the CPRA’s regulations apply only to businesses that (a) have annual gross revenues in excess of $25 million; (b) buy, sell, or share the personal information of 100,000 or more consumers or households; or (c) derive 50 percent or more of their annual revenues from selling or sharing consumers’ personal information.¹⁸⁵ Individuals who collect personal data are not regulated. While a proposal to enact civil or criminal sanctions on individuals is beyond the scope of this Note, California should also consider methods to address massive data collection at the hands of individual scrapers who might use the data to conduct scams and cybercrimes.

B. Addressing First Amendment Concerns

Critics of limiting scraping in the ways this Note proposes would likely argue that such restrictions violate the First Amendment.¹⁸⁶ In Sorrell v. IMS Health Inc., the Supreme Court held that creating and disseminating information qualify as protected speech under the First Amendment.¹⁸⁷ While restrictions on scraping would not directly implicate the publication of personal information, they would limit accessing and recording publicly available facts—activities that contribute to the creation of speech. Restricting the ability to access and record facts disables one from later speaking and disseminating information about those facts. The Sorrell Court noted that “[f]acts, after all, are the beginning point for much of the speech that is most essential to advance human knowledge and to conduct human affairs.”¹⁸⁸ It follows that laws burdening the underlying inputs of speech implicate the First Amendment.

If scraping qualifies as speech, it would likely be considered conduct “incidental to, or in preparation for, speech” under the First Amendment.¹⁸⁹ For instance, some argue that video recording is a form of expression covered by the First Amendment because it is conduct essential to speech.¹⁹⁰ In ACLU of Illinois v. Alvarez, the Seventh Circuit recognized a right to record in enjoining the enforcement of an Illinois all-party-consent wiretap statute.¹⁹¹ There, the court held that “[c]riminalizing all nonconsensual audio recording necessarily limits the information that might later be published or broadcast . . . and thus burdens First Amendment rights.”¹⁹² The right to create the recording, the court reasoned, is “necessarily included within the First Amendment’s guarantee of speech and press rights as a corollary of the right to disseminate the resulting recording.”¹⁹³

This reasoning may also extend to data scraping. As in Alvarez, limiting scraping “necessarily limits the information that might later be published or broadcast,” and thus burdens First Amendment rights.¹⁹⁴ In Sandvig v. Sessions, a case involving data scraping of a publicly available website, the court observed that “even if a law says nothing about speech on its face, it is subject to First Amendment scrutiny if it restricts access to traditional public fora.”¹⁹⁵ There, because the statute at issue “limit[ed] access to and burden[ed] speech in the public forum that is the public Internet,” heightened First Amendment scrutiny was appropriate.¹⁹⁶

The question, then, is whether this Note’s proposed limitations on data scraping would survive First Amendment scrutiny. To reiterate, this Note’s reform would limit scraping of personal information in bulk for predominantly commercial purposes. Where commercial speech is involved, courts apply intermediate scrutiny: the state’s restriction on commercial speech must directly advance a substantial governmental interest and must be drawn to achieve that interest.¹⁹⁷

First, in limiting how corporations collect and monetize consumers’ personal information, governments like California’s have a “substantial interest” in promoting consumer data privacy.¹⁹⁸ Indeed, the California Constitution expressly makes privacy an “inalienable” right of all people.¹⁹⁹ And as the Supreme Court has recognized, the fact that “an event is not wholly ‘private’ does not mean that an individual has no interest in limiting disclosure or dissemination of the information.”²⁰⁰ The existence of a modern technological tool like scraping “only heightens the consequences of disclosure—‘in today’s society the computer can accumulate and store information that would otherwise have surely been forgotten.’ ”²⁰¹ Here, scraping poses a substantial threat to individuals’ privacy, especially in cases where their personal information has been made public without their knowledge or consent.²⁰² It allows data that individuals may intend to restrict to instead be continuously collected and shared outside their control.

Second, California also has a substantial interest in protecting its residents’ First Amendment interests—namely, free expression that relies on privacy. In her concurrence in United States v. Jones, Justice Sotomayor noted that even where personal information is publicly available, its collection and compilation can reveal a “comprehensive record” of a person’s activity that reflects “a wealth of detail about her familial, political, professional, religious, and sexual associations.”²⁰³ Data scraping could be viewed “as such an egregious invasion of privacy that users’ First Amendment activity on online platforms would be chilled.”²⁰⁴ Fear that all of an individual’s personal information is susceptible to scraping and misappropriation could curb the use of certain internet platforms. Fear or suspicion that one’s speech is constantly monitored “can have a seriously inhibiting effect upon the willingness to voice critical and constructive ideas.”²⁰⁵ Finally, California has a substantial interest in protecting the integrity of its elections. As exposed by the malfeasance of Aggregate IQ and Cambridge Analytica, data scraping has the potential to undermine elections by scraping individuals’ social media profile information and serving them targeted ads meant to influence their vote.²⁰⁶

The statutory reform proposed in this Note—limiting the bulk collection of publicly available personal information for commercial purposes—is narrowly drawn to meet California’s interests and thus should pass First Amendment scrutiny. The Constitution affords lesser protection to commercial speech than to other constitutionally guaranteed expression.²⁰⁷ This reform does not prohibit all access to publicly available information; it merely restricts its collection in bulk and for commercial purposes when it involves personal information that cannot be anonymized.

Indeed, there are provisions of California’s current privacy laws that arguably infringe on First Amendment rights far more than this Note’s proposal.²⁰⁸ In the context of government collection of personal information for criminal investigation purposes, the Supreme Court has held that “a person has no legitimate expectation of privacy in information he voluntarily turns over to third parties.”²⁰⁹ But, as this Note explains, information is often publicized without any voluntary action or consent from the data subject. And this Note’s proposed reform does not restrict collecting single individuals’ information, but rather data in bulk. Most importantly, it also exempts from its restrictions data collected for journalistic purposes or to address matters of public concern. Criminal investigations would surely qualify for this exemption. The statutory language suggested in this Note likely comports with the Supreme Court’s view of privacy and does not regulate beyond what is necessary to meet California’s interests. Accordingly, it should survive any challenges sounding in the First Amendment.

Conclusion

Data scraping can be greatly beneficial, but it presents serious concerns when the data contains individuals’ personal information. As the author of this Note, I am grateful for data-scraping technology because it made this Note possible. After all, the research tools I used aggregate publicly available information in the form of statutes, cases, and law review articles. But when the information collected is not a judicial opinion but an individual’s personal data, more is at stake. Scraping of such data in bulk can harm individual privacy, undermine democracy, and potentially even physically endanger us. Today’s privacy statutes do not do enough to address this issue, allowing businesses to scrape and repurpose our personal information with near impunity. California should adopt a new approach that restricts the collection of even publicly available personal information, only allowing such collection when it deems it fair and permissible. Other states—and perhaps the federal government—should soon follow suit.


* J.D. Candidate, May 2022, University of Michigan Law School. I am grateful to Professor Barbara McQuade for her wonderful input, insights, and encouragement. Thank you to Emily Grau for her thoughtful comments throughout the writing of this piece. Thank you to my family, especially George Parks, for their support. Finally, thank you to all the members of the Michigan Law Review Volume 120 Notes Office for their invaluable feedback and edits.