
Is Web Scraping Legal? Unraveling the Complexities of Data Privacy & Ownership


AI data scraping could put your private info at risk. The Lyon Firm can protect you. 


Investigating Illegal Data Harvesting and AI Privacy Violations

AI and data collection are closely connected, driving new technological breakthroughs. The very nature of AI relies on consuming huge amounts of data to train and test algorithms. It is a simple strategy: the more data available to train online chatbots, the better they perform.

Tech companies have unleashed powerful software that collects vast amounts of available data to teach generative AI systems. However, this is blurring the lines between legal data sharing and a serious invasion of privacy, raising significant AI legal issues.

As companies race to create new and innovative technologies, they may violate data privacy laws in several states. The Lyon Firm has nearly two decades of experience defending consumer privacy and is investigating egregious AI data harvesting practices nationwide. 

If you have reason to believe any AI system is unlawfully scraping data, you can consider filing a class action AI lawsuit against a negligent company. Contact us online or call (513) 381-2333 to chat with our experienced team today. 

 

The large-scale scraping, or harvesting, of data feeds the development of AI software, yet it also runs counter to the principles of limiting data collection and protecting individual privacy. We know machine learning requires data to operate at its fullest capacity, but is AI data harvesting legal and ethical?

Is Data Scraping Legal?

Data scraping, also known as web scraping, is the process of using software to automatically extract large amounts of information from websites. These software tools mimic human browsing to access webpages, collect data, and organize it for other purposes. 
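
For illustration only, here is a minimal sketch of the kind of automated scraping described above, written in Python with the widely used requests and BeautifulSoup libraries. The URL, the robots.txt check, and the ".price" selector are hypothetical placeholders, not any particular company's tooling; even a scraper that honors robots.txt may still run afoul of a site's terms of service.

```python
# Minimal, hypothetical web scraping sketch: fetch a page, parse the HTML,
# and pull out structured fields. The URL and CSS selector are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib import robotparser

url = "https://example.com/products"

# A responsible scraper first checks the site's published robots.txt rules.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if not rp.can_fetch("*", url):
    raise SystemExit("robots.txt disallows fetching this page")

# Fetch the page much as a browser would, then extract the data of interest.
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]
print(prices)
```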

You may be asking: what is AI data scraping used for? Companies rely on data scraping for: 

  • Competitive analysis
  • Market research
  • Monitoring and comparing prices
  • Web content research 

Technically, no U.S. law categorically prohibits web scraping, and many businesses use it to gain valuable insights that improve services or inform decisions. 

However, some websites expressly prohibit data extraction in their terms of service. Also, scraping personal or copyrighted data without permission may lead to legal consequences. 

AI Web Scraping

Things get a little more complicated when artificial intelligence is involved in data scraping. Companies can feed an AI program tons of diverse data to learn from, including social media trends, market statistics, and secondary educational resources.

While AI enhances the process by enabling quicker and more advanced data extraction and analysis, it may introduce new risks. 

AI algorithms scrape massive amounts of data to learn and grow. Yet, they may infringe on copyright law and individuals’ personal privacy by using their information without their knowledge or consent. 

Opponents of this data harvesting practice argue that no web scraping process can be performed without some kind of privacy violation or data breach. As a result, several AI lawsuits have been filed against large corporate defendants. 

CONTACT THE LYON FIRM TODAY

Please complete the form below for a FREE consultation.


ABOUT THE LYON FIRM

Joseph Lyon has 17 years of experience representing individuals in complex litigation matters. He has represented individuals in every state against many of the largest companies in the world.

The Firm focuses on single-event civil cases and class actions involving corporate neglect & fraud, toxic exposure, product defects & recalls, medical malpractice, and invasion of privacy.

NO COST UNLESS WE WIN

The Firm works on contingency fees, advancing all litigation costs and accepting the full financial risk. This gives our clients full access to the legal system and reduces financial stress while they focus on their health and financial needs.

How Is AI Collecting My Data?

The chances are pretty good that something you have entered online has been ingested by an AI program. This could be something as simple as a Google review, a Facebook post, or a blog or article you wrote in the past. There are many sources of data for AI tools, which include the following:

  • Structured data: databases or spreadsheets.
  • Unstructured data: emails, social media posts, photos, and voice recordings.
  • Streaming data: data generated from sources like IoT devices, stock price feeds, and social media streams.

AI tools collect data directly from online forms, surveys, and tracking codes, and indirectly from users’ likes, shares, and comments. The software will then transform raw data by cleaning, processing, and analyzing it. The all-important data cleaning process is meant to remove or correct errors to ensure the data is accurate, consistent, and reliable.
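
As a rough, hypothetical sketch of that cleaning step (not any specific company's pipeline), here is how collected records might be deduplicated and normalized with the pandas library; the file and column names are invented for illustration.

```python
# Hypothetical sketch of the data cleaning step: remove duplicates and
# obvious errors so the data is accurate and consistent before analysis.
# The file and column names ("email", "comment") are invented placeholders.
import pandas as pd

raw = pd.read_csv("scraped_records.csv")           # raw, possibly messy records

cleaned = (
    raw.drop_duplicates()                           # drop repeated records
       .dropna(subset=["email", "comment"])         # drop rows missing key fields
       .assign(
           email=lambda df: df["email"].str.strip().str.lower(),   # normalize case
           comment=lambda df: df["comment"].str.strip(),           # trim whitespace
       )
)

cleaned.to_csv("training_ready.csv", index=False)   # consistent, analysis-ready output
```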

Consumer privacy protections should be considered during this entire process, though our attorneys suspect these considerations are often forgotten in the race to create the best AI product.

The techniques and data harvesting methods used to gather this data include the following:

  • Web Scraping: The most well-known way Big Tech has trained its AI is by gathering vast amounts of data by automatically pillaging content from websites. While some of this data is public, personal details may be collected without user consent.
  • Biometric Data: AI systems use facial recognition, fingerprinting, and other biometric technologies gathered from smartphones, work security systems, and IoT home appliances, raising substantial privacy concerns.
  • Social Media: AI algorithms analyze social media activity, capturing demographic information, consumer preferences, and other data that accumulates on an hourly basis.

AI’s Insatiable Quest for Data 

Generative artificial intelligence systems like Bard, Copilot, and ChatGPT rely on all available data for optimal machine learning. These systems gobble up news stories, fictional works, forum posts, Wikipedia articles, photos, podcasts, and YouTube videos: essentially any available text, images, audio files, and videos. 

You would think companies would want to be more selective, but the principle remains: the more information you feed a system like ChatGPT, the faster it learns and the more capable it becomes. More than ever, these models require high-quality proprietary data. 

While the internet offers vast scraping opportunities, its resources are finite. Even with large websites like Reddit and Wikipedia growing their data load daily, companies and AI programs are using the data faster than it is being produced.

Therefore, companies that train these AI tools are always looking for more data they can harvest and feed a system. To establish themselves as leaders in AI and compete with the best products, big tech companies are on a constant hunt for digital data needed to advance the technology.

Generative AI Legal Issues

So, where is this new data coming from? Over the years, big tech companies have tried to adjust their privacy policies to use all the data they collect from users legally. However, just because information is available online does not automatically grant AI systems the right to use it. 

For example, Google has admitted to training AI models on some YouTube content. Meta has said it has integrated AI into its services to use the billions of images and videos from Instagram and Facebook to help train its models. 

Below are a few ways in which generative AI data scraping can violate your privacy and legal rights: 

  • Copyright infringement: Online content like news articles, artwork, songs, and more is protected under copyright law. Using an AI system to scrape and use this content without the copyright owner's consent could lead to a legal claim.
  • Breaching terms of service: Many websites ban data scraping in their terms of service, so an AI program that scrapes such a site could be directly violating that contract. 
  • Violation of privacy rights: Many state and federal laws require notification, consent, and the ability to opt out of the handling of people's personal information. A company could be liable if an AI model fails to follow these laws or regulations.
  • Computer Fraud and Abuse Act (CFAA): This law prohibits unauthorized access to computer systems and data. If an AI system scrapes data from a website protected by authentication or a paywall, it could violate the CFAA.  

The Lyon Firm’s privacy lawyers are investigating several complaints from individuals and companies who claim the web scraping tactics of companies have been intrusive and violated their privacy rights. Some questionable data-gathering practices have prompted lawsuits over copyright and licensing.

Many fear that generative AI tools could leverage their creative and artistic works to surpass them in output, threatening their careers. In an increasingly AI-driven world, it’s crucial for those affected to consider legal action to ensure proper recognition for their work. 

These lawsuits will shape the rights of creators going forward. Don't let your private information or hard work become fodder for AI. Contact The Lyon Firm online or call (513) 381-2333 to take a stand against improper AI web scraping. 

Companies With a Web Scraping Lawsuit Filed Against Them

Several companies are caught in the middle of the debate over whether web scraping is legal. Examples where a company's use of AI to harvest data has landed it in hot water include: 

  • A class action AI privacy lawsuit was filed against GitHub, Microsoft, and OpenAI regarding Copilot, a tool that predictively generates code based on what a programmer has already written. Plaintiffs allege Copilot reproduces licensed code without providing attribution to the original programmers.

  • A class action AI lawsuit was filed against Stability AI, Midjourney, and DeviantArt, with plaintiffs alleging the programs directly infringe on individual copyrights by training the AI on works created by the plaintiffs.

  • Getty Images filed another complaint against Stability AI for allegedly copying and processing millions of images owned by Getty in the U.K.

  • Many authors are suing OpenAI for allegedly infringing on their copyrights. The complaint estimates that hundreds of thousands of books were copied into OpenAI's training data. In a similar case, The New York Times is suing OpenAI for copyright infringement, alleging that millions of articles were used to train and develop OpenAI's chatbot.

  • A class action AI lawsuit was filed against Google for allegedly misusing personal information and copyright infringement. The lawsuit says photos from dating websites, Spotify playlists, TikTok videos, and books were taken without consent to train Bard.

Examples of AI Data Lawsuit Settlements

Multiple companies have agreed to pay large sums to resolve claims of allegedly illegal data scraping, including: 

  • In June 2024, Clearview AI agreed to pay $50 million to compensate plaintiffs who accused the company of scraping publicly available images from Facebook, Venmo, and millions of other websites.

  • In April 2023, Vimeo agreed to pay $2.25 million to certain users of its AI-based video creation and editing platform Magisto to resolve claims that it collected and stored their biometric data. 

  • Google reached a $1.6 billion settlement with Singular Computing to resolve a lawsuit accusing the search giant of patent infringement during its development of artificial intelligence (AI) programs.

Copyright Dispute and AI Scraping 

The New York Times and several other newspapers have sued OpenAI and Microsoft for using copyrighted news articles without permission to train chatbots. However, OpenAI and Microsoft argue that using the articles was legal under copyright law because they altered the works for a completely different purpose. 

According to a report by The New York Times, tech companies like OpenAI, Google, and Meta have allegedly ignored corporate policies, intellectual property rights, and widespread consumer privacy concerns to obtain the all-important data.

The “fair use” doctrine allows third parties limited use of copyrighted works without permission or payment. Yet, news outlets argue these companies cannot freely use their material without consent or compensation. Furthermore, in some cases, chatbots are incorrectly citing articles and falsely attributing reporting to certain newspapers.

Are There AI Lawsuits for Copyright Infringement?

As with new technologies in the past, innovations in artificial intelligence are raising questions about how copyright law will apply to content created or used by AI. Many individuals and companies have filed lawsuits over what they believe are illegal data collection practices. 

The top chatbot systems have reportedly been trained on reams of digital text spanning trillions of words. However, the most important data for any AI model is high-quality information in books and articles, which are almost all protected by copyright. 

In the race to find large amounts of data to dump into their AI model, Meta, the parent company of Facebook and Instagram, recently discussed buying the publishing house Simon & Schuster to obtain the rights to important works.

Unfortunately, these companies are not always willing to pay. Generative AI systems have copied huge amounts of data that should require a license, infringing on copyrights. The Digital Millennium Copyright Act (DMCA) provides restrictions on the removal or alteration of copyright management information.

The U.S. Copyright Office has begun to confront the issue as it has been bombarded with complaints and questions about how generative AI outputs can be copyrighted and how generative AI might infringe on existing copyrights.

Can AI-Generated Works be Copyrighted?

The question of whether copyright protection can be afforded to AI outputs hinges partly on the concept of “authorship.” The Copyright Act affords copyright protection to “original works of authorship.” The law does not clearly define who or what may be an “author.” However, the U.S. Copyright Office recognizes copyright only in works “created by a human being.”

This makes sense, of course, but when humans author a work with the help of an AI system, gray areas emerge. As it stands, works created by humans using generative AI software can be entitled to copyright protection, but each case depends on the level of human involvement in the creative process. If AI programs are used, the applicant must also disclose that use to the Copyright Office, or the registration may be at risk.

In March 2023, the Copyright Office released guidance on the use of AI and how AI-generated material may be copyrighted when “sufficiently creative” human arrangements are combined with AI-generated material.

When AI determines the expressive elements of its output, the generated material is not the product of human authorship.

Unfortunately, this is all rather vague, leading the courts to interpret these situations differently. Right now, there are very few past court judgments to cite, and new cases will be heard on an individual basis for quite some time.

Strong proponents of artificial intelligence say AI-generated works should receive copyright protection, arguing that AI programs are very similar to other tools that human beings have used to create copyrighted works in the past.

But if you assume that some AI-created works may be eligible for copyright protection, who owns that copyright? The original forces behind the coding and training of an AI program could give an AI creator a strong claim to some form of authorship. At the moment, however, OpenAI's terms of use assign any copyright in the output to the end user.

Who Can File a Web Scraping Lawsuit?

Recent plaintiffs in AI data lawsuits have included authors, artists, and major media organizations claiming AI programs have stolen their work without consent or compensation. Many have filed claims against the companies developing these programs, alleging that they breached copyright law.

Lawsuits also do not apply only to the outright use of someone's work. Generating work in the style of someone else can raise privacy violations as well. The relevant protection is the right of publicity, which prohibits using someone's likeness, name, image, voice, or signature for commercial gain.

Can I Join a Class Action AI Data Lawsuit?

It’s not only large media conglomerates that are concerned about the use of their original material. Thousands of trade groups, artists, and authors have raised awareness of the need for more direct, clear protections of their works regarding artificial intelligence software and tools. 

While Big Tech dresses up its data harvesting methods as "transformational" and "for the greater good," individuals whose works have been used are calling it simply "institutional theft."

You do not need to be a massive corporation in order to take legal action against AI companies wrongly appropriating content. There have been multiple instances where individuals have banded together in a class action to retake control of their personal information or intellectual property. 

Can I File an AI Lawsuit?

If you believe AI data collection practices have compromised your private information, you may qualify for a case. This may be the only way to protect your personal privacy and intellectual property.

Without plaintiffs willing to file AI lawsuits, large corporate defendants will continue to collect, use, and even publicize the information of many without any risk of monetary penalty. 

By holding companies accountable for any blatant invasion of privacy, every individual will have more control over how their data is used in the future. 


Reviewing AI Data Lawsuits & AI Privacy Violations

Why Hire The Lyon Firm?

The Lyon Firm has been dedicated to protecting people’s sensitive information in the face of data breaches, cybersecurity threats, and related security challenges for nearly two decades. 

However, the rapid growth of AI technology presents new and complicated risks to personal information and intellectual property. We are deeply concerned about how AI scraping can jeopardize the security and privacy of millions. 

In this evolving and unprecedented landscape, we remain committed to doing what we have always done: defending privacy and ensuring accountability. Contact us online or call (513) 381-2333 to learn how we can help you. 

CONTACT THE LYON FIRM TODAY


Questions About AI Data Lawsuits

What is an Example of an AI Data Lawsuit?

OpenAI and Google have allegedly transcribed YouTube videos to harvest text for their competing AI systems, potentially violating the copyrights of individual video creators. Google was reportedly aware that OpenAI was actively harvesting YouTube videos for data. However, Google didn’t blink an eye because it had done the exact same thing to train its AI models. 

Even though Google owns YouTube, that particular scraping practice may have violated the copyrights of YouTube creators, and Google did not want to suddenly face a mountain of AI privacy lawsuits.

YouTube clearly prohibits individuals from using its videos for “independent” applications and also prohibits accessing its videos by “automated means (such as robots, botnets, or scrapers).”

So there are actually some legal protections in place. Yet, Big Tech largely decided that creating the biggest and best generative AI software is more important than worrying about facing a massive class action down the line.

OpenAI’s GPT-4 apparently uses at least a million hours of YouTube video content, all transcribed by a software tool. OpenAI staff allegedly knew they were trampling on the rights of creators with their scraping of YouTube content but then argued that training their product with the videos was fair use.

Management at Meta even allegedly pitched the idea of knowingly using copyrighted data from across the internet to build their competing model and face the legal consequences after the fact. 

There is no evidence this has occurred, but it is clear what the sentiment is in Big Tech board rooms, and other entities have been sued for similar data harvesting offenses.

Can I file an AI Discrimination lawsuit?

One of the major concerns about AI being fed vast amounts of data is that the programs are sometimes unaware that some data could be sensitive. Researchers attempt to clean data by removing hate speech and other unwanted text before it is used to train AI models, but this filtering has predictably failed given the huge amount of data to clean.

AI tools can also confuse completely innocent data with potentially harmful data, or with data that could be used maliciously against an individual, such as sexual orientation, political views, or health status.

AI programs have also shown difficulty in completely avoiding the stereotyping of certain groups, leading to potential systematic discrimination and bias.

Some riskier uses of this data offer real benefits but can also backfire. Consumer profiling, for example, allows companies or governments to understand their customers, employees, or citizens. However, AI profiling can infringe on an individual's privacy and threaten civil liberties. If an AI learns from biased data, it may end up perpetuating those biases, leading to discrimination in the workplace and potential AI discrimination lawsuits.

What is generative AI?

When many people think of AI right now, they are really referring to generative AI. Generative AI is the technology that powers ChatGPT, Bard, and other popular chatbot programs. Predictive AI simply analyzes existing data and can make predictions or forecasts. Generative AI, on the other hand, can create new data.

Text-based generative AI software like ChatGPT uses algorithms to predict the likely next words in text and generates output based on a prompt. The program knows what to produce because it was trained on huge quantities of data to eventually learn patterns.
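
As a toy illustration of that next-word idea (not how any production chatbot is actually built), the sketch below counts which word follows which in a tiny corpus and then predicts the most frequent continuation; real generative AI replaces these simple counts with neural networks trained on vastly more text.

```python
# Toy next-word predictor: tally which word follows which in a small corpus,
# then suggest the most frequent continuation. Real chatbots use large neural
# networks, but the "predict the likely next word" idea is the same.
from collections import Counter, defaultdict

corpus = "the court ruled the claim valid and the court awarded damages".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1          # count each word's observed successors

def predict_next(word: str) -> str:
    """Return the word most frequently observed after `word` in the corpus."""
    if word not in following:
        return "<unknown>"
    return following[word].most_common(1)[0][0]

print(predict_next("the"))   # -> "court", the most common word after "the" here
```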

Generative AI models, however, are sometimes referred to as “black box” systems, meaning nobody fully understands the exact process the machine uses to take a prompt and deliver a material output.

Like any AI, these programs are trained to excel in visual and audio perception, learning and adapting, reasoning, pattern recognition, and decision-making. One of the interesting characteristics of AI is that once the technology becomes reliable, it stops being referred to as AI and simply becomes mainstream computing technology. Simple speech recognition, for example, was once thought to be an innovative AI but is now simple software.

What is Big Data?

Tech companies, social media companies, and data brokers collect and store large amounts of data for various purposes. Big Data describes the huge amount of information produced and collected by companies that aim to profit from it.

What is Synthetic Data?

Some tech companies are so desperate for new generative data that they are developing “synthetic” information. This is text, images, and code that the existing AI software produces for itself — the AI systems are teaching themselves to a certain degree. This in itself could create some interesting problems in the future.

Will AI models be able to distinguish data that came about organically from data they created themselves during a data creation frenzy? And even if that works itself out, it is still possible that a model's "self-sufficiency" will get caught in a feedback loop and simply reinforce its own limitations and mistakes.

AI Data Privacy Violations: Does Online Privacy Exist?

The legal departments and privacy teams at Big Tech have broadened what they can legally use consumer data for. It has been clear for a long time that these companies have little concern for the privacy of their user base. 

In 2023, Google’s privacy team broadened the company’s terms of service to allow Google to access publicly available Google Docs, Google Sheets, restaurant reviews on Google Maps, and other online materials to train its AI products, like Bard. 

Also, in 2024, Google stated that it would delete “billions of data records” as part of a settlement of a lawsuit that accused the company of monitoring the browsing habits of users who thought they were browsing privately through the company’s incognito mode. Plaintiffs alleged that incognito mode was merely for appearance and did not keep those browsing sessions private at all. 

Our data privacy attorneys are reviewing AI data scraping lawsuits and privacy violations for plaintiffs in all fifty states. We have the resources, the experience, and the willingness to take on America’s largest tech corporations and fight for your privacy rights and intellectual property concerns. 

What Are the Current Restrictions on Companies?

Under current consumer data privacy law, companies have several obligations that include:

  • Data Minimization: limiting the collection of personal data to only what is adequate, relevant, and necessary for the purpose of the data collection.
  • Consent: express consent must be obtained when the business processes sensitive data or deviates from its own privacy policy.
  • Purpose Limitations: personal data may only be processed for purposes necessary or compatible with the purposes disclosed in a company’s privacy policy.
  • Security: businesses must establish and maintain reasonable data security practices to protect consumers' personal data. Businesses must also evaluate the risks associated with the sale of personal data, the processing of data, and the distribution of data to marketing firms.

Do AI Privacy Laws Exist?

Lawmakers have been slow to regulate data scraping, echoing their delays on privacy protections over the last 30 years. It makes sense to limit any third party's collection and storage of data to what is essential. And, of course, that data should only be collected with informed consent, with the purpose and retention period explained before the data is destroyed, as some new privacy laws require. 

Cyberattacks and data breaches highlight the importance of limiting the amount of data stored by any one company. IT networks can be so vulnerable that all stored data is effectively at risk. Therefore, the best way to protect data may be to limit what is collected in the first place and effectively develop clear legislative guidance. 

The EU is one step ahead of the U.S. on the legislation front. It passed the Artificial Intelligence Act in March 2024, which is built on the principle that AI systems must be trained, designed, and developed with modern safeguards against generating content that violates existing privacy laws.

Lawmakers and consumer privacy attorneys have highlighted specific consumer data privacy rights, which include:

  • Right to Access
  • Right to Rectification
  • Right to Deletion
  • Right to Data Portability
  • Right to Object to Data Processing
  • Right to be Free from Discrimination

Fight For Your Right to Personal Privacy

Why Hire The Lyon Firm?

Filing a class action AI lawsuit is a complex and serious legal undertaking, and it is prudent to hire an experienced legal team. The Lyon Firm is dedicated to helping plaintiffs pursue financial compensation for the damages they have sustained.

We work with AI industry experts across the country to provide the most resources possible and to build your case into a valuable settlement. The current legal environment is favorable for consumers involved in AI data harvesting class actions, data misuse claims, and other personal privacy matters. 

Joe Lyon has twenty years of experience filing class action lawsuits on behalf of Americans in all fifty states. Contact our legal team for a free consultation to review your potential data privacy or AI Lawsuit.