Can Generative AI Collect Our Data From The Internet?

ChatGPT can write sonnets, code websites, and even pass the bar exam. It learned how to do this by training on huge amounts of data. A lot of this data is personal information about individuals scraped from the Internet, often without their knowledge.

Catching on to this, last month, Italy’s data protection regulator halted ChatGPT’s operations over a breach of its data protection norms.

India is still finalising its data protection law. Against the backdrop of Italy’s action, we discuss how ChatGPT would fare under India’s proposed law, and whether there are lessons for us to draw from this episode.

ChatGPT under the scanner across the EU

Italy’s ban on ChatGPT was prompted by several concerns:

  • There was no legal basis to justify the massive collection of data to train ChatGPT’s algorithms.
  • OpenAI did not have appropriate age-gating mechanisms to ensure that children’s data was not collected to train algorithms.
  • The company didn’t give people adequate notice before collecting their data.
  • ChatGPT gave out factually incorrect information.

Italy had also earlier restricted “Replika”, an AI-powered chatbot, on similar grounds. Taking a cue from Italy, regulators in Germany, Spain, France, and Ireland are exploring actions.

Italy has now asked OpenAI to abide by certain norms for the ban to be lifted. OpenAI must publish information about its data processing and must clarify the legal basis for processing personal data to train its AI. It must allow users to seek correction of inaccurate data or its deletion, and allow users to object to OpenAI’s use of their personal data to train its algorithms.

While Italy’s approach raises several interesting questions, we focus on one key issue – training AI models using data that’s available freely and publicly. Think public social media profiles, news pieces, Reddit posts, and so on.

Is data from public sources ‘private’? 

ChatGPT’s technical paper says its training data includes “publicly available personal information”. Under EU law, any data that can identify an individual is ‘personal information’. To collect and use such data, a business must meet privacy norms, regardless of whether the data is collected from the individual directly or is available publicly and freely.

Interestingly, under India’s current data protection law – the rules under the Information Technology Act – data that is “freely available” or “accessible in public domain” is not considered sensitive data. So, to collect and use such publicly available information, you need not abide by the data protection rules.

But the draft Digital Personal Data Protection Bill 2022 (India’s current draft data protection law) takes a different position, one that’s similar to the EU approach. Even if you collect data from public sources, if it relates to an identifiable individual, it is ‘personal’. And all the do’s and don’ts that attach to the collection and use of personal data apply to it (with one exception, around deemed consent).

How can data be collected and used to train AI models? 

In the EU, even if a business is collecting or scraping personal information off the Internet, it must still justify its collection and use under one of six legal ‘bases’ set out in the GDPR. User consent is one basis. Another is fulfilling a contract. But the one that is often used for training AI algorithms or for improving a product is the “legitimate interests” of a business.

India’s draft law, by contrast, does not set out multiple legal bases. To collect and use personal data, a platform must get users’ consent or rely on deemed consent. That is, either you get actual consent from individuals, or your collection and use of data falls within one of the ‘deemed consent’ grounds recognised in law – such as processing data to comply with a court order, to respond to a medical emergency or a public health response, or for ‘reasonable purposes’ recognised by the Indian government.

‘Deemed consent’ may help in training AI 

Taking repeated consent to collect data for training AI models is cumbersome. So developers are likely to consider two “deemed consent” grounds that could be relevant here.

One, under the draft law, consent can be assumed when you are processing “publicly available personal data” in “public interest”. Say a platform scoops up a public Reddit thread where users discuss their worst dating encounters, to train its algorithm. Does the AI developer then not need to take users’ consent separately to process this data, since it is publicly available?

Two, consent can be inferred when an individual voluntarily provides her information and can be reasonably expected to do so. For example, a user signs up on Reddit. Reddit’s privacy policy says: “Much of the information on the Services is public and accessible to everyone, even without an account. By using the Services, you are directing us to share this information publicly and freely.” Can the user’s catch-all consent to the privacy policy be considered consent to the sharing of their data with AI models like ChatGPT, and to OpenAI’s use of that data for training algorithms?

Interestingly, platforms like Reddit are going to start charging AI developers for accessing their content. But the question of consent or deemed consent would remain.

Using data to train AI models – a reasonable purpose?

As India seeks to establish itself as an AI powerhouse, it would be worth exploring whether the use of data to train AI models should be a ‘reasonable purpose’ under India’s data protection law. This should be subject, of course, to appropriate checks and balances. For instance, similar to Italy’s guidance, individuals could be given the right to object to the use of their personal data for training AI models (an opt-out rather than an opt-in).

This post has been authored by Sreenidhi Srinivasan (Partner) and Pallavi Sondhi (Senior Associate).
