Although Google is initiating a conversation about giving credit and complying with copyright while developing large language models (LLMs) for generative AI products, its approach seems to center on the robots.txt file.
However, some disagree with this approach, arguing that the robots.txt file is not the right place to solve the problem.
In a compelling article, “Crawlers, Search Engines, and the Sleaze of Generative AI Companies,” Pierre Far sheds light on the significant challenges currently facing the online publishing industry.
Exploring Alternatives: Considering Options Beyond Robots.txt
The discussion on how to respect publishers’ copyright should not begin with robots.txt, for several reasons:
- Not all crawlers that feed LLMs identify themselves
The responsibility lies with the website operator to detect and block specific crawlers that might utilize or commercialize their data for generative AI products. This places an additional and often unnecessary workload on smaller publishers.
Furthermore, this assumption presupposes that publishers have the capability to edit their robots.txt file, which may not always be the case with hosted solutions.
- This approach is not viable in the long run, given the continuous growth in the number of crawlers
The robots.txt standard (RFC 9309) only requires crawlers to parse the first 500 kibibytes (KiB) of the file, effectively capping its usable size.
As a result, significant challenges may arise for large publishers who need to block numerous LLM crawlers and/or refined URL patterns, in addition to managing other bots, potentially leading to issues with their robots.txt file.
- It’s important to embrace a more balanced perspective than the ‘all or nothing’ approach
In the case of larger crawlers such as Googlebot and Bingbot, it becomes challenging to differentiate between data utilized for search engine results pages, where there is typically an “agreement” with the publisher through citation to the original source, and data used for generative AI products.
Blocking Googlebot or Bingbot due to their generative AI products also means losing potential visibility in their respective search results. This creates a dilemma for publishers, where they’re compelled to make a difficult choice between complete access or complete blocking, which is an unacceptable situation.
- Robots.txt is focused on managing crawling, while the copyright discussion is centered on data utilization
The focus here should be on the indexation and processing phase rather than robots.txt, which is more of a last resort if all other options fail. It is not the ideal starting point for this discussion.
Robots.txt files are effective for regular crawlers and do not require modification for LLMs.
While LLM crawlers must identify themselves, the primary topic of concern should be the indexation/processing of the data they crawl.
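To make the per-crawler blocking burden concrete, here is a minimal sketch using Python’s standard `urllib.robotparser`. The user-agent tokens are illustrative examples of AI crawlers, not an exhaustive or current list, and the size check reflects the 500 KiB parsing limit mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt a publisher might maintain to block AI crawlers.
# Every new crawler means another entry to research and add by hand;
# that is the maintenance burden described above. Tokens are examples only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# The file must stay within the 500 KiB parsing limit of RFC 9309,
# or crawlers may ignore everything past that point.
assert len(ROBOTS_TXT.encode("utf-8")) <= 500 * 1024

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Named AI crawlers are blocked; everything else still gets through.
print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("SomeSearchBot", "https://example.com/article"))  # True
```

Note that a crawler that does not identify itself, or a new one the publisher has not heard of yet, sails straight past this file.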
Creating a Better Approach
Fortunately, the web has well-established solutions to address data usage and copyright management concerns. These solutions are known as Creative Commons licenses.
The following Creative Commons licenses are relevant to LLM use cases:
- CC0 license: Allows LLMs to freely distribute, remix, adapt, and build upon the material without any conditions or restrictions.
- CC BY license: Allows LLMs to distribute, remix, adapt, and build upon the material in any medium or format, with the requirement of attributing credit to the creator. Commercial use is permitted as long as proper credit is given.
- CC BY-SA license: Allows LLMs to distribute, remix, adapt, and build upon the material, requiring attribution to the creator. Commercial use is allowed, but any modified material must be licensed under the same terms.
- CC BY-NC license: Allows LLMs to distribute, remix, adapt, and build upon the material for noncommercial purposes only, with attribution to the creator.
- CC BY-NC-SA license: Allows LLMs to distribute, remix, adapt, and build upon the material for noncommercial purposes only, with attribution to the creator. Any modified material must be licensed under identical terms.
- CC BY-ND license: Allows LLMs to copy and distribute the material in its original form, providing credit to the creator. Commercial use is allowed, but no derivatives or adaptations are permitted.
- CC BY-NC-ND license: Allows LLMs to copy and distribute the material in its original form for noncommercial purposes only, providing credit to the creator. No derivatives or adaptations of the work are allowed.
CC BY-ND and CC BY-NC-ND licenses aren’t ideal for LLMs. On the other hand, CC0, CC BY, CC BY-SA, CC BY-NC, and CC BY-NC-SA licenses require LLMs to be mindful of how they use the crawled or obtained data, ensuring compliance with publishers’ requirements, including proper attribution and, where required, sharing products derived from the data under the same terms.
This places the responsibility on a limited number of LLMs rather than the numerous publishers.
CC0, CC BY, and CC BY-SA licenses support “traditional” data usage, like search engine results with proper attribution, while CC BY-NC and CC BY-NC-SA licenses still facilitate research and development for open-source LLMs.
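Under the article’s framing, where training a model counts as adapting the material, the license breakdown above could be encoded as a simple filter in a crawling pipeline. The sketch below is illustrative: the license-to-permission mapping follows the descriptions above, and the document records are hypothetical.

```python
# License groups per the descriptions above. Whether training is legally
# an "adaptation" is contested; this follows the article's interpretation.
COMMERCIAL_OK = {"CC0", "CC BY", "CC BY-SA"}
NONCOMMERCIAL_ONLY = {"CC BY-NC", "CC BY-NC-SA"}
NO_DERIVATIVES = {"CC BY-ND", "CC BY-NC-ND"}

def usable_for_training(license_id: str, commercial: bool) -> bool:
    """Return True if material under this license may feed model training
    for the given (commercial or noncommercial) use."""
    if license_id in NO_DERIVATIVES:
        return False  # training produces adapted output
    if license_id in NONCOMMERCIAL_ONLY:
        return not commercial
    return license_id in COMMERCIAL_OK

# Hypothetical crawled records; in practice the license would come from
# page metadata rather than a hard-coded field.
docs = [
    {"url": "https://example.com/a", "license": "CC BY"},
    {"url": "https://example.com/b", "license": "CC BY-NC"},
    {"url": "https://example.com/c", "license": "CC BY-ND"},
]

commercial_corpus = [
    d for d in docs if usable_for_training(d["license"], commercial=True)
]
research_corpus = [
    d for d in docs if usable_for_training(d["license"], commercial=False)
]
```

A handful of LLM operators maintaining one filter like this is a far smaller burden than millions of publishers each maintaining a blocklist.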
Harnessing the Power of Meta Tags
Identifying an appropriate license is not enough. It needs to be effectively communicated, and robots.txt is not the best approach for this. Also, blocking a page from search engine crawling does not mean it can’t be useful for LLMs, as the use cases are different.
To address this, using a meta tag is recommended, allowing a more refined and accessible approach for publishers. Meta tags can be inserted at the page level, within themes, or within content, and do not require extensive access rights.
While meta tags don’t stop crawling, they help communicate the data’s usage rights. Existing copyright tags like Dublin Core, rights-standard, and copyright-meta may not fully serve the purpose, making a new meta tag potentially necessary.
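As a sketch of how such a tag could be consumed, the snippet below extracts a hypothetical `usage-rights` meta tag using Python’s standard `html.parser`. The tag name is an illustration of the new meta tag proposed above, not an existing standard.

```python
from html.parser import HTMLParser

class MetaRightsParser(HTMLParser):
    """Collect the content of a hypothetical usage-rights meta tag."""

    def __init__(self):
        super().__init__()
        self.rights = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            # "usage-rights" is an illustrative name for the proposed tag.
            if a.get("name") == "usage-rights":
                self.rights = a.get("content")

PAGE = (
    '<html><head>'
    '<meta name="usage-rights" content="CC BY-NC">'
    '</head><body>Article text</body></html>'
)

p = MetaRightsParser()
p.feed(PAGE)
print(p.rights)  # CC BY-NC
```

A publisher could drop such a tag into a theme’s header template without needing any access to robots.txt, and an LLM operator would read it at indexation/processing time rather than at crawl time.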