On Fair Use of Training Data for Large Language Models
Date: 2025-10-08

Author Information
Li Mingxuan, School of Interdisciplinary Studies, and Big Data and Responsible Artificial Intelligence for National Governance, Renmin University of China
Abstract
The primary source of training data for large language models is publicly available data on the internet, which developers typically collect at scale through web crawling and by aggregating open-source datasets. As the protection of data property rights is increasingly strengthened, however, the legality of this practice faces growing challenges. The large number of data rightsholders and the difficulty of tracing data usage drive up transaction costs, making it impractical for developers to obtain individual licenses through market mechanisms and thereby ensure lawful use of training data. In this context of market failure, permitting fair use of data for training large language models can increase social welfare and generally does not harm the market interests of data rightsholders. Alternatives such as collective management or statutory licensing offer rightsholders limited benefits while imposing higher institutional costs, and they may hinder the development of large language models in China. A fair use rule for training data should therefore be established to provide legal certainty for technological innovation. In terms of rule design, fair use should be limited to publicly available data, be restricted to the purpose of pre-training, cover the data processing involved in training, and allow data rightsholders to opt out through technical measures.
Keywords: Large Language Models; Training Data; Data Property Rights and Interests; Fair Use; Market Failure