OpenAI’s ChatGPT is a popular chatbot trained on the vast, imperfect glory of the public internet, and that provenance has led it into numerous embarrassing mistakes. A lawyer who recently used the chatbot to write a court brief discovered as much when it cited six nonexistent cases. To improve accuracy, OpenAI needs to train the chatbot on better-quality data, which presents the tantalizing possibility of a new revenue stream for publishers and any other company that owns valuable, accurate text suitable for training language models.
The Role of Academic Books and Journals
One of the best sources of information for training ChatGPT would be academic books and journals, which concentrate expertise in business, medicine, economics, and more. The scuttlebutt in the AI field for months has been that a large chunk of GPT-4’s training data came from Reddit. Last month, however, the popular internet forum said it would start charging companies to access its trove of conversations, which has book publishers wondering whether they could do the same for their back catalogs. OpenAI may have to look beyond the public internet to teach the next iteration of ChatGPT. The online datasets it was trained on have held fairly reliable data so far, but now that ChatGPT is a public sensation, those datasets face being spammed with junk designed to skew a chatbot’s results. OpenAI may well need to look further afield and start paying for its next round of training data.
The Emergence of a Market for Data
Many companies are buying access to language models like GPT-4 and then tweaking them with specialist data for their own purposes. In one sense, this all points to a thriving market for data: in a year or two, we could see an array of insurance firms, banks, and medical companies buying and selling data to build specialized alternatives to ChatGPT. But this market could also move in a darker direction, one dominated by incumbent technology firms. That will depend on whether OpenAI and Google build language models that can do anything for anyone. General-purpose bots could supplant niche bots, and if data prices climb too high, those niche bots will also become harder to build.
Investment banks in particular, eager to help their clients do smarter investment research, have been building sophisticated chatbots and training them on data from companies in the insurance, freight, telecommunications, and retail industries. Virtually no one outside big tech firms like OpenAI and Google is actually building the underlying language models from scratch; instead, companies license those models and adapt them with their own specialist data.
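To make “tweaking a model with specialist data” concrete, here is a minimal sketch of the first step: converting proprietary Q&A pairs into the chat-style JSONL training format that fine-tuning services such as OpenAI’s accept. The example records and field contents are hypothetical, for illustration only.

```python
import json

# Hypothetical specialist Q&A pairs a firm might own -- illustrative only.
examples = [
    {"question": "What does a combined ratio above 100% indicate?",
     "answer": "The insurer is paying out more in claims and expenses than it earns in premiums."},
    {"question": "What is demurrage in freight shipping?",
     "answer": "A charge levied when cargo occupies a port or container beyond the agreed free time."},
]

def to_jsonl(pairs, system_prompt="You are a domain expert assistant."):
    """Render each Q&A pair as one JSON object per line, in the
    chat-message structure used for conversational fine-tuning."""
    lines = []
    for pair in pairs:
        record = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_jsonl(examples)
print(jsonl)

# The resulting file would then be uploaded and a fine-tuning job started
# through the provider's API; that step needs an account and is omitted here.
```

The point of the format is that each line is a complete, self-contained conversation, so the provider can shuffle and batch examples independently during training.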
Data sales of this kind have emerged only in the last three months, yet such transactions already make up about 15 percent of the total volume on online marketplaces like Nomad, with prices ranging from tens of thousands to millions of dollars. Companies whose data is unique and in high demand tend to be in a stronger selling position.
Keith Peiris, co-founder and CEO of Tome, an AI tool for generating stories, believes that the larger tech firms “are always going to be able to spend more on compute [and data] than we can. Odds are they will win because of capital, not necessarily because of innovation.”
The future of AI and big data is bright, and the market for training data is growing. But a market dominated by incumbent technology firms would pose a real threat to smaller companies. Either way, the demand for better-quality data to train chatbots opens a genuine new revenue stream for publishers and for any company sitting on valuable, accurate text.