ChatGPT is getting a Dutch sister: research organisation TNO, the Netherlands Forensic Institute (NFI), and SURF, the ICT cooperative for Dutch education and research, are putting €13.5 million into GPT-NL. The hope is that this model will become a “transparent, fair, and verifiable” alternative for Dutch students and scientists to work with. But are these achievable goals?
Driving the plan for a Dutch GPT are concerns that have surrounded ChatGPT for some time. Data is widely used without the owners’ permission, and the model is anything but transparent, because OpenAI, the American company behind it, refuses to reveal what is driving its success.
GPT-NL wants to do things completely differently. The model will only be trained on data for which permission has been obtained, or for which the owner no longer exists and the copyright has expired. This should make it clear to everyone where GPT-NL’s data comes from.
I think that promise is too broad. Let me say first that I do not condone the way OpenAI violates copyright. At the same time, I cannot deny that the huge amount of data collected this way is exactly what drives ChatGPT.
I previously wrote that ChatGPT has been trained on more than eight million web pages. Suppose you train GPT-NL on one million Dutch pages. How do you plan to ask permission for all of them? I envision a call center full of students hanging on the line with de Volkskrant and Het Parool. Maybe that is just my imagination and deals will be struck, but newspapers are not charities. Do the GPT-NL researchers intend to pay for every article? If so, €13.5 million will be burned through quickly.
GPT-NL admits that the budget is a pittance compared to that of the tech giants, and dampens expectations accordingly. But no worries: the researchers promise that their model’s transparency more than makes up for it.
A fully transparent ChatGPT? The lack of transparency sits in the basic architecture, in the model’s very building blocks, and it remains difficult to work out the criteria on which such a model bases its choices. Research shows that even the best large language models score poorly on transparency.
If we want a transparent language model so badly, wouldn’t we be better off investing time in scientific research into a different kind of model than ChatGPT’s impenetrable black box? Do we really want to pour so many man-hours into a ChatGPT clone whose sales pitch we are already watering down? And as a student, would you want to build your thesis on a model that you know is nowhere near the technological state of the art?
There should be solutions to ChatGPT’s problems, but we should not look for them in replicating a model whose foundation is already unreliable. We have so many talented students and scientists. So in this case: better well thought out than badly copied.
Pepijn Stoop is a student of artificial intelligence at the University of Amsterdam (UvA).