Use ChatGPT less and support open source more, urges a New York University professor in Nature: for the future of the scientific community

For the future of science, join the open source LLM camp!

The free ChatGPT is impressive to use, but the biggest drawback of this language model is that it is not open source: the outside world cannot learn what training data sits behind it, or whether it might leak user privacy. Meanwhile, the community has been jointly open-sourcing LLaMA and the series of “alpaca-family” models derived from it.

Recently, in Nature’s World View column, Arthur Spirling, a professor of politics and data science at New York University, called on researchers to use open-source models more, so that experimental results stay reproducible and research stays in line with academic ethics.

The point is that if OpenAI one day decides to shut down its language-model API, or to raise prices on the strength of its closed monopoly, users will only be able to say, helplessly, “In the end, academia lost to capital.”

The author of the article, Arthur Spirling, will join Princeton University this July to teach political science. His main research interests are political methodology and legislative behavior, specifically the application of text-as-data methods, natural language processing, Bayesian statistics, machine learning, item response theory, and generalized linear models to political science.

Researchers should resist the temptation of commercial models and jointly develop transparent large language models to ensure reproducibility.

Embrace open source and reject monopoly

It seems like a brand-new large language model (LLM) launches every day, and each time its creators and academic stakeholders proclaim how fluently the new model can communicate with humans, for example by helping users modify code or by writing recommendation letters and article abstracts.

As a political and data scientist who uses these models and teaches others how to use them, I think academics should be vigilant: the most popular language models are still private and closed, that is, run by companies that disclose no specific information about the base model. The models’ capabilities cannot be independently checked or validated, so researchers and the public do not know which documents were used to train them.

Rushing to incorporate language models into one’s own research pipeline can be problematic, potentially threatening hard-won progress in “research ethics” and “reproducibility of results.”

Not only should researchers avoid relying on commercial models; they must also work together to develop open-source large language models that are transparent and independent of the interests of any particular company.

Although commercial models are convenient and work out of the box, investing in open-source language models is the way history is heading: the task now is to find ways to advance their development and apply them in future research.

I am optimistic that the future of language-model tooling will be open source, echoing the history of statistical software: commercial statistical software was popular at first, but today almost the entire community works on open-source platforms such as R or Python.

Take BLOOM, the open-source language model released last July. Its developer, Hugging Face, a New York-based artificial intelligence company, built it together with more than a thousand volunteer researchers, with part of the development funding provided by the French government; other teams are also working on open-sourcing large language models.

I think open source projects like this are great, but we need more collaboration and pooling of international resources and expertise.

Teams that open-source large language models are usually not as well funded as big companies, and they need to keep operating continuously to track the latest developments in the field: AI is moving so fast that most language models become outdated within weeks or months of release.

Therefore, the more scholars participate in open source, the better open-source models will become.

Using open-source LLMs is crucial for reproducible research, because the owners of closed-source commercial language models can change their product or its training data at any time, which may change the results the model generates.

For example, one research group might publish a paper testing whether wording suggested by a commercial language model helps clinicians communicate more effectively with patients. If another group then tries to replicate the study, who knows what training data the model now rests on, whether it is the same as before, or whether the model is even still in operation?

GPT-3, previously a common auxiliary tool for researchers, has already been superseded by GPT-4, and any research built on the GPT-3 API may be irreproducible in the future; for the company, keeping old models running is not a high priority.
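With an open checkpoint, that kind of reproducibility can be built directly into a research pipeline. Here is a minimal sketch, assuming the Hugging Face `transformers` library and the publicly hosted `bigscience/bloom-560m` checkpoint (a small BLOOM variant, chosen purely for illustration): it pins the model to an exact repository revision and decodes greedily, so a later replication loads the same weights and gets the same output.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin the checkpoint to an exact Git revision on the Hugging Face Hub so a
# replication study loads identical weights later. "main" is used here for
# illustration; in practice you would record a specific commit hash.
MODEL_ID = "bigscience/bloom-560m"
REVISION = "main"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REVISION)

# Greedy decoding (do_sample=False) keeps the generated text deterministic
# given the pinned weights.
inputs = tokenizer("Open models make research reproducible because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

No commercial API offers an equivalent guarantee: even if the endpoint name stays the same, the weights behind it can change silently.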

In contrast, with an open-source LLM, researchers can view the model’s internal architecture and weights, understand how it operates, customize the code, and point out errors. These details include the model’s adjustable parameters and the data used to train it, and community participation and oversight help keep such a model robust over the long term.
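As a concrete illustration of that transparency, a few lines of Python are enough to read off an open model’s architecture and enumerate its weights, none of which a closed API exposes. A minimal sketch, again assuming `transformers` and the small `bigscience/bloom-560m` checkpoint:

```python
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "bigscience/bloom-560m"

# The published configuration spells out the architecture: layer count,
# hidden size, number of attention heads, vocabulary size.
config = AutoConfig.from_pretrained(MODEL_ID)
print(config.n_layer, config.hidden_size, config.n_head, config.vocab_size)

# The weights themselves can be downloaded and audited tensor by tensor.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")
for name, param in list(model.named_parameters())[:5]:
    print(name, tuple(param.shape))
```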

The use of commercial language models in scientific research also has negative implications for research ethics, as the text used to train these models is unknown and may include direct messages between users on social media platforms or content written by children.

While those who produce public text may have agreed to the platform’s terms of service, this may not be the standard of informed consent researchers would like to see.

In my opinion, scientists should stay away from these commercial models in their own work as far as possible, move to open language models instead, and help make them available to others.

Nor do I think academics, especially those with large social-media followings, should push others to use commercial models; if prices skyrocket or the companies fail, researchers may come to regret having promoted the technology to colleagues.

Researchers can currently turn to open language models produced by private organizations, such as LLaMA, open-sourced by Meta, the parent company of Facebook. LLaMA was initially released only to approved applicants, but the full model subsequently leaked online. Meta’s open language model OPT-175B is another option.

The long-term downside is that the release of these models relies too much on the benevolence of companies, which is a precarious situation.

Beyond that, there should be academic codes of conduct for working with language models, along with corresponding regulation. But these will take time, and based on my experience as a political scientist, I expect such regulation to be quite imperfect at first and slow to take effect.

At the same time, large collaborative projects to train open-source language models for research urgently need support, on the model of CERN, the international particle-physics organization, and governments should increase funding through grants.

The field is developing at lightning speed, and coordination of domestic and international support needs to begin now.

The scientific community needs to be able to assess the risks of the resulting models, and public releases should be made cautiously, but an open environment is clearly the right way forward.


This article comes from the WeChat public account: Xin Zhiyuan (ID: AI_era)