This article is part of our coverage of the latest in AI research.
This week, the BigScience research project released BLOOM, a large language model that, at first glance, looks like another attempt to reproduce OpenAI’s GPT-3.
But what makes BLOOM different from other LLMs is the effort that went into researching, developing, training, and releasing the machine learning model.
While big tech companies have guarded their LLMs like trade secrets in recent years, BigScience has put transparency and openness at the center of BLOOM since the beginning of the project.
The result is a large language model that is available to everyone and highly accessible for research and study. The open-source, open-collaboration example that BLOOM has set can be very beneficial to the future of research in LLMs and other areas of artificial intelligence. But some of the challenges inherent to large language models remain to be solved.
BLOOM stands for “BigScience Large Open-science Open-access Multilingual Language Model.” On paper, it doesn’t look much different from GPT-3 or OPT-175B. It is a very large transformer model with 176 billion parameters that has been trained on 1.6 terabytes of data, including natural language and software source code.
Like GPT-3, it can perform many tasks with zero- and few-shot learning, including text generation, summarization, question answering, and programming.
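As a rough illustration of what zero-shot prompting looks like in practice, here is a minimal sketch using the Hugging Face `transformers` library. It assumes one of the smaller public BLOOM checkpoints (`bigscience/bloom-560m` is used here for illustration), since the full 176-billion-parameter model needs far more memory than a typical workstation has:

```python
# Minimal sketch: zero-shot prompting a BLOOM checkpoint with Hugging Face transformers.
# Assumes the smaller "bigscience/bloom-560m" checkpoint; the full model is far too large
# to load on ordinary hardware.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-560m"  # illustrative choice, not the full 176B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A zero-shot prompt: the task is described in plain text, with no fine-tuning.
prompt = "Translate to French: 'The weather is nice today.'\nFrench:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern extends to few-shot learning by prepending a handful of worked examples to the prompt instead of a bare task description.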
But what makes BLOOM significant is the organization behind it and the process that went into building it.
BigScience is a research project that was bootstrapped in 2021 by Hugging Face, the popular hub for machine learning models. According to its website, the project “aims to demonstrate another way of creating, studying, and sharing large language models and large research artefacts in general within the AI/NLP research communities.”
In this regard, BigScience takes “inspiration from scientific creation schemes such as CERN and the LHC, in which open scientific collaborations facilitate the creation of large-scale artefacts that are useful for the entire research community.”
In the span of a year, starting from May 2021, more than 1,000 researchers from 60 countries and more than 250 institutions worked together at BigScience to create BLOOM.
While most major LLMs have been trained exclusively on English text, BLOOM’s training corpus includes 46 natural languages and 13 programming languages. This makes it useful for the many regions where English is not the main language.
BLOOM is also a break from the de facto reliance on big tech to train models. One of the main problems of LLMs is the prohibitive costs of training and tuning them. This hurdle has made 100-billion-parameter LLMs the exclusive domain of big tech companies with deep pockets. Recent years have seen AI labs gravitate toward big tech to gain access to subsidized cloud compute resources and fund their research.
In contrast, BigScience received a €3 million grant from the Centre National de la Recherche Scientifique (French National Center for Scientific Research) to train BLOOM on the Jean Zay supercomputer. There were no deals giving commercial companies an exclusive license to the technology, and no commitment to commercialize the model and turn it into a profitable product.
Furthermore, the team has been completely transparent about the entire process of training the model. They have published the dataset, the meeting notes, discussions, and code, as well as the logs and technical details of training the model.
Researchers are studying the model’s data and metadata and publishing interesting findings.
One researcher tweeted: “I've been playing with the training dataset behind the extremely cool new BLOOM model from @BigscienceW and @huggingface. Here's a sample of 10 million chunks from the English-language corpus, about 1.25% (!!) of the total. Encoded with `all-distilroberta-v1`, then UMAP to 2d.” pic.twitter.com/a00zBWw83c
And of course, the trained model itself is available for download on Hugging Face’s platform, which relieves researchers of the pain of spending millions of dollars on training.
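As a sketch of what that access looks like in practice, the published weights can be fetched directly from the Hub with the `huggingface_hub` client (assuming the `bigscience/bloom` repository and enough disk space, since the full checkpoint is hundreds of gigabytes):

```python
# Minimal sketch: downloading the published BLOOM weights from the Hugging Face Hub.
# Assumes the "bigscience/bloom" repository; the full checkpoint is hundreds of gigabytes,
# so this is only practical on machines with ample disk space.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bigscience/bloom")
print(f"Checkpoint downloaded to: {local_dir}")
```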
Last month, Facebook open-sourced one of its LLMs under some restrictions. However, the level of transparency that BLOOM brings is unprecedented and will hopefully set a new standard for the industry.
“BLOOM is a demonstration that the most powerful AI models can be trained and released by the broader research community with accountability and in an actual open way, in contrast to the typical secrecy of industrial AI research labs,” said BLOOM Training co-lead, Teven Le Scao.
While the efforts of BigScience to bring openness and transparency to AI research and large language models are commendable, the inherent challenges of the field remain unchanged.
LLM research is trending toward bigger and bigger models, which will further increase the costs of training and running them. BLOOM was trained on 384 Nvidia A100 GPUs (roughly $32,000 each, or around $12 million worth of hardware at list prices). Larger models will require even larger compute clusters. BigScience has declared that it will continue to create other open-source LLMs, but it remains to be seen how it will fund its increasingly costly research. (OpenAI, which started out as a non-profit organization, ended up becoming a for-profit company that sells products and relies on funding from Microsoft.)
Another problem that remains to be solved is the huge cost of running the models. The compressed BLOOM model is 227 gigabytes, and running it requires specialized hardware with hundreds of gigabytes of VRAM. For comparison, GPT-3 requires a computing cluster equivalent to an Nvidia DGX-2, which is priced at around $400,000. Hugging Face plans to launch an API platform that enables researchers to use the model for around $40 per hour, which is not a trivial cost.
The costs of running BLOOM will also affect the applied ML community, startups and organizations that want to build products powered by LLMs. Currently, the GPT-3 API offered by OpenAI is much more attuned to product development. It will be interesting to see which directions BigScience and Hugging Face will take to enable developers to create products on top of their valuable research.
In this regard, I’m looking forward to the smaller versions of the model that BigScience plans to release in the future. Contrary to the way they are often portrayed in the media, LLMs still follow the “no free lunch” theorem. This means that when it comes to applied ML, a more compact model that has been fine-tuned for a specific task is more efficient than a very large model that has average performance on many tasks. An example is Codex, a modified version of GPT-3 that provides superb programming assistance at a fraction of GPT-3’s size and cost. GitHub currently offers Copilot, a product built on Codex, for $10 per month.
With the new culture that BLOOM hopes to establish, it will be interesting to see which directions academic and applied AI will take in the future.