Introduction:
Language models like GPT-4 and Claude are effective and practical, but the data used to train them is kept behind lock and key. By releasing an enormous new text dataset for free, open scrutiny, the Allen Institute for AI (AI2) hopes to buck this trend.
The dataset, known as Dolma (short for “Data to feed OLMo’s Appetite”), is meant to serve as the foundation for the research team’s envisioned open language model, or OLMo. Researchers from AI2 contend that the dataset they used to develop the model should also be freely usable and modifiable by the AI research community.
This is the first “data artefact” that AI2 is making available for OLMo. In a blog post, the company’s Luca Soldaini describes the team’s decision to apply multiple processing steps to make the data digestible for AI training. (They note at the outset that a more thorough paper is in the works.)
Although organizations like OpenAI and Meta publicly release some key dataset statistics, much of the data remains confidential. Beyond the known effect of deterring outside inspection and improvement, there is suspicion that this closed approach reflects data that was not gathered ethically or lawfully: for example, many writers’ books consumed as pirated copies.
It makes sense that these businesses would want to keep the details of their model training techniques under wraps in a fiercely competitive AI market. However, it makes the data and models more opaque, and harder for academics outside those companies to examine or reproduce.
AI2 Drops Biggest Open Dataset Yet for Training:
[Image source: Techcrunch.com]
Dolma, by contrast, is meant to be fully transparent: all of its sources and processing methods, such as how and why it was pared down to original English-language texts, are openly disclosed.
Although it is not the first open dataset experiment, it is by far the largest (3 trillion tokens, tokens being an AI-native measure of content volume) and, according to the developers, the most permissive in its terms of use. It is released under AI2’s “ImpACT licence for medium-risk artefacts.” In essence, potential Dolma users must:
- Provide contact details and any intended use cases.
- Identify any works derived from Dolma.
- Distribute those derivative works under the same licence.
- Agree not to use Dolma in various prohibited areas, such as surveillance or disinformation.
For those concerned that, despite AI2’s best efforts, some of their personal data may have made it into the database, there is a removal request form accessible here. It is meant for specific situations, not just a blanket “don’t use me” statement.
If all that makes sense to you, you can access Dolma using Hugging Face.
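For the curious, a few documents can be inspected without downloading the full corpus. This is a minimal sketch assuming Dolma is hosted on the Hugging Face Hub under the ID "allenai/dolma" and using the third-party `datasets` library's streaming mode; the dataset ID and record fields are assumptions, not details confirmed by the article.

```python
# Sketch: peek at a few Dolma records via Hugging Face streaming.
# The dataset ID "allenai/dolma" and record layout are assumptions.
DATASET_ID = "allenai/dolma"

def peek_dolma(n=3):
    """Stream the first n documents without downloading the whole corpus."""
    from datasets import load_dataset  # third-party: pip install datasets

    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    docs = []
    for i, doc in enumerate(stream):
        if i >= n:
            break
        docs.append(doc)
    return docs

if __name__ == "__main__":
    for doc in peek_dolma():
        # Each record is a dict; print its field names rather than full text.
        print(list(doc.keys()))
```

Streaming matters here: a corpus of this size is terabytes on disk, so iterating lazily over a handful of records is the practical way to get a feel for the data.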
My name is Sai Sandhya, and I work as a senior SEO strategist for the content writing team. I enjoy creating case studies, articles on startups, and listicles.