OpenOrca: open source dataset and instruct-tuned LLMs

Today I am announcing OpenOrca, an open-source dataset and series of instruct-tuned language models.

As I read Orca: Progressive Learning from Complex Explanation Traces of GPT-4 by Mukherjee et al. of Microsoft, I had to consider the implications for open-source AI.

This was pretty awesome stuff. But I realized that while Microsoft would probably release their LLaMA-13b based model (as of this writing, they still haven't), they might not release the dataset.

Therefore, I resolved to replicate their efforts, download the data myself, and train the model myself, so that OpenOrca could be released for other sizes of LLaMA as well as other foundation models such as Falcon, OpenLLaMA, RedPajama, MPT, and RWKV.

This was a nontrivial undertaking. With the help of an all-star team of open-source AI/ML engineers, we have completed the OpenOrca dataset.

Our dataset consists of:

  • ~1 million FLANv2 entries augmented with GPT-4 completions

  • ~3.5 million FLANv2 entries augmented with GPT-3.5 completions

We followed the submix and system prompt distribution outlined in the Orca paper, with a few exceptions: we included all 75k CoT entries in the FLAN-1m dataset rather than sampling them, and we removed the many duplicated items we found, leaving ~3.5m instructions in the ChatGPT dataset.
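As an illustration, the deduplication step can be sketched as below. This is a minimal sketch, not our actual pipeline; the field names (`system_prompt`, `question`, `response`) are assumptions for illustration, not the dataset's real schema.

```python
# Hypothetical sketch of deduplicating augmented FLAN entries.
# Field names are illustrative assumptions, not the actual schema.

def dedupe(entries):
    """Keep only the first occurrence of each (system prompt, question) pair."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["system_prompt"], entry["question"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

entries = [
    {"system_prompt": "You are a helpful assistant.",
     "question": "What is 2+2?", "response": "4"},
    {"system_prompt": "You are a helpful assistant.",
     "question": "What is 2+2?", "response": "Four."},  # duplicate prompt/question
    {"system_prompt": "Think step by step.",
     "question": "What is 2+2?", "response": "2 + 2 = 4"},
]
print(len(dedupe(entries)))  # 2 — the repeated prompt/question pair is dropped
```

Keying on the prompt/question pair (rather than the full record) drops items that differ only in their completion, which is one plausible reading of "duplicated items."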

We are presently performing full weights fine-tuning of OpenOrca on the foundation of LLaMA-13b, so that our performance can be compared with Microsoft’s model when it releases.

We expect to release OpenOrca-LLaMA-13b in mid-July 2023. At that time we will publish our evaluation findings and the dataset.

We are currently seeking GPU compute sponsors for training OpenOrca on the following platforms:

  • Falcon 7b, 40b

  • LLaMA 7b, 13b, 33b, 65b

  • MPT-7b, 30b

  • Any other targets that get a sponsor (e.g., RWKV, OpenLLaMA)

From the Orca paper and our experiments, we roughly estimate the compute costs as follows:

Model Size    Compute Estimate
7b            1k GPU-Hours
13b           2k GPU-Hours
30/33b        4k-6k GPU-Hours
40b           8k-10k GPU-Hours
65b           10k-15k GPU-Hours

We will acknowledge our sponsors in this space, as well as in the model cards.

Our current sponsors:

Please reach out to me if you are interested in providing compute sponsorship for any specific targets of OpenOrca.

I would like to thank the motley crew of open-source AI/ML engineers who have worked beside me in this endeavor, including:

  • Wing “Caseus” Lian and NanoBit of OpenAccess AI Collective

  • AutoMeta, Entropi, AtlasUnified, and neverendingtoast of Alignment Lab AI

  • Rohan

  • Teknium

  • Pankaj Mathur

  • Tom “TheBloke” Jobbins for quantizing and amplifying

  • All the other people in the Open Source AI community who have taught me and helped me along the way.
