OpenOrca: open source dataset and instruct-tuned LLMs

Today I am announcing OpenOrca, an open-source dataset and series of instruct-tuned language models.

As I read Orca: Progressive Learning from Complex Explanation Traces of GPT-4 by Mukherjee et al. of Microsoft, I had to consider the implications for open-source AI.

This was pretty awesome stuff. But I realized that while Microsoft would probably release their LLaMA-13b based model (as of this writing, they still haven't), they might not release the dataset.

Therefore, I resolved to replicate their efforts, download the data myself, and train the model myself, so that OpenOrca could be released for other sizes of LLaMA as well as other foundation models such as Falcon, OpenLLaMA, RedPajama, MPT, and RWKV.

This was a nontrivial undertaking. With the help of an all-star team of open-source AI/ML engineers, we have completed the OpenOrca dataset.

Our dataset consists of:

  • ~1 million FLANv2 entries augmented with GPT-4 completions

  • ~3.5 million FLANv2 entries augmented with GPT-3.5 completions

We followed the submix and system prompt distribution outlined in the Orca paper, with a few exceptions: we included all 75k CoT entries in the FLAN-1m dataset rather than sampling them, and we removed the many duplicated items we found, leaving ~3.5m instructions in the ChatGPT dataset.
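As an illustration, the deduplication step can be sketched as below. This is a minimal sketch, not our actual pipeline; the field names (`system_prompt`, `question`, `response`) are assumptions for illustration, not the dataset's real schema.

```python
# Hypothetical sketch of deduplicating augmented FLAN entries.
# Field names are illustrative assumptions, not the actual schema.

def dedupe(entries):
    """Keep only the first occurrence of each (system prompt, question) pair."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["system_prompt"], entry["question"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

entries = [
    {"system_prompt": "You are a helpful assistant.",
     "question": "What is 2+2?", "response": "4"},
    {"system_prompt": "You are a helpful assistant.",
     "question": "What is 2+2?", "response": "Four."},  # duplicate prompt/question
    {"system_prompt": "Think step by step.",
     "question": "What is 2+2?", "response": "2 + 2 = 4"},
]
print(len(dedupe(entries)))  # 2 — the repeated prompt/question pair is dropped
```

Keying on the prompt/question pair (rather than the full record) drops items that differ only in their completion, which is one plausible reading of "duplicated items."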

We are presently performing full weights fine-tuning of OpenOrca on the foundation of LLaMA-13b, so that our performance can be compared with Microsoft’s model when it releases.

We expect to release OpenOrca-LLaMA-13b in mid-July 2023. At that time we will publish our evaluation findings and the dataset.

We are currently seeking GPU compute sponsors for training OpenOrca on the following platforms:

  • Falcon 7b, 40b

  • LLaMA 7b, 13b, 33b, 65b

  • MPT-7b, 30b

  • Any other targets that get a sponsor (e.g., RWKV, OpenLLaMA)

From the Orca paper and our experiments, we roughly estimate the compute costs as follows:

Model Size    Compute Estimate
7b            1k GPU-Hours
13b           2k GPU-Hours
30/33b        4k-6k GPU-Hours
40b           8k-10k GPU-Hours
65b           10k-15k GPU-Hours

We will acknowledge our sponsors in this space, as well as in the model cards.

Our current sponsors:

Please reach out to me if you are interested in providing compute sponsorship for any specific targets of OpenOrca.

I would like to thank the motley crew of open-source AI/ML engineers who have worked beside me in this endeavor, including:

  • Wing “Caseus” Lian and NanoBit of OpenAccess AI Collective

  • AutoMeta, Entropi, AtlasUnified, and neverendingtoast of Alignment Lab AI

  • Rohan

  • Teknium

  • Pankaj Mathur

  • Tom “TheBloke” Jobbins for quantizing and amplifying

  • All the other people in the Open Source AI community who have taught me and helped me along the way.
