Push Your Custom Dataset to Huggingface Two Ways

TwoDataDetectives
5 min read · Jul 3, 2023

Three years ago, I said goodbye to my research career in robotics and NLP and moved to a whole new domain: seizure detection using deep learning and wearable sensors. My dataset was 13,254 hours of bio-sensor recordings collected over the years by the research team and our collaborators in Boston. The dataset was 300GB, and I had a 2019 16-inch MacBook Pro with 500GB of storage.

I was a newbie — when I tried to load the 300GB dataset for training, my laptop’s memory ran out and it crashed.

(Jackie Chan "WTF" meme: https://imgflip.com/s/meme/Jackie-Chan-WTF.jpg)

In hindsight, it wasn’t a complete surprise: my laptop certainly did not have more than 300GB of RAM.

That was when the senior machine learning engineer on our team, Stefan, told me, “You have to use generators,” and I said, “Wait, what?”

My native language is C++, and with my four years of electrical engineering training, the generators I was picturing were something very different. I spent the night reading this article about Python generators and felt I finally got what Stefan was saying!

If you are also new to generators, I strongly recommend you head over to that page now and spend a few minutes reading what a generator is, including the first few example use cases.

I see Python generators as water pipes: if you can’t fit all your data into a bucket, you can use a water pipe and stream the data into your machine learning model. I had thousands of files storing the data, so instead of loading all the files into memory first and pumping them into my model at once, I loaded one file at a time, streamed its data points to the model for training, loaded the next file, streamed its data points, and so on, collecting the results at the other end of the pipe. A minimal sketch of that pattern is below.
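Here is a minimal sketch of the streaming idea, assuming the data is stored as one JSON object per line; the file names and the train_step call are hypothetical placeholders:

import json

def stream_samples(filenames):
    # yield one sample at a time instead of loading every file into memory
    for filename in filenames:
        with open(filename, "r") as f:
            for line in f:
                # the function pauses at yield, so only the current line is held in memory
                yield json.loads(line)

# downstream code pulls samples off the "pipe" one by one, e.g.:
# for sample in stream_samples(["part_001.jsonl", "part_002.jsonl"]):
#     train_step(sample)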

The generator method worked beautifully on my 300GB dataset!

Nowadays, datasets are only getting bigger, and when I was reading the documentation on how to create my own dataset and upload it to Huggingface, I saw that they recommend generators — exactly what I prefer as well!

If you have a large dataset that is not likely to fit into memory, you will love generators too ;)

But if you don’t, I still have a simple option for you, so keep reading :)

As you might already know, Huggingface has a lot of great datasets that you can load with a simple line of code:

from datasets import load_dataset
dataset = load_dataset('sherryibm/dataset_name')

where sherryibm is just my Huggingface user ID.

I want to talk about two easy ways to upload your dataset.

Option 1: if your dataset is small and straightforward — use a dictionary

Option 2: if your dataset is too large to fit into memory — use a generator

Let’s talk about Option 1 first.

Option 1: via a Dictionary

Using a dictionary to create and upload a dataset to Huggingface Hub is the easiest option. Let’s use a simple example — creating a code translation dataset (translate from SAS to Python) to finetune Starcoder.

Starcoder chat takes data in this format (as shown in their GitHub):

{
  "messages": [
    {
      "content": "Is it possible to imagine a society without law?",
      "role": "user"
    },
    {
      "content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law.",
      "role": "assistant"
    },
    {
      "content": "It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual.",
      "role": "user"
    },
    {
      "content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts.",
      "role": "assistant"
    }
  ]
}

Therefore, each training sample in our custom dataset will need to look like this:

message = [
    {
        "content": "Translate this from sas to python\n Some sas code",
        "role": "user"
    },
    {
        "content": "python code.",
        "role": "assistant"
    }
]

Let’s create a mock dataset with 600 identical samples, and write it into a dictionary of messages:

messages = [message for _ in range(600)]
dataset_messages = {"messages": messages}

This dictionary can be directly loaded into a Huggingface dataset:

from datasets import Dataset
msg_ds = Dataset.from_dict(dataset_messages)
msg_ds = msg_ds.train_test_split(test_size=0.1, seed=42)
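If you want a quick sanity check before uploading, printing the dataset should show the two splits and their row counts (with 600 samples and a 0.1 test size, that is 540 train and 60 test rows):

print(msg_ds)
# expected to print something like:
# DatasetDict({
#     train: Dataset({
#         features: ['messages'],
#         num_rows: 540
#     })
#     test: Dataset({
#         features: ['messages'],
#         num_rows: 60
#     })
# })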

And… uploaded to Huggingface Hub:

msg_ds.push_to_hub("sherryibm/dataset_name", private=True)
msg_ds["train"].push_to_hub("sherryibm/dataset_name", split="train_ift", private=True)
msg_ds["test"].push_to_hub("sherryibm/dataset_name", split="test_ift", private=True)

Here I’m setting my dataset to private, so only I can use the dataset.

Oh, and before you push to Hub, remember to:

pip install huggingface_hub
huggingface-cli login

More details on logging in can be found on this documentation page.
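Once you are logged in, the private dataset can be loaded back just like the public ones from the beginning of this post, using the split names we pushed above:

from datasets import load_dataset

# loading a private dataset works the same way once you are authenticated
dataset = load_dataset("sherryibm/dataset_name", split="train_ift")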

If you will never use a dataset so big that it can’t fit into memory, you can stop reading — there is no shame in using a simple dictionary to achieve exactly what you need. However, if you are likely to run into what I did with my 300GB dataset, have a look at the next section and keep it as an option for yourself.

Option 2: via a Generator

Let’s use the same Starcoder example, but this time, pretend our data is stored in 1,000 files, each 500MB and containing 10,000 training samples like below:

{"sas_code": "some SAS code 1", "py_code": "some Python code 1"}
{"sas_code": "some SAS code 2", "py_code": "some Python code 2"}
{"sas_code": "some SAS code 3", "py_code": "some Python code 3"}
...
{"sas_code": "some SAS code 10000", "py_code": "some Python code 10000"}

To load and upload these files, you will need a generator function like below:

import json

def convert_row_gen(filenames):
    for filename in filenames:
        # load the data files one at a time
        with open(filename, 'r') as f:
            for row in f:
                row = json.loads(row)
                chat_start = "Translate this to Python \n" + row["sas_code"]

                # build one training sample in the Starcoder chat format
                message = [{'content': chat_start, 'role': 'user'},
                           {'content': row["py_code"], 'role': 'assistant'}]

                # use yield instead of return
                yield {"messages": message}

What I did here was load the data files one at a time, process each row into the Starcoder chat format, and hand this generator to the Huggingface Datasets library.
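Because it is a generator, you can also pull a single sample off it to sanity-check the format without touching the rest of the files (the file name here is a hypothetical placeholder):

# grab just the first sample to check the format
sample = next(convert_row_gen(["sas_py_part_001.jsonl"]))
print(sample["messages"][0]["content"])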

To create the dataset, instead of using from_dict we use from_generator like below:

msg_ds = Dataset.from_generator(lambda: convert_row_gen(filenames))
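If you prefer not to wrap the generator in a lambda, the same call can also be written with gen_kwargs, which the datasets library accepts for passing arguments to the generator function:

# equivalent call, passing the arguments via gen_kwargs instead of a lambda
msg_ds = Dataset.from_generator(convert_row_gen, gen_kwargs={"filenames": filenames})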

The rest is the same:

msg_ds.push_to_hub("sherryibm/dataset_name", private=True)

That’s it! Simple, isn’t it?

Hope this helped accelerate your Huggingface journey a little!

Please click like and subscribe if you like our blog — we hope to share with you all the little and big things we learned and are learning every day, and your support means a lot to us!

See you in the next blog! ^_^

Below are some useful articles that I learned from:

  1. Huggingface Load a dataset from the Hub
  2. Huggingface Create a Dataset
  3. Huggingface Share a dataset to Hub
  4. Starcoder Github
  5. Dataset HuggingFaceH4/oasst1_en notebook


TwoDataDetectives

We are two data scientists from IBM Technology — Dr Nirandi and Dr Sherry. All our opinions are our own and do not represent those of IBM.