How to Create a Hugging Face Dataset from Scratch


Creating a Hugging Face dataset from scratch requires some setup, but it's not as daunting as it sounds. You'll need to create a new dataset repository on the Hugging Face Hub and initialize a new dataset directory.

To start, you'll need to install the Huggingface Datasets library, which you can do by running `pip install datasets`. This library provides a simple and consistent interface for loading and manipulating datasets.

Creating a new dataset involves defining the schema, which is the structure of your dataset. This includes the types of data you'll be storing, such as text or images, and the format of that data.


Creating a Hugging Face Dataset

You can create a dataset with Hugging Face Datasets by using its low-code methods, which can save you a lot of time.

To start, you'll need to create a dataset class as a subclass of GeneratorBasedBuilder. This class has three methods: _info(), _split_generators(), and _generate_examples(). These methods will help you create your dataset.



The _info() method stores information about your dataset, such as its description, license, and features. This is where you'll provide details about your dataset.

The _split_generators() method downloads the dataset and defines its splits. This is where you'll specify how your dataset will be divided.

The _generate_examples() method generates the examples, such as images and labels, for each split. This is where you'll create the actual data for your dataset.

Here's a quick rundown of what you'll need to create a dataset with Hugging Face Datasets:

  • Folder-based builders for quickly creating an image or audio dataset
  • from_ methods for creating datasets from local files

By using these low-code methods, you can easily and rapidly create a dataset with Hugging Face Datasets.
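For example, the folder-based image builder needs just one call (the directory path is a placeholder):

```
from datasets import load_dataset

# Labels are inferred from the subdirectory names under the placeholder path
dataset = load_dataset("imagefolder", data_dir="/path/to/images")
```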

Data Preparation

After defining the attributes of your dataset, the next step is to download the data files and organize them according to their splits.

You can download the data files by using the DownloadManager.download_and_extract() function, which takes a dictionary of URLs or relative paths to the files. If the data files live in the same folder or repository of the dataset script, you can just pass the relative paths to the files instead of URLs.


Once the files are downloaded, use SplitGenerator to organize each split in the dataset. This is a simple class that names each split (train, validation, and test) and carries the keyword arguments, such as file paths, that get passed on to _generate_examples().

Here's a summary of the steps to download and organize the data files:

  1. Download the data files using DownloadManager.download_and_extract().
  2. Use SplitGenerator to organize each split in the dataset.
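Putting both steps together, a minimal _split_generators() sketch might look like this (the URLs are placeholders, not real data files):

```
import datasets

_URLS = {
    "train": "https://example.com/train.zip",  # placeholder URL
    "test": "https://example.com/test.zip",    # placeholder URL
}

# A method of your GeneratorBasedBuilder subclass
def _split_generators(self, dl_manager):
    # Step 1: download and extract; returns local paths keyed like _URLS
    paths = dl_manager.download_and_extract(_URLS)
    # Step 2: one SplitGenerator per split; gen_kwargs are forwarded to _generate_examples()
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": paths["train"]}),
        datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": paths["test"]}),
    ]
```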

Upload the Dataset

To upload your dataset to the Hub, make sure you have the huggingface_hub library installed. Install it with `pip install huggingface_hub` if you haven't already.

First, create a dataset card, which is a necessary step before uploading your dataset. This card serves as a description of your dataset.

Once your script and dataset card are ready, make sure you're logged in to your Hugging Face account, for example with `huggingface-cli login`.

You can then upload your dataset with the push_to_hub() method, which is available on Dataset and DatasetDict objects. This method allows you to share your dataset easily.

After uploading your dataset, you can load it from the Hub, which means you can access it from anywhere and use it for your projects.
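A minimal sketch, assuming `dataset` is the Dataset object you built and using a hypothetical repository ID:

```
from datasets import load_dataset

# "username/my_dataset" is a placeholder repository ID on the Hub
dataset.push_to_hub("username/my_dataset")

# Later, load it back from the Hub from anywhere
dataset = load_dataset("username/my_dataset")
```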

Download Data Files


To download data files, you can use the DownloadManager.download_and_extract() function, which takes a dictionary of URLs pointing to the original data files and returns the paths to the downloaded, extracted copies. You call it once you've defined the attributes of your dataset.

If the data files live in the same folder or repository as the dataset script, you can pass relative paths to the files instead of URLs. This makes it easy to access the files without having to specify their full URLs.
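For example (the file names below are assumptions about your layout):

```
# Files assumed to sit next to the dataset script
paths = dl_manager.download_and_extract({"train": "data/train.csv", "test": "data/test.csv"})
```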

The DownloadManager.download_and_extract() function is a convenient way to download data files, especially if you have multiple files to download. It's also a good idea to use this function when working with large datasets to avoid manual downloading and extracting of files.

Once the files are downloaded, you can use the SplitGenerator to organize each split in the dataset. As above, this simple class names each split and passes the relevant file paths along to _generate_examples().

From Python Dictionaries


When working with data, it's often necessary to prepare it for analysis or modeling. One way to do this is by creating a dataset from Python dictionaries. You can use the from_dict() method to create a dataset from a dictionary.

This method is a straightforward way to create a dataset from a dictionary, for example one containing lists of Pokémon and their types, as shown below.
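A minimal version of that example:

```
from datasets import Dataset

dataset = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
```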

To create an image or audio dataset, you can chain the cast_column() method with from_dict() and specify the column and feature type. For example, to create an audio dataset, you would use the following code: audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio()).

You can also use the from_generator() method to create a dataset from a generator, which is especially useful when working with large datasets that may not fit in memory. This method generates the dataset on disk progressively and then memory-maps it, making it a more memory-efficient way to create a dataset.


A generator-based IterableDataset needs to be iterated over with a for loop, as shown in the sketch after the list below.

Here are the main methods for creating a dataset from Python dictionaries:

  • from_dict(): creates a dataset from a dictionary
  • from_generator(): creates a dataset from a generator
  • cast_column(): casts a column to a specific feature type
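Here's a short sketch of both methods; note the for loop over the IterableDataset variant:

```
from datasets import Dataset, IterableDataset

def gen():
    yield {"pokemon": "bulbasaur", "type": "grass"}
    yield {"pokemon": "squirtle", "type": "water"}

# Materialized on disk progressively, then memory-mapped
dataset = Dataset.from_generator(gen)

# Lazy variant: nothing is written to disk; iterate to get examples
iterable_dataset = IterableDataset.from_generator(gen)
for example in iterable_dataset:
    print(example)
```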

Data Verification

Data Verification is a crucial step in the data preparation process. It ensures that your dataset is accurate and reliable.

Testing data and checksum metadata should be added to your dataset to verify its behavior. This is especially important for datasets stored in the GitHub repository of the 🤗 Datasets library, where they are mandatory.

Make sure to run all commands from the root of your local datasets repository to ensure accurate verification.
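With a dataset script in place, the test and checksum metadata is typically generated with the datasets-cli tool. The script path below is a placeholder, and the flag is spelled --save_infos on older releases of the library:

```
datasets-cli test path/to/your_dataset_script.py --save_info --all_configs
```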

Samples

Generating samples is a crucial step in data preparation. It involves loading the data files and extracting the columns.

To load the data files, write a function that uses the file path provided by gen_kwargs to read and parse the data files and extract the columns.


The function should yield a tuple of an id_ and an example from the dataset. This means that for each data file, the function should return a unique id and a sample from the dataset.

Here are the steps to generate samples:

  • Load the data files using the file path provided by gen_kwargs.
  • Extract the columns from the data files.
  • Yield a tuple of an id_ and an example from the dataset for each data file.
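Here's a minimal _generate_examples() sketch, assuming a CSV file with image_path and label columns (both column names are assumptions about your data):

```
import csv

def _generate_examples(self, filepath):
    # filepath arrives via the gen_kwargs defined in _split_generators()
    with open(filepath, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for id_, row in enumerate(reader):
            # Yield a unique key and one example per row
            yield id_, {"image_path": row["image_path"], "label": row["label"]}
```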

Defining Dataset Structure

To create a dataset with 🤗 Datasets, you'll need to define its structure. This involves specifying the attributes of your dataset, such as its description, features, homepage, and citation. You can do this by adding the necessary information in the DatasetBuilder._info() method.

The most important attributes to specify are the dataset description, features, homepage, and citation. The dataset description should provide a concise overview of what's in the dataset and how it was collected. The features define the name and type of each column in your dataset, which will also provide the structure for each example.

Here are some key attributes to include in your dataset structure:

  • DatasetInfo.description: A concise description of your dataset.
  • DatasetInfo.features: Defines the name and type of each column in your dataset.
  • DatasetInfo.homepage: The URL to the dataset homepage.
  • DatasetInfo.citation: A BibTeX citation for the dataset.

For example, a filled-out _info() method, modeled on the SQuAD loading script, looks roughly like this:
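```
def _info(self):
    # _DESCRIPTION and _CITATION are module-level strings defined elsewhere in the script
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "id": datasets.Value("string"),
                "title": datasets.Value("string"),
                "context": datasets.Value("string"),
                "question": datasets.Value("string"),
                "answers": datasets.features.Sequence(
                    {
                        "text": datasets.Value("string"),
                        "answer_start": datasets.Value("int32"),
                    }
                ),
            }
        ),
        homepage="https://rajpurkar.github.io/SQuAD-explorer/",
        citation=_CITATION,
    )
```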

Create a Class


Creating a class is a crucial step in defining your dataset structure. You'll want to create a dataset class as a subclass of GeneratorBasedBuilder, which is the base class for datasets generated from a dictionary generator.

To get started, you'll need to add three methods to your dataset class: _info(), _split_generators(), and _generate_examples(). These methods will help you create your dataset, but don't worry too much about filling them in just yet - you'll develop those over the next few sections.

The _info() method will store information about your dataset, such as its description, license, and features. The _split_generators() method will download the dataset and define its splits. The _generate_examples() method will generate the images and labels for each split.

Here are the three methods you'll need to add to your dataset class:

  • _info()
  • _split_generators()
  • _generate_examples()

These methods will form the foundation of your dataset class, and will help you create a robust and well-structured dataset. By following these steps, you'll be well on your way to defining a dataset structure that meets your needs.
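A bare-bones skeleton might look like this (the class name is a placeholder):

```
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    """A placeholder dataset builder; fill in each method as described above."""

    def _info(self):
        # Description, license, and features go here
        ...

    def _split_generators(self, dl_manager):
        # Download the data and define the splits here
        ...

    def _generate_examples(self, **kwargs):
        # Yield (key, example) pairs for each split here
        ...
```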

Add Attributes


Adding attributes to your dataset is a crucial step in defining its structure. You'll want to include a description of your dataset, which should inform users what's in the dataset, how it was collected, and how it can be used for an NLP task.

The description should be concise, so keep it brief. The SQuAD loading script provides a good example of this.

To define the structure of your dataset, you'll need to specify the name and type of each column. This is done using the Features class, which provides a full list of feature types you can use.

Features can be nested, allowing you to create subfields in a column if needed. This can be useful for organizing complex data.
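For instance, a nested column can be declared like this sketch (the field names are illustrative, not required):

```
from datasets import Features, Sequence, Value

features = Features({
    "question": Value("string"),
    # Nested subfields: each example carries a list of answers with two fields
    "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
})
```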

Some important attributes to include in your dataset are its homepage and citation. The homepage should contain a link to the dataset's homepage, while the citation should be a BibTeX citation for the dataset.


The key attributes to include are DatasetInfo.description, DatasetInfo.features, DatasetInfo.homepage, and DatasetInfo.citation, as listed in the previous section.

By including these attributes, you'll make it easier for users to understand your dataset and use it for their own purposes.

Data Organization and Splitting

To organize your dataset, you can use the SplitGenerator to organize the images and labels in each split. Name each split with a standard name like Split.TRAIN, Split.TEST, and Split.VALIDATION.

When downloading data files, you can use DownloadManager.download_and_extract() to download the data files and organize them according to their splits. This method takes a dictionary of URLs pointing to the original data files.

If your dataset lives in the same folder or repository as the dataset script, you can pass the relative paths to the files instead of URLs. In this case, use DownloadManager.download_and_extract() to download and extract the data files.

Here are the standard names for dataset splits:

  • Split.TRAIN
  • Split.TEST
  • Split.VALIDATION

Image Folder

The Image Folder is a powerful tool for quickly loading an image dataset without requiring you to write any code.


You can store your dataset in a directory structure like "class1/class1_image1.jpg", "class1/class1_image2.jpg", "class2/class2_image1.jpg", and so on, where the class labels are automatically inferred from the directory names.

To load your dataset, simply specify "imagefolder" in load_dataset() and the directory in data_dir. For example, if your dataset is stored in "/path/to/dataset", you can load it by calling load_dataset("imagefolder", data_dir="/path/to/dataset").

If you have multiple splits in your dataset, you can store them in a single directory with a structure like "train/class1/class1_image1.jpg", "train/class1/class1_image2.jpg", "test/class2/class2_image1.jpg", and so on.

You can also include additional information about your dataset, such as text captions or bounding boxes, by adding a metadata.jsonl file in your folder. This file must have a file_name column that links image files with their metadata.

Here's an example of what the metadata.jsonl file might look like (each record sits on its own line):

```
{"file_name": "class1/class1_image1.jpg", "caption": "A cat sitting on a mat", "bounding_boxes": [{"x": 10, "y": 20, "w": 30, "h": 40}]}
```

Note that if metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set drop_labels=False in load_dataset().
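For example, to keep those labels (the directory path is a placeholder):

```
from datasets import load_dataset

# drop_labels=False keeps the directory-based labels alongside the metadata
dataset = load_dataset("imagefolder", data_dir="/path/to/dataset", drop_labels=False)
```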

Folder-Based Builders


Folder-based builders are a great way to quickly create an image or audio dataset with several thousand examples without requiring much code. They automatically generate the dataset's features, splits, and labels.

ImageFolder is a dataset builder designed to quickly load an image dataset with several thousand images. It uses the Image feature to decode an image file, and many image extension formats are supported, such as jpg and png.

The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name. For example, your image dataset might be stored in a directory structure like this (an illustrative layout):
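```
folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
```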

ImageFolder then generates an example roughly like this, where the label is an integer class index:
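```
{"image": <PIL.Image.Image image mode=RGB size=1200x800>, "label": 0}
```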

Create the image dataset by specifying imagefolder in load_dataset() (the directory path is a placeholder):
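```
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```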

AudioFolder is similar to ImageFolder but is used for audio datasets. It uses the Audio feature to decode an audio file, and audio extensions such as wav and mp3 are supported.

To include additional information about your dataset, like text captions or transcriptions, add a metadata.csv file in your folder. The metadata file needs to have a file_name column that links the image or audio file to its corresponding metadata.
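For example, a metadata.csv for an audio dataset might look like this (the file names and transcriptions are invented):

```
file_name,transcription
audio/sample_1.wav,the first transcription
audio/sample_2.wav,the second transcription
```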

Web


The WebDataset format is a great way to organize big image datasets. It's based on TAR archives, which can be used to group images in a way that's easy to manage.

You can have thousands of TAR archives, each containing 1GB of images. This makes it simple to store and access large datasets.

Each example in the archives is made up of files that share the same prefix. For instance, an image "0001.jpg" and its label file "0001.json" together form one example.

You can use JSON or text files to store labels, captions, or bounding boxes. This makes it easy to add metadata to your images.

Loading a WebDataset will automatically create columns for each file suffix. For example, if you have "jpg" and "json" files, you'll get two columns.
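A loading sketch, assuming your TAR shards match a glob pattern (the path is a placeholder, and the webdataset builder requires a recent version of 🤗 Datasets):

```
from datasets import load_dataset

# "path/to/shard-*.tar" is a placeholder glob for your TAR archives
dataset = load_dataset(
    "webdataset",
    data_files={"train": "path/to/shard-*.tar"},
    split="train",
    streaming=True,  # stream examples instead of downloading everything first
)
```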

Frequently Asked Questions

How to create an empty Hugging Face dataset?

To create an empty Hugging Face dataset, you can use the from_dict() method with an empty dictionary. Alternatively, you can use the from_generator() method with a generator that yields no data.
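A quick sketch of the first approach:

```
from datasets import Dataset

# An empty dataset: zero rows and zero columns
empty = Dataset.from_dict({})
```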
