Loading a dataset from Hugging Face is straightforward: call the `load_dataset` function. You specify the dataset by its name or identifier, and you can also pin a specific version if needed.
You can also load just a subset of a dataset, which is what we'll focus on in this section.
To load a specific percentage of a dataset, pass a slice expression in the split argument, such as split='train[:10%]' for the first 10% of the train split.
Loading Datasets
Loading data is the first step in working with Hugging Face's `datasets` library. You can create a dataset from local files, which can be in various formats such as CSV, JSON, or plain text.
To load local files, you can use the `data_files` argument in `datasets.load_dataset()`, which accepts three types of inputs: a single string, a list of strings, or a dictionary mapping splits to files.
If you don't specify which split each file is related to, the provided files are assumed to belong to the train split.
From Local Files
Loading datasets from local files is a convenient way to get started with your project. You can load CSV, JSON, text, and pandas pickled dataframe files using the provided generic loading scripts.
Each format has a matching script:

- csv: loads CSV files
- json: loads JSON files
- text: reads a text file as a line-by-line dataset
- pandas: loads a pandas pickled dataframe
If you want more control over how your files are loaded, consider writing your own loading script from scratch or adapting one of the provided scripts. Writing your own gives you full flexibility, at the cost of a bit more code.
The data_files argument in datasets.load_dataset() accepts three types of inputs: a single string as the path to a single file, a list of strings as paths to a list of files, or a dictionary mapping splits names to a single file or a list of files.
Here's a summary of the types of inputs the data_files argument accepts:

- a single string: the path to one file
- a list of strings: paths to several files
- a dictionary: split names mapped to a single file or a list of files

If you don't indicate which split each file is related to, the provided files are assumed to belong to the train split.
Hugging Face Hub
You can load a dataset from any dataset repository on the Hugging Face Hub without a loading script, which makes it easy for data scientists and researchers to share and reuse data.
First, create a dataset repository and upload your data files. Then, you can use datasets.load_dataset() to load the data by providing the repository namespace and dataset name.
Some datasets may have more than one version, based on Git tags, branches or commits. You can specify which dataset version you want to load using the revision flag.
If you don't specify which data files to use, load_dataset will return all the data files. This can take a long time if you're loading a large dataset like C4, which is approximately 13TB of data.
You can load a specific subset of the files with the data_files parameter, for instance to pull only a few shards of C4 instead of the whole corpus.
JSON files are loaded directly with datasets.load_dataset().
Dataset Configuration
You can control the features of a dataset by using the features argument in datasets.load_dataset(). This allows you to override the default pre-computed features.
To specify custom features, you can create a datasets.Features instance defining the features of your dataset. This can be particularly useful when the automatically inferred features don't align with your expectations.
Selecting a Split
When you're working with datasets, selecting the right split is crucial. You can control the generated dataset split by using the split argument in datasets.load_dataset().
If you don't provide a split argument, datasets.load_dataset() will return a dictionary containing datasets for each split in the dataset. This can be really useful for exploring different splits.
You can use the split argument to build a split from only a portion of a split in absolute number of examples. For example, split='train[:10%]' will load only the first 10% of the train split.
You can also use the split argument to mix splits. For instance, split='train[:100]+validation[:100]' will create a split from the first 100 examples of the train split and the first 100 examples of the validation split. This can be helpful when you need to combine data from different splits.
Specify Features
You can attach custom labels to your dataset using datasets.ClassLabel. The automatically inferred features don't always match your expectations, so it helps to be able to define them yourself. To do so, build a datasets.Features instance that includes your own labels, then pass it as the features argument to datasets.load_dataset(). When you inspect the dataset's features afterwards, you'll see the custom labels you defined.
Data Manipulation
Data manipulation can be a crucial step in working with datasets, especially when using Hugging Face's load_dataset function.
The load_dataset function returns a Dataset object, which can be manipulated using various methods. For example, you can use the filter function to filter out rows based on a condition.
Filtering is destructive in the sense that rows failing the condition are dropped from the returned dataset: filter out 10% of the rows and the result contains only the remaining 90%.
Data manipulation can also involve transforming the data in some way. For instance, you can use the map function to apply a transformation to each row in the dataset.