Hugging Face: Loading a Percentage of a Dataset for Data Science

Posted Oct 29, 2024

Loading a dataset from Hugging Face is a straightforward process, and it can be done using the `load_dataset` function.

You can load a dataset by specifying its name or identifier, and you can also load a specific version of the dataset if needed.

Loading a subset of a dataset is also possible, and this can be done by specifying a percentage value, which is what we'll be focusing on in this section.

To load a specific percentage of a dataset, you use slicing syntax inside the split argument rather than a separate numeric parameter: for example, split='train[:10%]' loads only the first 10% of the train split.

Loading Datasets

Loading datasets is the first step in any workflow built on Hugging Face's load_dataset function. You can create a dataset from local files in various formats, such as CSV, JSON, or plain text.

To load local files, you can use the `data_files` argument in `datasets.load_dataset()`, which accepts three types of inputs: a single string, a list of strings, or a dictionary mapping splits to files.

If you don't specify which split each file is related to, the provided files are assumed to belong to the train split.

From Local Files

Loading datasets from local files is a convenient way to get started with your project. You can load CSV, JSON, text, and pandas pickled dataframe files using the provided generic loading scripts.

Each file type has its own generic loading script:

- csv: loads CSV files
- json: loads JSON files
- text: reads text files as a line-by-line dataset
- pandas: loads pandas pickled dataframes

If you want more control over how your files are loaded, consider writing your own loading script from scratch or adapting one of the provided scripts. This takes more work than using the generic scripts, but it gives you full control over parsing and preprocessing.

The data_files argument in datasets.load_dataset() accepts three types of inputs: a single string as the path to a single file, a list of strings as paths to a list of files, or a dictionary mapping split names to a single file or a list of files.

Here's a summary of the types of inputs the data_files argument accepts:

- a single string: the path to one file
- a list of strings: paths to several files
- a dictionary: a mapping from split names (such as "train" or "test") to a file or a list of files

As noted above, files without an explicit split mapping are assumed to belong to the train split.

Hugging Face Hub

You can load a dataset from any dataset repository on the Hugging Face Hub without a loading script. This is a game-changer for data scientists and researchers.

First, create a dataset repository and upload your data files. Then you can use datasets.load_dataset() to load the data from the Hub by providing the repository namespace and dataset name.

Some datasets may have more than one version, based on Git tags, branches or commits. You can specify which dataset version you want to load using the revision flag.

If you don't specify which data files to use, load_dataset will return all the data files. This can take a long time if you're loading a large dataset like C4, which is approximately 13TB of data.

You can load a specific subset of the files with the data_files parameter. The example below loads files from the C4 dataset.

JSON files are loaded directly with datasets.load_dataset().

Dataset Configuration

You can control the features of a dataset by using the features argument in datasets.load_dataset(). This allows you to override the default pre-computed features.

To specify custom features, create a datasets.Features instance defining the features of your dataset. This is particularly useful when the automatically inferred features don't align with your expectations; see the worked example under Specify Features below.

Selecting a Split

When you're working with datasets, selecting the right split is crucial. You can control the generated dataset split by using the split argument in datasets.load_dataset().

If you don't provide a split argument, datasets.load_dataset() will return a dictionary containing datasets for each split in the dataset. This can be really useful for exploring different splits.

You can use the split argument to build a split from only a portion of another split, expressed either as a percentage or as an absolute number of examples. For example, split='train[:10%]' loads only the first 10% of the train split, while split='train[:500]' loads the first 500 examples.

You can also use the split argument to mix splits. For instance, split='train[:100]+validation[:100]' will create a split from the first 100 examples of the train split and the first 100 examples of the validation split. This can be helpful when you need to combine data from different splits.

Specify Features

You can add custom labels to your dataset using datasets.ClassLabel.

The features of the dataset may not always align with your expectations, so it's good to be able to define them yourself.

You can use the datasets.Features class to define the full schema of your dataset, including those custom labels.

Then pass the features you just created to the features argument of datasets.load_dataset().

This way, you can see the custom labels you defined when you look at your dataset features.

Data Manipulation

Data manipulation can be a crucial step in working with datasets, especially when using Hugging Face's load_dataset function.

The load_dataset function returns a Dataset object, which can be manipulated using various methods. For example, you can use the filter function to filter out rows based on a condition.

Keep in mind that filtering discards data: if a filter removes 10% of the rows, the resulting Dataset contains only the remaining 90%.

Data manipulation can also involve transforming the data in some way. For instance, you can use the map function to apply a transformation to each row in the dataset.

Jay Matsuda

Lead Writer

Jay Matsuda is an accomplished writer and blogger who has been sharing his insights and experiences with readers for over a decade. He has a talent for crafting engaging content that resonates with audiences, whether he's writing about travel, food, or personal growth. With a deep passion for exploring new places and meeting new people, Jay brings a unique perspective to everything he writes.
