Hugging Face's datasets library is a treasure trove of ready-to-use datasets that can significantly boost your AI projects' efficiency.
Their datasets are sourced from various places, including web scraping, user contributions, and partnerships with other organizations.
The library includes a wide range of datasets, from text classification to computer vision, making it a one-stop shop for many AI tasks.
One notable example is the IMDB dataset, which contains 50,000 movie reviews for sentiment analysis tasks.
Dataset Basics
When working with a large dataset, it's essential to consider how to process it efficiently. Instead of loading everything up front, you can pass an iterator to the pipeline, which will recognize that the input is iterable and start fetching the next item while the GPU is still processing the current one.
A generator such as data() yields inputs one at a time, and iterating over the pipeline's output yields each result as soon as it's ready. This approach is crucial because it lets you feed the GPU as fast as possible without allocating memory for the whole dataset, which makes a significant difference in speed and efficiency.
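Here's a minimal sketch of that pattern, closely following the Transformers pipeline tutorial; the gpt2 checkpoint and device=0 (first GPU) are just example choices:

```python
from transformers import pipeline

# The generator yields inputs one at a time, so the whole
# dataset never has to sit in memory.
def data():
    for i in range(1000):
        yield f"My example {i}"

# Example checkpoint; device=0 assumes a GPU is available.
pipe = pipeline("text-generation", model="openai-community/gpt2", device=0)

# The pipeline detects the iterable input and fetches the next item
# while the current one is still being processed on the GPU.
for out in pipe(data()):
    print(out[0]["generated_text"][:60])
```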
Iterable Dataset
An IterableDataset is a great way to load large datasets, especially when you don't want to wait for the whole thing to download.
You can create an IterableDataset from an existing Dataset with Dataset.to_iterable_dataset(), which is faster than streaming mode because it streams from local files.
Iterating over an IterableDataset is a bit different than with a regular Dataset - you can't get random access to examples, so you need to iterate over its elements, for example by calling next(iter(dataset)) or with a for loop.
You can return a subset of the dataset with a specific number of examples in it with IterableDataset.take(), but this creates a new IterableDataset.
Using an iterator is a great way to run inference on a large dataset, as it allows the pipeline to fetch data while processing it on the GPU, without having to allocate memory for the whole dataset.
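A short sketch of these ideas, using the IMDB dataset as a stand-in for any large dataset:

```python
from datasets import load_dataset

# Stream the dataset instead of downloading it in full.
streamed = load_dataset("imdb", split="train", streaming=True)

# No random access: peek at the first example with next(iter(...)).
print(next(iter(streamed))["text"][:80])

# take() returns a new IterableDataset with only the first 3 examples.
for example in streamed.take(3):
    print(example["label"])

# Converting a local Dataset is faster than streaming from the Hub.
local = load_dataset("imdb", split="train").to_iterable_dataset()
```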
Batch Size
Batching can be a bit of a tricky topic, but don't worry, I've got the lowdown.
By default, pipelines won't batch inference, because batching isn't always faster - it can even be slower in some cases.
That said, batching can speed things up, especially on GPUs, where it helps feed the device as fast as possible. The batch_size parameter is worth tuning to see if it makes a measurable difference, though it's not always necessary.
Pipelines can handle batching for you, which is a big help when working with large datasets. You can run a pipeline on a large dataset using an iterator: the pipeline will recognize the input is iterable and start fetching data while processing it on the GPU, yielding each result as it's ready.
Pipelines can also alleviate some of the complexities of batching, such as chunking a single item into multiple parts to be processed by a model.
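A small benchmarking sketch, assuming a GPU and the default sentiment-analysis checkpoint; the right batch_size is whatever times fastest on your hardware:

```python
from transformers import pipeline

# device=0 assumes a GPU; the default sentiment model is used here.
pipe = pipeline("sentiment-analysis", device=0)

texts = [f"Review number {i} was great!" for i in range(256)]

# batch_size=8 groups inputs into batches of 8 per forward pass.
# Time this loop with different values (1, 8, 64, ...) and compare.
for out in pipe(texts, batch_size=8):
    pass  # consume the results
```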
Hugging Face Features
The Hugging Face feature extraction pipeline is a powerful tool that extracts hidden states from a base transformer, which can be used as features in downstream tasks.
You can load this pipeline using the task identifier "feature-extraction" from the pipeline() function. This pipeline uses no model head, making it a great choice for tasks where you need to extract features from text data.
To use the feature extraction pipeline, you'll need to specify the model and tokenizer that will be used to make predictions. This can be done by passing a PreTrainedModel or TFPreTrainedModel object as the model argument, and a PreTrainedTokenizer object as the tokenizer argument.
Here are some key arguments you can pass to the pipeline() function:
- args: One or several texts (or one list of texts) to get the features of.
- num_workers: The number of workers to use when loading data (default is 8).
- batch_size: The size of the batch to use (default is 1).
By using the Hugging Face feature extraction pipeline, you can easily extract features from text data and use them in downstream tasks.
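A quick sketch of the pipeline in action; distilbert-base-uncased is just an example checkpoint:

```python
from transformers import pipeline

# No model head: the output is the raw hidden states.
extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

# Shape is [batch, tokens, hidden_size], returned as nested lists.
features = extractor("Hugging Face pipelines are handy.")
print(len(features[0]), len(features[0][0]))  # token count, hidden size
```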
Accelerating Large Models with Hugging Face
Accelerating Large Models with Hugging Face is a game-changer. You can easily run a pipeline on large models using 🤗 accelerate.
First, make sure you have installed accelerate with pip install accelerate. This is a crucial step to get started.
To load your model, use device_map="auto"! We'll use facebook/opt-1.3b here - a great choice for large model loading.
You can also pass 8-bit loaded models if you install bitsandbytes and add the argument load_in_8bit=True. This can help with memory efficiency.
Note that you can replace the checkpoint with any Hugging Face model that supports large model loading, such as BLOOM. This gives you a lot of flexibility.
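Putting those steps together, following the Transformers documentation example (torch and recent transformers/accelerate installs are assumed):

```python
import torch
from transformers import pipeline

# device_map="auto" lets 🤗 accelerate place the weights across the
# available GPUs and CPU automatically.
pipe = pipeline(
    model="facebook/opt-1.3b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(pipe("Hugging Face is", max_new_tokens=20)[0]["generated_text"])

# With bitsandbytes installed, load in 8-bit for extra memory savings:
# pipe = pipeline(model="facebook/opt-1.3b", device_map="auto",
#                 model_kwargs={"load_in_8bit": True})
```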
The Abstraction
The pipeline abstraction is a powerful tool that simplifies working with pipelines. It's a wrapper around all the other available pipelines, making it easy to use.
You can instantiate it just like any other pipeline, and it provides additional quality of life features. If you want to use a specific model from the Hub, you can omit the task if the model on the Hub already defines it.
To call a pipeline on many items, you can simply pass a list to it. This makes it easy to process large datasets without having to write custom loops or do batching yourself.
A generator is also possible, making it easy to iterate over full datasets without having to allocate the whole dataset at once. This should work just as fast as custom loops on GPU.
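Here's a brief sketch; the sentiment checkpoint is an example, and the task argument is omitted because the model already defines it:

```python
from transformers import pipeline

# The task is inferred from the checkpoint's metadata.
pipe = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")

# Passing a list processes every item - no custom loop needed.
print(pipe(["I love this!", "This is terrible."]))

# A generator also works and avoids allocating the whole dataset.
for result in pipe(text for text in ("great movie", "awful plot")):
    print(result)
```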
Here are some ways to create a pipeline:
- Subclass your pipeline of choice
- Use the utility factory method to build a Pipeline
Chunk Batching
Chunk batching is a feature that helps with processing large inputs, like long audio files. It breaks down the input into smaller chunks, which can then be processed by a model.
Batching can be slower in some cases, so pipelines don't do it by default. However, if you want to batch inference, you can do so without requiring any extra code.
Pipelines can automatically chunk batch large inputs, which is a big help. This means you don't have to worry about how to break down the input into smaller parts.
Some pipelines, like zero-shot-classification and question-answering, are special because they can trigger multiple forward passes of a model. To handle this, they use a ChunkPipeline instead of a regular Pipeline.
The ChunkPipeline is used in the same way as a regular Pipeline, so you don't have to change your code. It can automatically handle the batching for you, which is a big convenience.
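For example, here's how chunking looks for long audio with the automatic-speech-recognition pipeline; openai/whisper-tiny and long_audio.wav are stand-ins for your own checkpoint and file:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# chunk_length_s splits the audio into 30-second pieces internally;
# the ChunkPipeline machinery stitches the transcription back together.
print(asr("long_audio.wav", chunk_length_s=30)["text"])
```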
Zero-Shot Audio Classification
Audio is another key modality in Hugging Face's pipeline lineup, and zero-shot audio classification is made possible by the ZeroShotAudioClassificationPipeline. This pipeline uses a model like ClapModel to predict the class of an audio clip when you provide the audio and a set of candidate labels.
The pipeline handles three types of audio inputs: a string containing an HTTP(S) link to an audio file, a string containing a local path, or audio loaded as a numpy array. You also provide candidate labels, which will be formatted using the hypothesis template.
The default hypothesis template is "This is a sound of {}." but you can update it for your usage. You can also specify the number of workers to use when passing a dataset, which defaults to 8.
Here are the possible inputs for the pipeline:
- Audios: str, List[str], np.array, or List[np.array]
- Candidate labels: List[str]
To use the pipeline, you'll need to specify the model, tokenizer, and feature extractor. The model can be a PreTrainedModel or TFPreTrainedModel, and the tokenizer and feature extractor should inherit from PreTrainedTokenizer and SequenceFeatureExtractor, respectively.
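A minimal sketch, assuming the laion/clap-htsat-unfused checkpoint and a hypothetical local file dog_bark.wav:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-audio-classification",
                      model="laion/clap-htsat-unfused")

# Candidate labels are slotted into the hypothesis template.
result = classifier(
    "dog_bark.wav",
    candidate_labels=["Sound of a dog", "Sound of vacuum cleaner"],
    hypothesis_template="This is a sound of {}.",
)
print(result)  # list of {"label": ..., "score": ...}
```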
Image and Video
Hugging Face also provides pipelines for image and video classification. You can use the ImageClassificationPipeline to predict the class of an image, which accepts a single image or a batch of images as input.
The pipeline can handle three types of images: HTTP(S) links, local paths, or PIL images. It also accepts a model, tokenizer, and framework, which can be specified or defaulted.
The ImageClassificationPipeline returns a dictionary or list of dictionaries containing the result, with keys for label and score. For example, if the input is a single image, it will return a dictionary with the label and score.
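A short image-classification sketch; google/vit-base-patch16-224 is one common checkpoint, and the URL is a placeholder for any image link, local path, or PIL image:

```python
from transformers import pipeline

clf = pipeline("image-classification", model="google/vit-base-patch16-224")

preds = clf("https://example.com/cat.jpg")  # placeholder image URL
for p in preds:
    print(p["label"], round(p["score"], 3))
```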
You can use the VideoClassificationPipeline to predict the class of a video, which accepts a single video or a batch of videos as input. Videos are passed as strings containing either HTTP(S) links or local paths. The pipeline also accepts parameters for top_k, num_frames, frame_sampling_rate, and function_to_apply.
The VideoClassificationPipeline returns a dictionary or list of dictionaries containing the result, with keys for label and score. For example, if the input is a single video, it will return a dictionary with the label and score.
You can use the ImageSegmentationPipeline to predict masks of objects and their classes in an image. The pipeline accepts a single image or a batch of images as input, and can handle three types of images: HTTP(S) links, local paths, or PIL images. It also accepts parameters for subtask, threshold, mask_threshold, and overlap_mask_area_threshold.
The ImageSegmentationPipeline returns a list of dictionaries, each containing a mask, a label, and a score. For example, if the input is a single image, it will return one dictionary per detected object.
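A segmentation sketch under the same caveats; facebook/detr-resnet-50-panoptic is one checkpoint that supports panoptic segmentation, and street_scene.jpg is a hypothetical local file:

```python
from transformers import pipeline

seg = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")

for out in seg("street_scene.jpg"):
    # Each entry also carries a PIL "mask" image.
    print(out["label"], out["score"])
```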
Image
Image processing is a crucial aspect of image and video analysis. There are various pipelines available for image classification, feature extraction, and object detection.
You can use the ZeroShotImageClassificationPipeline for zero-shot image classification, which predicts the class of an image when you provide an image and a set of candidate labels. This pipeline can be loaded from pipeline() using the task identifier "zero-shot-image-classification".
The pipeline handles three types of images: a string containing an HTTP(S) link, a string containing a local path, or an image loaded in PIL directly (as single images or lists of either). You can also specify a hypothesis template to format the candidate labels.
Image feature extraction can be achieved using the ImageFeatureExtractionPipeline, which extracts the hidden states from the base transformer. This pipeline can be loaded from pipeline() using the task identifier "image-feature-extraction".
The ImageClassificationPipeline is another option for image classification, using any AutoModelForImageClassification. This pipeline predicts the class of an image and can be loaded from pipeline() using the task identifier "image-classification".
For object detection, you can use the ObjectDetectionPipeline, which predicts bounding boxes of objects and their classes. This pipeline can be loaded from pipeline() using the task identifier "object-detection".
Here's a summary of the image pipelines available:
- ZeroShotImageClassificationPipeline - task identifier "zero-shot-image-classification"
- ImageFeatureExtractionPipeline - task identifier "image-feature-extraction"
- ImageClassificationPipeline - task identifier "image-classification"
- ObjectDetectionPipeline - task identifier "object-detection"
These pipelines offer various features and capabilities for image analysis, and can be used to achieve specific tasks depending on your needs.
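As an illustration, here's a zero-shot image classification sketch; openai/clip-vit-base-patch32 is one common CLIP checkpoint, and photo.jpg is a hypothetical local file:

```python
from transformers import pipeline

zsc = pipeline("zero-shot-image-classification",
               model="openai/clip-vit-base-patch32")

print(zsc("photo.jpg",
          candidate_labels=["a cat", "a dog", "a car"],
          hypothesis_template="This is a photo of {}."))
```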
Video
Video classification is a powerful tool that can be used to assign labels to videos. This can be done using transformers.VideoClassificationPipeline, which can be loaded from pipeline() using the task identifier "video-classification".
The pipeline accepts a single video or a batch of videos, passed as a single string or a list of strings. Videos in a batch must all be in the same format: either all http links or all local paths.
The top_k parameter determines the number of top labels that will be returned by the pipeline. If the provided number is higher than the number of labels available in the model configuration, it will default to the number of labels. By default, the pipeline returns the top 5 labels.
The num_frames parameter determines the number of frames sampled from the video to run the classification on. If not provided, it will default to the number of frames specified in the model configuration.
The frame_sampling_rate parameter determines the sampling rate used to select frames from the video. If not provided, it will default to 1, meaning every frame will be used.
The function_to_apply parameter determines the function to apply to the model output. By default, the pipeline will apply the softmax function to the output of the model.
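A closing sketch tying those parameters together; MCG-NJU/videomae-base-finetuned-kinetics is one video checkpoint, clip.mp4 is a hypothetical local file, and a video decoder backend (such as av or decord) is assumed to be installed:

```python
from transformers import pipeline

vc = pipeline("video-classification",
              model="MCG-NJU/videomae-base-finetuned-kinetics")

# top_k and num_frames are the parameters described above;
# frame_sampling_rate is left at its default of 1.
print(vc("clip.mp4", top_k=3, num_frames=16))
```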
Sources
- https://huggingface.co/docs/datasets/en/access
- https://huggingface.co/docs/transformers/en/pipeline_tutorial
- https://huggingface.co/docs/transformers/en/main_classes/pipelines
- https://medium.com/@lokaregns/text-summarization-with-hugging-face-transformers-a-beginners-guide-9e6c319bb5ed
- https://huggingface.co/transformers/v4.10.1/main_classes/pipelines.html