In this post of the PyTorch Introduction series, we’ll learn how to use custom datasets with PyTorch, particularly tabular, vision, and text data.


PyTorch is one of the hottest libraries in the Deep Learning field right now. Since ChatGPT’s release, deep learning libraries have arguably garnered the most attention among data scientists and machine learning engineers, particularly due to the current practical applications they enable.

With their ability to perform complex multidimensional calculations extremely fast, these libraries have changed the way we train neural network models. In particular, they are extremely helpful in managing the large number of weights that these models store and optimize. Rivaling TensorFlow (Google’s framework), PyTorch is Meta’s open-source framework, giving you the ability to train deep learning models with a very clean and practical syntax.

So far in this PyTorch series, we’ve covered a couple of fundamentals that give us the ability to work with this library.

We’ve used a few custom datasets in our examples and previous blog posts. As we progress towards training our deep learning models in this series, it’s very helpful to understand how we can integrate different datasets in the context of PyTorch. In this blog post, we’ll learn how to work with custom datasets in the library, particularly integrating three different types of data:

  • CSV Files
  • Image Data
  • Text Data

We’ll also get acquainted with the concept of data batches and how to use PyTorch’s DataLoader for that purpose. Some of the inspiration for this blog post comes from the Zero to Mastery PyTorch Free Course; feel free to check out this awesome resource with a lot of interesting learning examples.

Let’s start!


Creating a Random Dataset

First, let’s create a random dataset in PyTorch to understand how we can work with DataLoader:

import torch
from torch.utils.data import Dataset, DataLoader 

class RandomIntDataset(Dataset):
    def __init__(self, start, stop, x, y):
        # x rows of y random integers between start and stop
        self.data = torch.randint(start, stop, (x, y))
        # one random label between 0 and 9 for each row
        self.labels = torch.randint(0, 10, (x,))

    def __len__(self):
        # number of samples in the dataset
        return len(self.labels)

    def __str__(self):
        # string representation: features and labels side by side
        return str(torch.cat((self.data, self.labels.unsqueeze(1)), 1))

    def __getitem__(self, i):
        # return the (features, label) pair for index i
        return self.data[i], self.labels[i]

Our RandomIntDataset creates a tensor of random integers together with random labels. Notice how we implement some dunder methods in the class and inherit from torch.utils.data.Dataset (this will be especially helpful when we combine our dataset with DataLoader).

Based on the previous class, let’s create our first dataset object!

dataset = RandomIntDataset(100, 1000, 500, 10)
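Because the class implements __len__ and __getitem__, the dataset object already behaves like a sequence, so we can run a quick sanity check (the comments show the expected results for this particular call):

len(dataset)   # 500, the number of rows we requested
dataset[0]     # a (features, label) pair: a tensor of 10 random integers and one label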

Again, as we are inheriting from a PyTorch base class, we can use DataLoader to create a nice iterable:

dataset_loader = DataLoader(dataset, batch_size=10, shuffle=True)

It’s super common to pass data to neural networks in batches, and the DataLoader constructor takes care of that through a neat batch_size argument. As dataset_loader is an iterable, we can use next and iter to get batches of our data:

data, labels = next(iter(dataset_loader))
data
Tensor format of our random dataset — Image by Author

Run next(iter(dataset_loader)) again to see different data inside the data and labels objects: since shuffle=True, each fresh iterator yields a different random batch.
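If you want to walk through every batch in order, rather than re-creating the iterator each time, a plain for loop over the DataLoader does the trick. A minimal sketch:

for batch_number, (data, labels) in enumerate(dataset_loader):
    # each iteration yields a batch of 10 rows of features and 10 labels
    print(batch_number, data.shape, labels.shape)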

Of course, this random dataset can’t really be considered a “custom dataset”. It’s very unlikely that we’ll want to work with randomly generated data. But this intro was helpful to get familiar with batching, and we are now ready to proceed and include our first CSV in our PyTorch pipeline!


Combining DataLoader with Custom Datasets

As we’ve seen, using random datasets is a trivial exercise. But now that we know how DataLoader and data batches work, we can use that knowledge to create a new class for a custom PyTorch dataset:

import pandas as pd

class TaxiSample(Dataset):
    def __init__(self):
        super().__init__()
        df = pd.read_csv('data/taxi_data_sample.csv')
        
        features = ['passenger_count',
                    'pickup_longitude',
                    'pickup_latitude',
                    'dropoff_longitude',
                    'dropoff_latitude']
        
        target = 'trip_duration'
        
        self.features = torch.tensor(df[features].values, 
                                     dtype=torch.float32)

        self.labels = torch.tensor(df[target].values, 
                                     dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

Note: I’m importing the libraries as the post goes along, but in your own scripts all imports should go at the top!

The dataset we are using is a sampled version of the Taxi Trip Duration competition on Kaggle. Here, we choose to hard-code the CSV path inside __init__ (we could also pass it as an argument to make the class more flexible). Let’s see how this fits nicely with DataLoader:

data_taxi = TaxiSample()
dataset_loader = DataLoader(data_taxi, batch_size=20, shuffle=True)

With our batch size of 20, we can create an iterator on this dataset:

data_iterator = iter(dataset_loader)
data, labels = next(data_iterator)
Tensor format of our taxi data sample — Image by Author

Cool! These are 20 randomly sampled examples from the dataset. In the image above we can see the features plus the corresponding label.

The cool thing is that we can now extend our class with several features, for example:

  • Add standardization to the TaxiSample
  • Perform train and test split

Can you add these transformations to the class by yourself? (A possible sketch is shown below.) Using this abstraction enables us to feed an external CSV into PyTorch batches quite easily!
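Here is a minimal sketch of what that extension could look like, assuming scikit-learn is available. The class name TaxiSampleScaled and its default arguments are illustrative, not part of the original code:

from sklearn.model_selection import train_test_split

class TaxiSampleScaled(Dataset):
    def __init__(self, csv_path='data/taxi_data_sample.csv', train=True, test_size=0.2):
        super().__init__()
        df = pd.read_csv(csv_path)

        features = ['passenger_count',
                    'pickup_longitude',
                    'pickup_latitude',
                    'dropoff_longitude',
                    'dropoff_latitude']
        target = 'trip_duration'

        # split before scaling so the test set does not leak into the statistics
        X_train, X_test, y_train, y_test = train_test_split(
            df[features].values, df[target].values,
            test_size=test_size, random_state=42)

        X = X_train if train else X_test
        y = y_train if train else y_test

        # standardize features using the training set statistics
        mean, std = X_train.mean(axis=0), X_train.std(axis=0)
        X = (X - mean) / std

        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]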


Using Image Data

In this section of the blog post, we’ll use the Microsoft Research Cats vs. Dogs dataset to showcase PyTorch’s ability to deal with image data. Let’s start by defining our path using pathlib:

from pathlib import Path
data_path = Path("data/dogs_cats")

Inside the folder, I have two folders, one with images of dogs and the other with images of cats. Let’s extract the image paths from each folder:

cats and dogs folder — Image by Author
images of cats in our cats folder — Image by Author
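For reference, the folder is laid out roughly like this (the file names are just illustrative):

data/dogs_cats/
├── cats/
│   ├── cat_1.jpg
│   └── ...
└── dogs/
    ├── dog_1.jpg
    └── ...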

For this section, we’ll need torchvision, a cool extension of PyTorch with common computer vision transformations and architectures:

from torchvision import datasets, transforms

Let’s read the jpg filenames next:

image_dogs_list = list((data_path/'dogs').glob("*.jpg"))
image_cats_list = list((data_path/'cats').glob("*.jpg"))

…and combine our lists into a single object:

image_paths = image_cats_list + image_dogs_list

Let’s see if everything is working fine by extracting a random image from our list:

import random
from PIL import Image

random.seed(20)

random_image_path = random.choice(image_paths)
image_class = random_image_path.parent.stem

img = Image.open(random_image_path)

print(f"Random image path: {random_image_path}")
print(f"Image class: {image_class}")
print(f"Image height: {img.height}") 
print(f"Image width: {img.width}")
img
Random image from our cats and dogs folder — Image by Author

Cute little doggo!

One common step when transforming images into tensors is to resize them to a common size. We can use the transforms module to help us with that and build our first vision pipeline!

data_transform = transforms.Compose([
    # Resize the image
    transforms.Resize(size=(64, 64)),
    # Randomly flip the image horizontally (a data augmentation step)
    transforms.RandomHorizontalFlip(p=0.5),
    # Convert the image to a tensor
    transforms.ToTensor()])

In this transforms pipeline we are doing the following:

  • Resizing the image to 64x64 pixels.
  • Adding a step that performs a random horizontal flip, commonly used for data augmentation.
  • Converting the image to a tensor.
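If the model you later train expects normalized inputs, the same pipeline can be extended with a normalization step. A possible variant (the mean and std values below are the commonly used ImageNet statistics, an assumption here rather than something computed from this dataset):

data_transform_normalized = transforms.Compose([
    transforms.Resize(size=(64, 64)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    # normalize each channel; these are the usual ImageNet statistics
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])])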

Plotting one of our images and its transformed version side by side:

import matplotlib.pyplot as plt

def plot_transformed_images(image_paths: list, 
                            transform: transforms.Compose, 
                            n=3, 
                            seed=100):
    random.seed(seed)
    random_image_paths = random.sample(image_paths, k=n)
    for image_path in random_image_paths:
        with Image.open(image_path) as f:
            fig, ax = plt.subplots(1, 2)
            ax[0].imshow(f) 
            ax[0].set_title(f"Original Image \nSize: {f.size}")
            ax[0].axis("off") 
            transformed_image = transform(f).permute(1, 2, 0) 
            ax[1].imshow(transformed_image) 
            ax[1].set_title(f"Transformed Image \nSize: {transformed_image.shape}")
            ax[1].axis("off")
            fig.suptitle(f"Class: {image_path.parent.stem}", fontsize=16)

plot_transformed_images(image_paths, 
                        transform=data_transform, 
                        n=1)
Original Image and Transformed Image side-by-side— Image by Author

Cool! Let’s see how we can access the underlying tensors produced by the transform, for example, for the first image:

image_path = image_paths[0]
data_transform(Image.open(image_path))
Tensor generated by the transform method — Image by Author

The data_transform generated a 3-channel (RGB) 64x64 tensor!
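You can confirm the shape directly:

data_transform(Image.open(image_path)).shape  # torch.Size([3, 64, 64])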

One final thing before we move on: there is a faster way to create tensors and labels. We can use the handy ImageFolder to create our training data:

train_data = datasets.ImageFolder(root=data_path,
                                  transform=data_transform,
                                  target_transform=None)

With datasets.ImageFolder, we can access the class names directly:

class_names = train_data.classes
class_names
Output of class_names — Image by Author

The output of class_names contains the class labels (read from the folder names) for our computer vision model. train_data now also contains some important metadata about our computer vision pipeline:

train_data components — Image by Author
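A few of the attributes worth inspecting (the exact label mapping depends on your folder names):

train_data.class_to_idx     # e.g. {'cats': 0, 'dogs': 1}
len(train_data)             # total number of images found in both folders
img, label = train_data[0]  # a (transformed image tensor, integer label) pair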

And turning an ImageFolder dataset into a DataLoader is now quite straightforward:

train_dataloader = DataLoader(dataset=train_data, 
                              batch_size=5,
                              num_workers=1, 
                              shuffle=True)

train_dataloader

Voilà! We have an iterable:

img, label = next(iter(train_dataloader))
print(f"Image shape: {img.shape} -> [batch_size, color_channels, height, width]")
print(f"Label shape: {label.shape}")
Batches of 5 images — Image by Author

Cool! With this batch size of 5, we obtain 3x64x64 tensors, representing 64 by 64 pixel images with 3 channels (RGB).
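To take a quick look at one of the images in the batch, we can permute the tensor back to height x width x channels (as we did earlier) and plot it. A small sketch, reusing matplotlib and the class_names list from above:

plt.imshow(img[0].permute(1, 2, 0))
plt.title(class_names[label[0].item()])
plt.axis("off")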

Now we can use these data batches to train a machine learning model that recognizes images of dogs and cats! (this is actually something we will do in the next blog post of this series!)


Using Text Data

In this last bit of the post we’ll use fetch_20newsgroups from sklearn.datasets, in particular to create a string-to-integer mapping:

from sklearn.datasets import fetch_20newsgroups

Let’s load the following categories from the newsgroups dataset:

categories = [
    'comp.os.ms-windows.misc',
    'rec.sport.baseball',
    'rec.sport.hockey',
]

dataset = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, remove=('headers', 'footers', 'quotes'))
corpus = [item for item in dataset['data']]

This next function will pre-process the text:

import nltk
import re

# note: nltk.word_tokenize needs the punkt tokenizer (nltk.download('punkt'))

def preprocess_text(text: str) -> list:
    '''
    Pre-processes text from the input data by removing
    special characters and numbers.

    Returns a list of tokens.
    '''
    # remove special chars and numbers
    text = re.sub("[^A-Za-z]+", " ", text)
    tokens = nltk.word_tokenize(text.lower())
    return tokens
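For example:

preprocess_text("The Yankees won 5-3 last night!!")
# ['the', 'yankees', 'won', 'last', 'night']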

Building a function that creates our vocab, next:

def get_vocab(training_corpus):
    # add special tokens:
    # padding, end of line, unknown term
    vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
    for item in training_corpus:
        processed_text = preprocess_text(item)
        processed_text.sort()
        for word in processed_text:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

What’s the size of our full vocab?

vocab = get_vocab(corpus)
len(vocab)
Size of our Vocab — Image by Author

Our vocab contains a bit over 26 thousand words and each word is mapped to an integer:

Some of our vocab words — Image by Author
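You can peek at a few of the mappings yourself (the exact words depend on the corpus and on the sorting inside get_vocab):

dict(list(vocab.items())[:5])
# e.g. {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2, 'a': 3, 'able': 4}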

With this in place, we can write a function to convert our text into a tensor using the string-to-integer mapping:

def text_to_tensor(text: str, vocab_dict: dict) -> torch.Tensor:
    '''
    Pre-processes text and creates an integer mapping in tensor format.
    '''   
    word_l = preprocess_text(text)
        
    # initialize an empty list of token ids
    tensor_l = [] 
    
    # take the __UNK__ value from the vocabulary 
    unk_ID = vocab_dict['__UNK__']
            
    # for each word in the list:
    for word in word_l:
        # take the index;
        # if the word is not in vocab_dict, then assign UNK
        word_ID = vocab_dict.get(word, unk_ID)
        # append to the list of ids
        tensor_l.append(word_ID)

    return torch.tensor(tensor_l)

An example of how this function converts text to a tensor is shown below for the first 200 characters of the first document:

snippet = corpus[0][0:200]
print('The text: "{}" is represented by {}'.format(snippet, text_to_tensor(snippet, vocab)))
Text and Tensor STOI format — Image by Author

Take note that we are only using a string-to-integer mapping to create the tensor. We are not applying any type of word vector approach that encapsulates meaning. That is reserved for word vectors (such as BERT, Word2Vec, or others), where there is a mathematical logic and idea behind the tensor. In this case, the tensor is only a numeric mapping between a word’s position in the vocab and its placement in the text. Nevertheless, this is a good alternative to the usual one-hot vectors used in text processing.
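Since each document produces a tensor of a different length, batching these with a DataLoader, as we did for the other datasets, requires padding each batch to a common length. Below is a minimal sketch of one way to do it with pad_sequence and the __PAD__ index; the NewsgroupsDataset class and pad_collate function are illustrative names, not something defined earlier in this post:

from torch.nn.utils.rnn import pad_sequence

class NewsgroupsDataset(Dataset):
    def __init__(self, texts, labels, vocab_dict):
        self.texts = [text_to_tensor(text, vocab_dict) for text in texts]
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def pad_collate(batch):
    # pad every sequence in the batch to the length of the longest one
    sequences, labels = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True,
                          padding_value=vocab['__PAD__'])
    return padded, torch.stack(labels)

text_data = NewsgroupsDataset(corpus, dataset['target'], vocab)
text_loader = DataLoader(text_data, batch_size=8, shuffle=True,
                         collate_fn=pad_collate)

Each batch yielded by text_loader is then a padded (batch_size, max_length) integer tensor plus a tensor with the corresponding category labels.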


Thank you for taking the time to read this post! Using custom data within PyTorch enables us to train our own deep learning models.

In this post, we’ve checked several important topics on how to generate our own tensor data based on custom datasets:

  • Learning how to work with the Dataset constructor.
  • Understanding how to work with DataLoader for batching data.
  • Loading a CSV within the Dataset constructor.
  • Creating our own image data pipeline with transforms.Compose.
  • Finally, transforming raw text to tensors with a string-to-integer mapping.

If you would like to read more posts in this PyTorch series, check the following links:

Resources for this post:

  • Taxi Trip Duration — https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  • Cats and Dogs images citation:
    @inproceedings{asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization,
      author    = {Elson, Jeremy and Douceur, John (JD) and Howell, Jon and Saul, Jared},
      title     = {Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization},
      booktitle = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
      year      = {2007},
      month     = {October},
      publisher = {Association for Computing Machinery, Inc.},
      url       = {https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/},
    }