Optimizing Your Deep Learning Models with Keras Pre-Processing Techniques

Keras is a high-level deep learning library that provides a number of pre-processing functions for preparing data for model training and prediction. Here are some of the pre-processing functions available in Keras:

Keras API Functions
  1. Image Pre-processing
  • ImageDataGenerator: for data augmentation, rescaling, and image normalization.
  • image_dataset_from_directory: for creating image datasets from directories on disk.
  2. Text Pre-processing
  • TextVectorization: for converting text into sequences of tokens.
  • text_dataset_from_directory: for creating text datasets from directories on disk.
  • tf.data.TextLineDataset: for creating text datasets from text files.
  3. Numeric Pre-processing
  • normalize: for normalizing numeric data.
  • Normalization layer: for scaling numeric data to have zero mean and unit variance.
  • Rescaling layer: for scaling numeric data to a specified range.
  4. Sequence Pre-processing
  • pad_sequences: for padding sequences to a fixed length.
  • skipgrams: for generating skip-gram pairs from sequences.
  • make_sampling_table: for generating a word-frequency sampling table for use with skipgrams.

These pre-processing functions can be used to prepare data for training and prediction with Keras models.

Keras Pre-Processing Functions
ImageDataGenerator

ImageDataGenerator is a class in the Keras deep learning library that provides real-time data augmentation for image data during the model training process. It generates batches of image data with various data augmentations applied to them, such as scaling, shifting, rotating, flipping, and more. These augmented images can be used to train deep learning models, which can help to improve the model’s performance and make it more robust to variations in the input data.

The ImageDataGenerator class can also be used to perform on-the-fly data normalization, such as rescaling pixel values to a range of 0 to 1, centering pixel values around the mean, and scaling them to have a unit variance. This can help to improve the convergence of the model during training.
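For example, here is a minimal sketch of these normalization options; the random sample array below is a placeholder standing in for real training images:

from keras.preprocessing.image import ImageDataGenerator
import numpy as np

# Rescale pixel values from [0, 255] to [0, 1] on the fly
datagen = ImageDataGenerator(rescale=1.0 / 255)

# Alternatively, center around the dataset mean and scale to unit variance.
# The featurewise_* options require calling fit() on sample data first so
# that the generator can compute the dataset statistics.
datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True
)
x_sample = np.random.rand(100, 224, 224, 3)  # placeholder image batch
datagen.fit(x_sample)  # computes the mean and std used for normalization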

ImageDataGenerator works with models built using either the Sequential or the Functional (Model) API, and it is often used in conjunction with the flow() or flow_from_directory() methods to generate batches of augmented image data during training.

Example Code

Here’s an example code that shows how to use the ImageDataGenerator class in Keras to perform data augmentation on a set of images:

In the code below, we create an instance of ImageDataGenerator with several data augmentation parameters, such as rotation, shifting, shearing, and flipping. We then use the flow_from_directory() method to load images from a directory and apply the data augmentation in real time during model training.

from keras.preprocessing.image import ImageDataGenerator

# Create an instance of ImageDataGenerator with data augmentation parameters
datagen = ImageDataGenerator(
    rotation_range=20, # randomly rotate images by up to 20 degrees
    width_shift_range=0.1, # randomly shift images horizontally by up to 10%
    height_shift_range=0.1, # randomly shift images vertically by up to 10%
    shear_range=0.2, # randomly apply shearing transformations
    zoom_range=0.2, # randomly zoom in on images
    horizontal_flip=True, # randomly flip images horizontally
    fill_mode='nearest' # fill in missing pixels with nearest pixel value
)

# Load images from a directory and apply data augmentation
train_generator = datagen.flow_from_directory(
    'train_dir', # directory containing training images
    target_size=(224, 224), # resize images to 224x224 pixels
    batch_size=32, # generate batches of 32 images
    class_mode='categorical' # use categorical labels
)

You can load an image from a file path (here, /train_dir/ambulance/ambulance1.jpg) as a PIL (Python Imaging Library) image using the load_img function, then convert the PIL image to a Numpy array using the img_to_array function. With the default channels-last layout, the resulting array has shape (height, width, 3), one channel each for red, green, and blue. Reshaping with a leading dimension of 1 produces the 4D batch that the flow() method expects. Overall, this code demonstrates how to use Keras’ ImageDataGenerator class to perform data augmentation on a single input image and generate a batch of augmented images for visualization.

from keras.utils import load_img, img_to_array
import matplotlib.pyplot as plt

img = load_img('/train_dir/ambulance/ambulance1.jpg')  # this is a PIL image
x = img_to_array(img)  # Numpy array with shape (height, width, 3)
x = x.reshape((1,) + x.shape)  # add a batch dimension: (1, height, width, 3)

# create an iterator that yields augmented versions of the image
aug_iter = datagen.flow(x, batch_size=1)

# generate samples and plot
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15, 15))

# generate a batch of augmented images
for i in range(3):
    # convert to unsigned integers for display
    image = next(aug_iter)[0].astype('uint8')

    # plot image
    ax[i].imshow(image)
    ax[i].axis('off')
Figure: Ambulance augmented images
image_dataset_from_directory

image_dataset_from_directory is a function in the Keras utils module that allows you to create a tf.data.Dataset object from a directory of image files. The function automatically infers the labels of the images based on the subdirectories in the input directory, and it can resize, shuffle, batch, and split the images as it loads them.

Example Code

Here’s an example code that shows how to use image_dataset_from_directory to create a tf.data.Dataset object from a directory of images:

import tensorflow as tf
from keras.utils import image_dataset_from_directory

# Define the input directory and batch size
input_dir = "path/to/input_directory"
batch_size = 32

# Create a dataset from the input directory
dataset = image_dataset_from_directory(
    input_dir,
    labels="inferred",
    batch_size=batch_size,
    image_size=(224, 224),
    validation_split=0.2,
    subset="training",
    seed=42,
    shuffle=True,
)

# Print the class names and number of classes
class_names = dataset.class_names
num_classes = len(class_names)
print("Class names:", class_names)
print("Number of classes:", num_classes)

In this code, we first define the input directory and batch size. We then call image_dataset_from_directory to create a dataset from the input directory. The labels parameter is set to “inferred” to automatically infer the labels of the images based on the subdirectories in the input directory. We also specify the image size, validation split, subset, seed, and shuffle parameters for additional configuration.

Finally, we print the class names and number of classes in the dataset, which are inferred from the subdirectories in the input directory. The resulting dataset object can be used to train a TensorFlow model, for example, by passing it to the fit() method of a Keras model.
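As a brief illustration, a minimal model could consume the dataset like this; the architecture below is purely illustrative:

from keras import layers, models

# A small illustrative classifier; num_classes comes from the code above
model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),
])

# image_dataset_from_directory yields integer labels by default,
# so we use sparse categorical cross-entropy
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The dataset can be passed directly to fit()
model.fit(dataset, epochs=5)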

TextVectorization

TextVectorization is a Keras layer that allows you to vectorize text data and convert it into a sequence of integers that can be fed into a deep learning model. It provides various text preprocessing and tokenization options, such as converting text to lowercase, stripping punctuation, and splitting text into individual words or n-grams.

The TextVectorization layer lives in the Keras layers module, and it can be used to preprocess and vectorize text data directly within a Keras deep learning model. This can help to simplify the data preprocessing and model building process by allowing you to include the text vectorization step as part of the model architecture.
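For instance, here is a sketch of an end-to-end model that accepts raw strings as input; the tiny corpus passed to adapt below is a placeholder:

from tensorflow.keras import layers, models
from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(max_tokens=1000, output_sequence_length=10)
vectorizer.adapt(["some training text", "more training text"])  # placeholder corpus

# Raw strings in, predictions out: vectorization is part of the model
model = models.Sequential([
    layers.Input(shape=(1,), dtype="string"),
    vectorizer,
    layers.Embedding(input_dim=1000, output_dim=16),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])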

Example Code

Here’s an example code that shows how to use TextVectorization in Keras to vectorize text data:

from tensorflow.keras.layers import TextVectorization

# Define some example text data
texts = ['This is an example sentence.',
         'This is another example sentence.',
         'Yet another example sentence.']

# Create a TextVectorization layer
vectorizer = TextVectorization(max_tokens=1000, output_mode="int", output_sequence_length=10)

# Adapt the TextVectorization layer to the text data
vectorizer.adapt(texts)

# Convert the text to a sequence of integers
sequences = vectorizer(texts)

# Print the sequences
print(sequences)

In this code, we first define some example text data as a list of strings. We then create a TextVectorization layer with a maximum vocabulary size of 1000 and an output mode of “int”, which outputs integer indices that correspond to each word in the vocabulary. We also set the output_sequence_length parameter to 10, which specifies the maximum length of the output sequences.

Furthermore, we then adapt the TextVectorization layer to the text data using the adapt method, which updates the internal vocabulary and other parameters based on the input texts.

Finally, we convert the text data to a sequence of integers using the vectorizer layer as a callable, which applies the text preprocessing and tokenization steps to the input text and outputs a sequence of integers that correspond to the words in the vocabulary. The resulting sequences object is a 2D Tensor with shape (3, 10), where each row represents a sequence of integers corresponding to a single input text.
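If you want to inspect how words map to indices, the layer’s get_vocabulary method returns the learned vocabulary; index 0 is reserved for padding and index 1 for out-of-vocabulary tokens:

# Look up the vocabulary learned during adapt()
vocab = vectorizer.get_vocabulary()
print(vocab[:8])

# Map the first sequence back to words to verify the encoding
print([vocab[i] for i in sequences[0].numpy()])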

text_dataset_from_directory

text_dataset_from_directory is a function in the Keras utils module that allows you to create a tf.data.Dataset object from a directory of text files. The function automatically infers the labels of the text files based on the subdirectories in the input directory; the resulting batches of raw strings can then be vectorized, for example with a TextVectorization layer as shown below.

Example Code

Here’s an example code that shows how to use text_dataset_from_directory to create a tf.data.Dataset object from a directory of text files:

import tensorflow as tf
from keras.layers import TextVectorization
from keras.utils import text_dataset_from_directory

# Define the input directory and batch size
input_dir = "path/to/input_directory"
batch_size = 32

# Create a TextVectorization layer for preprocessing and vectorization
vectorizer = TextVectorization(max_tokens=1000, output_mode="int", output_sequence_length=10)

# Create a dataset from the input directory
dataset = text_dataset_from_directory(
    input_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=42,
)

# Fit the TextVectorization layer on the text data
text_batch = dataset.map(lambda x, y: x)
vectorizer.adapt(text_batch)

# Apply the TextVectorization layer to the text data
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorizer(text), label

# Map the vectorization function to the dataset
vectorized_ds = dataset.map(vectorize_text)

# Print the class names and number of classes
class_names = dataset.class_names
num_classes = len(class_names)
print("Class names:", class_names)
print("Number of classes:", num_classes)

In this code, we first define the input directory and batch size. We then create a TextVectorization layer with a maximum vocabulary size of 1000 and an output mode of “int”, which outputs integer indices that correspond to each word in the vocabulary. We also set the output_sequence_length parameter to 10, which specifies the maximum length of the output sequences.

Furthermore, we then create a dataset from the input directory using the text_dataset_from_directory function, which automatically infers the labels of the text files based on the subdirectories in the input directory. We also specify the validation_split, subset, seed, and other parameters for additional configuration.

Next, we fit the vectorizer layer on the text data using the adapt method, which updates the internal vocabulary and other parameters based on the input text. We also define a vectorization function vectorize_text that applies the vectorizer layer to each text input in the dataset and returns a tuple of the vectorized text and the label.

Finally, we map the vectorize_text function to the dataset object using the map method, which applies the function to each element in the dataset and outputs a new dataset with the vectorized text and labels. We also print the class names and number of classes in the dataset, which are inferred from the subdirectories in the input directory. The resulting vectorized_ds object can be used to train a TensorFlow model, for example, by passing it to the fit() method of a Keras model.
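As an optional tuning step, the standard tf.data idiom of caching and prefetching can be applied before training:

# Cache preprocessed batches and overlap preprocessing with training
AUTOTUNE = tf.data.AUTOTUNE
vectorized_ds = vectorized_ds.cache().prefetch(buffer_size=AUTOTUNE)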

tf.data.TextLineDataset

tf.data.TextLineDataset is a class in TensorFlow that allows you to create a tf.data.Dataset object from one or more text files. Each line in the text file(s) is treated as a separate example in the dataset. This class can be used to build custom data pipelines for processing text data in TensorFlow.

Example Code

Here’s an example code that shows how to use tf.data.TextLineDataset to create a tf.data.Dataset object from a text file:

import tensorflow as tf

# Define the input text file and batch size
input_file = "path/to/input_file.txt"
batch_size = 32

# Create a TextLineDataset from the input file
dataset = tf.data.TextLineDataset(input_file)

# Define a function to preprocess the text
def preprocess_text(text):
    # Perform some text preprocessing here
    return text

# Apply the preprocessing function to the text data
dataset = dataset.map(preprocess_text)

# Shuffle and batch the dataset
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)

# Print the first batch of data
for batch in dataset.take(1):
    print(batch)

In this code, we first define the input text file and batch size. We then create a TextLineDataset object from the input file using the tf.data.TextLineDataset class. Each line in the text file is treated as a separate example in the dataset.

We then define a function preprocess_text that performs some text preprocessing on the input text. This function can be customized to perform any necessary text preprocessing steps, such as tokenization, filtering out stop words, or converting text to lowercase.
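As one possible illustrative implementation, the function could lowercase each line and strip punctuation using TensorFlow string ops, which run inside the tf.data graph:

import tensorflow as tf

def preprocess_text(text):
    # Lowercase the line and remove punctuation characters
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^\w\s]", "")
    return text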

Next, we apply the preprocess_text function to the text data using the map method of the dataset object. This creates a new dataset with the preprocessed text data.

We then shuffle and batch the dataset using the shuffle and batch methods of the dataset object; shuffling before batching ensures that individual examples, rather than whole batches, are shuffled. Finally, we print the first batch of data using the take method of the dataset object.

Note that tf.data.TextLineDataset is a powerful class that can be used to build custom data pipelines for text data in TensorFlow. It can be used in combination with other TensorFlow datasets and preprocessing functions to create complex data processing pipelines.

normalize

normalize is a function in the Keras utils module that allows you to normalize the data in a dataset along a specific axis; by default it applies L2 (unit-norm) normalization. Normalization is a common preprocessing step in machine learning that can improve the performance and stability of many algorithms. In Keras, normalize can be used to preprocess the data before training a deep learning model.

Example Code

Here’s an example code that shows how to use normalize in Keras to normalize the features of a dataset:

from keras.utils import load_img, img_to_array, normalize

# Load an example image
img = load_img('path/to/image.jpg')

# Convert the image to a Numpy array
x = img_to_array(img)

# Normalize the image along the channel axis
x_norm = normalize(x, axis=-1)

# Print the original and normalized data
print("Original data:\n", x)
print("Normalized data:\n", x_norm)

In this code, we first load an example image using the load_img function from the Keras utils module. We then convert the image to a Numpy array using the img_to_array function, which converts the PIL image object to a Numpy array with shape (height, width, channels).

Next, we normalize the data along the channel axis using the normalize function. The axis parameter specifies the axis along which to normalize the data. In this case, we set it to -1, which normalizes along the last (channel) axis, so each pixel’s RGB vector is scaled to unit length.
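A quick sanity check on a small array shows the L2 behavior: each vector along the chosen axis is divided by its Euclidean norm.

import numpy as np
from keras.utils import normalize

v = np.array([[3.0, 4.0]])
print(normalize(v, axis=-1))                  # [[0.6 0.8]]
print(np.linalg.norm(normalize(v, axis=-1)))  # 1.0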

Finally, we print the original and normalized data using the print function.

Note that normalize can be combined with other preprocessing steps, such as reshaping or rescaling the data. Additionally, normalize can be used as part of a pipeline with a Keras deep learning model to normalize the data before training.

pad_sequences

pad_sequences is a function in the Keras preprocessing module that allows you to pad sequences to a specified length. Padding is a common preprocessing step in natural language processing that can be used to standardize the length of text sequences in a dataset. In Keras, pad_sequences can be used to preprocess the text data before training a deep learning model.

Example Code

Here’s an example code that shows how to use pad_sequences in Keras to pad sequences of text data:

import tensorflow as tf

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Define some example text data
texts = ['this is a sentence',
         'this is another sentence',
         'a third sentence',
         'yet another sentence']

# Create a Tokenizer object
tokenizer = Tokenizer(num_words=1000)

# Fit the Tokenizer on the text data
tokenizer.fit_on_texts(texts)

# Convert the text data to sequences of integer indices
sequences = tokenizer.texts_to_sequences(texts)

# Pad the sequences to a maximum length of 6
padded_sequences = pad_sequences(sequences, maxlen=6, padding='post')

# Print the original and padded sequences
print("Original sequences:\n", sequences)
print("Padded sequences:\n", padded_sequences)

# Output
Original sequences:
 [[2, 3, 4, 1], [2, 3, 5, 1], [4, 6, 1], [7, 5, 1]]
Padded sequences:
 [[2 3 4 1 0 0]
 [2 3 5 1 0 0]
 [4 6 1 0 0 0]
 [7 5 1 0 0 0]]

In this code, we first define some example text data as a list of strings. We then create a Tokenizer object and fit it on the text data using the fit_on_texts method, which updates the internal vocabulary of the tokenizer with the words in the text data.

Next, we convert the text data to sequences of integer indices using the texts_to_sequences method of the tokenizer object. This creates a list of sequences, where each sequence represents a text string as a list of integer indices corresponding to the words in the vocabulary.

Finally, we pad the sequences to a maximum length of 6 using the pad_sequences function of the Keras preprocessing module. The maxlen parameter specifies the maximum length of the sequences, and the padding parameter specifies whether to add padding at the beginning or end of each sequence. The resulting padded_sequences object is a 2D Numpy array with the same number of rows as the original sequences, but with padded values at the end of each sequence.
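The default padding mode is 'pre', which adds zeros at the beginning of each sequence instead, and the related truncating parameter controls which end is cut when a sequence is longer than maxlen:

# 'pre' padding (the default) adds zeros at the start of each sequence
pre_padded = pad_sequences(sequences, maxlen=6, padding='pre')

# truncating='post' cuts values from the end of sequences longer than maxlen
truncated = pad_sequences(sequences, maxlen=3, truncating='post')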

We then print the original and padded sequences using the print function. Note that pad_sequences can be used in combination with other Keras preprocessing utilities, such as Tokenizer, to preprocess the text data in different ways. Additionally, pad_sequences can be used as part of a pipeline with a Keras deep learning model to pad the data before training.

skipgrams

skipgrams is a function in the Keras preprocessing module that generates skip-gram pairs from text data. Skip-grams are a popular method for training word embeddings, which are dense vector representations of words that can capture semantic and syntactic relationships between them. In Keras, skipgrams can be used to preprocess the text data before training a deep learning model.

Example Code

Here’s an example code that shows how to use skipgrams in Keras to generate skip-gram pairs from text data:

import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import skipgrams

# Define some example text data
texts = ['this is a sentence',
         'this is another sentence',
         'a third sentence',
         'yet another sentence']

# Create a Tokenizer object
tokenizer = Tokenizer(num_words=1000)

# Fit the Tokenizer on the text data
tokenizer.fit_on_texts(texts)

# Convert the text data to sequences of integer indices
sequences = tokenizer.texts_to_sequences(texts)

# Generate skip-gram pairs from the first sequence
# (skipgrams operates on a single sequence of word indices)
pairs, labels = skipgrams(sequences[0], vocabulary_size=1000,
                          window_size=2, negative_samples=5)

# Print the skip-gram pairs and their labels
# (output varies from run to run because pairs are shuffled and
# negative samples are drawn at random)
print("Skip-gram pairs:\n", pairs)
print("Labels:\n", labels)

In this code, we first define some example text data as a list of strings. We then create a Tokenizer object and fit it on the text data using the fit_on_texts method, which updates the internal vocabulary of the tokenizer with the words in the text data.

Next, we convert the text data to sequences of integer indices using the texts_to_sequences method of the tokenizer object. This creates a list of sequences, where each sequence represents a text string as a list of integer indices corresponding to the words in the vocabulary.

Finally, we generate skip-gram pairs using the skipgrams function of the Keras preprocessing module. Note that skipgrams operates on a single sequence of word indices, so we pass it one sequence at a time. The vocabulary_size parameter specifies the size of the vocabulary, the window_size parameter specifies the size of the skip-gram window, and the negative_samples parameter specifies how many negative (randomly paired) samples to generate per positive pair, expressed as a ratio.

The result is a tuple of two lists: the first contains (target, context) word-index pairs, and the second contains the corresponding labels, 1 for genuine context pairs and 0 for negative samples.

Finally, we print the skip-gram pairs and their labels using the print function; because the pairs are shuffled and the negative samples are drawn at random, the exact output varies from run to run.
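The make_sampling_table function listed earlier pairs naturally with skipgrams: it produces word-frequency-based sampling probabilities (assuming a Zipf-like distribution over word ranks) that skipgrams uses to subsample very frequent words. A brief sketch:

from keras.preprocessing.sequence import make_sampling_table

# Sampling probabilities for a vocabulary of 1000 words
sampling_table = make_sampling_table(size=1000)

# Frequent words are now down-sampled when generating pairs
pairs, labels = skipgrams(sequences[0], vocabulary_size=1000,
                          window_size=2, sampling_table=sampling_table)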

Note that skipgrams can be used in combination with other Keras preprocessing functions, such as Tokenizer, to preprocess the text data in different ways. Additionally, skipgrams can be used as part of a pipeline to generate training pairs for a word-embedding model.

Notice

Deprecated: tf.keras.preprocessing APIs do not operate on tensors and are not recommended for new code. Prefer loading data with either tf.keras.utils.text_dataset_from_directory or tf.keras.utils.image_dataset_from_directory, and then transforming the output tf.data.Dataset with preprocessing layers. These approaches will offer better performance and integration with the broader TensorFlow ecosystem. For more information, see the tutorials for loading text, loading images, and augmenting images, as well as the preprocessing layer guide.
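As a sketch of this recommended replacement for the ImageDataGenerator example above (the directory name is a placeholder, and the augmentation layers are available as keras.layers in recent versions):

import tensorflow as tf
from keras import layers

# Load images; "train_dir" is a placeholder path
train_ds = tf.keras.utils.image_dataset_from_directory(
    "train_dir", image_size=(224, 224), batch_size=32)

# Augmentation expressed as preprocessing layers
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.2),
])

# Apply the augmentation to each batch in the tf.data pipeline
train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y),
                        num_parallel_calls=tf.data.AUTOTUNE)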
