Keras is a high-level deep learning library that provides a number of pre-processing functions for preparing data for model training and prediction. Here are some of the pre-processing functions available in Keras:

- Image Pre-processing
  - ImageDataGenerator: for data augmentation, rescaling, and image normalization.
  - image_dataset_from_directory: for creating image datasets from directories on disk.
- Text Pre-processing
  - TextVectorization: for converting text into sequences of tokens.
  - text_dataset_from_directory: for creating text datasets from directories on disk.
  - tf.data.TextLineDataset: for creating text datasets from text files.
- Numeric Pre-processing
  - normalize: for normalizing numeric data along a given axis.
  - For scaling numeric data to zero mean and unit variance or to a specified range, scikit-learn's StandardScaler and MinMaxScaler are commonly used alongside Keras.
- Sequence Pre-processing
  - pad_sequences: for padding sequences to a fixed length.
  - skipgrams: for generating skip-gram pairs from sequences.
  - make_sampling_table: for generating a sampling table for use in negative sampling.
These pre-processing functions can be used to prepare data for training and prediction with Keras models.
Keras Pre-Processing Functions
ImageDataGenerator
ImageDataGenerator is a class in the Keras deep learning library that provides real-time data augmentation for image data during the model training process. It generates batches of image data with various data augmentations applied to them, such as scaling, shifting, rotating, flipping, and more. These augmented images can be used to train deep learning models, which can help to improve the model’s performance and make it more robust to variations in the input data.
The ImageDataGenerator class can also be used to perform on-the-fly data normalization, such as rescaling pixel values to a range of 0 to 1, centering pixel values around the mean, and scaling them to have unit variance. This can help to improve the convergence of the model during training.
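For example, here is a minimal sketch of those normalization options (the x_train array is a hypothetical placeholder for a set of training images); note that the featurewise statistics must be computed from the training data with fit() before the generator is used:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# hypothetical placeholder for a set of training images,
# shape (num_samples, height, width, channels)
x_train = np.random.randint(0, 256, size=(100, 64, 64, 3)).astype("float32")

datagen = ImageDataGenerator(
    rescale=1.0 / 255,                  # rescale pixel values to the range 0 to 1
    featurewise_center=True,            # center pixel values around the dataset mean
    featurewise_std_normalization=True  # scale pixel values to unit variance
)

# compute the featurewise mean and standard deviation from the data
datagen.fit(x_train)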
ImageDataGenerator can be used with various Keras deep learning models, such as Sequential and Functional (Model) API models, and it is often used in conjunction with the flow() or flow_from_directory() methods to generate batches of augmented image data during training.
Example Code
Here’s an example that shows how to use the ImageDataGenerator class in Keras to perform data augmentation on a set of images. In this code, we create an instance of ImageDataGenerator with several data augmentation parameters, such as rotation, shifting, shearing, and flipping. We then use the flow_from_directory() method to load images from a directory and apply the data augmentation in real time during model training.
from keras.preprocessing.image import ImageDataGenerator

# Create an instance of ImageDataGenerator with data augmentation parameters
datagen = ImageDataGenerator(
    rotation_range=20,      # randomly rotate images by up to 20 degrees
    width_shift_range=0.1,  # randomly shift images horizontally by up to 10%
    height_shift_range=0.1, # randomly shift images vertically by up to 10%
    shear_range=0.2,        # randomly apply shearing transformations
    zoom_range=0.2,         # randomly zoom in on images
    horizontal_flip=True,   # randomly flip images horizontally
    fill_mode='nearest'     # fill in missing pixels with the nearest pixel value
)

# Load images from a directory and apply data augmentation
train_generator = datagen.flow_from_directory(
    'train_dir',             # directory containing training images
    target_size=(224, 224),  # resize images to 224x224 pixels
    batch_size=32,           # generate batches of 32 images
    class_mode='categorical' # use categorical labels
)
You can also augment a single image. The load_img function loads an image file as a PIL (Python Imaging Library) image, and the img_to_array function converts the PIL image to a NumPy array of shape (height, width, 3), with one channel each for the red, green, and blue color components. The following code demonstrates how to use Keras’ ImageDataGenerator class to perform data augmentation on an input image and generate a batch of augmented images for visualization.
from keras.utils import load_img, img_to_array
import matplotlib.pyplot as plt

img = load_img('/train_dir/ambulance/ambulance1.jpg')  # this is a PIL image
x = img_to_array(img)          # NumPy array with shape (height, width, 3)
x = x.reshape((1,) + x.shape)  # add a batch axis: (1, height, width, 3)

# iterator that yields batches of augmented images
aug_iter = datagen.flow(x, batch_size=1)

# generate samples and plot
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15, 15))

# generate a batch of augmented images
for i in range(3):
    # convert to unsigned integers for display
    image = next(aug_iter)[0].astype('uint8')
    # plot image
    ax[i].imshow(image)
    ax[i].axis('off')

image_dataset_from_directory
image_dataset_from_directory is a function in the Keras utilities that allows you to create a tf.data.Dataset object from a directory of image files. The function automatically infers the labels of the images based on the subdirectories in the input directory, and it can optionally resize, shuffle, and split the images.
Example Code
Here’s an example that shows how to use image_dataset_from_directory to create a tf.data.Dataset object from a directory of images:
import tensorflow as tf
from keras.utils import image_dataset_from_directory

# Define the input directory and batch size
input_dir = "path/to/input_directory"
batch_size = 32

# Create a dataset from the input directory
dataset = image_dataset_from_directory(
    input_dir,
    labels="inferred",
    batch_size=batch_size,
    image_size=(224, 224),
    validation_split=0.2,
    subset="training",
    seed=42,
    shuffle=True,
)
# Print the class names and number of classes
class_names = dataset.class_names
num_classes = len(class_names)
print("Class names:", class_names)
print("Number of classes:", num_classes)
In this code, we first define the input directory and batch size. We then call image_dataset_from_directory to create a dataset from the input directory. The labels parameter is set to “inferred” to automatically infer the labels of the images based on the subdirectories in the input directory. We also specify the image size, validation split, subset, seed, and shuffle parameters for additional configuration. Finally, we print the class names and number of classes in the dataset, which are inferred from the subdirectories in the input directory. The resulting dataset object can be used to train a TensorFlow model, for example, by passing it to the fit() method of a Keras model.
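As a minimal sketch of that last step (the architecture below is an illustrative placeholder, reusing the dataset and num_classes variables from the example above), the dataset can be passed directly to fit():
from keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Rescaling(1.0 / 255),  # scale pixel values to the range 0 to 1
    layers.Conv2D(16, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),
])

# the dataset yields integer labels by default, so we use a sparse loss
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(dataset, epochs=5)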
TextVectorization
TextVectorization is a Keras layer that allows you to vectorize text data and convert it into a sequence of integers that can be fed into a deep learning model. It provides various text preprocessing and tokenization options, such as standardizing and lowercasing the text and splitting it into individual words or n-grams.
The TextVectorization layer is part of the Keras preprocessing layers, and it can be used to preprocess and vectorize text data directly within a Keras deep learning model. This can help to simplify the data preprocessing and model building process by allowing you to include the text vectorization step as part of the model architecture.
Example Code
Here’s an example that shows how to use TextVectorization in Keras to vectorize text data:
from tensorflow.keras.layers import TextVectorization
# Define some example text data
texts = ['This is an example sentence.',
         'This is another example sentence.',
         'Yet another example sentence.']
# Create a TextVectorization layer
vectorizer = TextVectorization(max_tokens=1000, output_mode="int", output_sequence_length=10)
# Adapt the TextVectorization layer to the text data
vectorizer.adapt(texts)
# Convert the text to a sequence of integers
sequences = vectorizer(texts)
# Print the sequences
print(sequences)
In this code, we first define some example text data as a list of strings. We then create a TextVectorization layer with a maximum vocabulary size of 1000 and an output mode of “int”, which outputs integer indices that correspond to each word in the vocabulary. We also set the output_sequence_length parameter to 10, which pads or truncates each output sequence to exactly 10 tokens.
We then adapt the TextVectorization layer to the text data using the adapt method, which builds the internal vocabulary and other parameters from the input texts.
Finally, we convert the text data to sequences of integers by calling the vectorizer layer on the texts, which applies the text preprocessing and tokenization steps and outputs the integer indices that correspond to the words in the vocabulary. The resulting sequences object is a 2D tensor with shape (3, 10), where each row represents a single input text as a sequence of integers.
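Because TextVectorization is a layer, it can also be placed inside a model so that the model accepts raw strings. Here is a minimal sketch (the layer sizes are illustrative placeholders, reusing the adapted vectorizer from above):
from keras import layers, models

model = models.Sequential([
    layers.Input(shape=(1,), dtype="string"),  # one raw string per example
    vectorizer,                                # text -> integer sequence
    layers.Embedding(input_dim=1000, output_dim=16),  # matches max_tokens
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")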
text_dataset_from_directory
text_dataset_from_directory is a function in the Keras utilities that allows you to create a tf.data.Dataset object from a directory of text files. The function automatically infers the labels of the text files based on the subdirectories in the input directory, and text preprocessing and vectorization can then be applied to the resulting dataset.
Example Code
Here’s an example that shows how to use text_dataset_from_directory to create a tf.data.Dataset object from a directory of text files:
import tensorflow as tf
from keras.layers import TextVectorization
from keras.utils import text_dataset_from_directory
# Define the input directory and batch size
input_dir = "path/to/input_directory"
batch_size = 32
# Create a TextVectorization layer for preprocessing and vectorization
vectorizer = TextVectorization(max_tokens=1000, output_mode="int", output_sequence_length=10)
# Create a dataset from the input directory
dataset = text_dataset_from_directory(
    input_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=42,
)
# Fit the TextVectorization layer on the text data
text_batch = dataset.map(lambda x, y: x)
vectorizer.adapt(text_batch)
# Apply the TextVectorization layer to the text data
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorizer(text), label
# Map the vectorization function to the dataset
vectorized_ds = dataset.map(vectorize_text)
# Print the class names and number of classes
class_names = dataset.class_names
num_classes = len(class_names)
print("Class names:", class_names)
print("Number of classes:", num_classes)
In this code, we first define the input directory and batch size. We then create a TextVectorization layer with a maximum vocabulary size of 1000 and an output mode of “int”, which outputs integer indices that correspond to each word in the vocabulary. We also set the output_sequence_length parameter to 10, which pads or truncates each output sequence to exactly 10 tokens. We then create a dataset from the input directory using the text_dataset_from_directory function, which automatically infers the labels of the text files based on the subdirectories in the input directory. We also specify the validation_split, subset, and seed parameters for additional configuration.
Next, we fit the vectorizer layer on the text data using the adapt method, which builds the internal vocabulary and other parameters from the input text. We also define a vectorization function vectorize_text that applies the vectorizer layer to each text input in the dataset and returns a tuple of the vectorized text and the label.
Finally, we map the vectorize_text function over the dataset object using the map method, which applies the function to each element in the dataset and outputs a new dataset with the vectorized text and labels. We also print the class names and number of classes in the dataset, which are inferred from the subdirectories in the input directory. The resulting vectorized_ds object can be used to train a TensorFlow model, for example, by passing it to the fit() method of a Keras model.
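As a small follow-up sketch, standard tf.data optimizations such as cache() and prefetch() can be chained onto the vectorized dataset to overlap preprocessing with training:
import tensorflow as tf

# cache the vectorized examples and prefetch batches in the background
vectorized_ds = vectorized_ds.cache().prefetch(tf.data.AUTOTUNE)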
tf.data.TextLineDataset
tf.data.TextLineDataset is a class in TensorFlow that allows you to create a tf.data.Dataset object from one or more text files. Each line in the text file(s) is treated as a separate example in the dataset. This class can be used to build custom data pipelines for processing text data in TensorFlow.
Example Code
Here’s an example that shows how to use tf.data.TextLineDataset to create a tf.data.Dataset object from a text file:
import tensorflow as tf

# Define the input text file and batch size
input_file = "path/to/input_file.txt"
batch_size = 32

# Create a TextLineDataset from the input file
dataset = tf.data.TextLineDataset(input_file)

# Define a function to preprocess the text
def preprocess_text(text):
    # Perform some text preprocessing here
    return text

# Apply the preprocessing function to the text data
dataset = dataset.map(preprocess_text)

# Shuffle and batch the dataset (shuffling before batching ensures
# that individual examples, not whole batches, are shuffled)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)

# Print the first batch of data
for batch in dataset.take(1):
    print(batch)
In this code, we first define the input text file and batch size. We then create a TextLineDataset object from the input file using the tf.data.TextLineDataset class. Each line in the text file is treated as a separate example in the dataset.
We then define a function preprocess_text that performs some text preprocessing on the input text. This function can be customized to perform any necessary text preprocessing steps, such as tokenization, filtering out stop words, or converting text to lowercase.
Next, we apply the preprocess_text function to the text data using the map method of the dataset object. This creates a new dataset with the preprocessed text data.
We then shuffle and batch the dataset using the shuffle and batch methods of the dataset object; shuffling before batching ensures that individual examples, rather than whole batches, are shuffled. Finally, we print the first batch of data using the take method of the dataset object.
Note that tf.data.TextLineDataset is a powerful class that can be used to build custom data pipelines for text data in TensorFlow. It can be used in combination with other TensorFlow datasets and preprocessing functions to create complex data processing pipelines.
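For example, here is a minimal sketch that combines tf.data.TextLineDataset with the TextVectorization layer from earlier (the file path is a placeholder):
import tensorflow as tf
from keras.layers import TextVectorization

# create a dataset of raw text lines; the path is a placeholder
lines = tf.data.TextLineDataset("path/to/input_file.txt")

# build the vocabulary from the file contents
vectorizer = TextVectorization(max_tokens=1000, output_mode="int",
                               output_sequence_length=10)
vectorizer.adapt(lines.batch(64))

# map the layer over batches of lines to get integer sequences
vectorized = lines.batch(32).map(vectorizer)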
normalize
normalize is a function in the Keras utilities that allows you to normalize the data in an array along a specific axis. Normalization is a common preprocessing step in machine learning that can improve the performance and stability of many algorithms. In Keras, normalize can be used to preprocess the data before training a deep learning model.
Example Code
Here’s an example that shows how to use normalize in Keras to normalize the pixel values of an image:
from keras.utils import load_img, img_to_array, normalize
# Load an example image
img = load_img('path/to/image.jpg')
# Convert the image to a Numpy array
x = img_to_array(img)
# Normalize the image along the channel axis
x_norm = normalize(x, axis=-1)
# Print the original and normalized data
print("Original data:\n", x)
print("Normalized data:\n", x_norm)
In this code, we first load an example image using the load_img function from the Keras utilities. We then convert the image to a NumPy array using the img_to_array function, which converts the PIL image object to a NumPy array with shape (height, width, channels).
Next, we normalize the data along the channel axis using the normalize function. The axis parameter specifies the axis along which to normalize the data; in this case, we set it to -1, which normalizes along the last (channel) axis. By default, normalize applies L2 normalization, scaling each vector along that axis to unit length.
Finally, we print the original and normalized data using the print function.
Note that normalize can be combined with other array operations, such as reshaping or rescaling, to preprocess the data in different ways. Additionally, normalize can be used together with a Keras deep learning model as part of a pipeline that preprocesses and normalizes the data before training the model.
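To make the behavior concrete, here is a minimal numeric sketch showing that normalize scales each row to unit length under the default L2 norm:
import numpy as np
from keras.utils import normalize

data = np.array([[3.0, 4.0],
                 [1.0, 0.0]])

# each row is divided by its L2 norm: [3, 4] has norm 5, [1, 0] has norm 1
print(normalize(data, axis=-1))
# [[0.6 0.8]
#  [1.  0. ]]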
pad_sequences
pad_sequences is a function in the Keras preprocessing module that allows you to pad sequences to a specified length. Padding is a common preprocessing step in natural language processing that can be used to standardize the length of text sequences in a dataset. In Keras, pad_sequences can be used to preprocess the text data before training a deep learning model.
Example Code
Here’s an example that shows how to use pad_sequences in Keras to pad sequences of text data:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Define some example text data
texts = ['this is a sentence',
         'this is another sentence',
         'a third sentence',
         'yet another sentence']
# Create a Tokenizer object
tokenizer = Tokenizer(num_words=1000)
# Fit the Tokenizer on the text data
tokenizer.fit_on_texts(texts)
# Convert the text data to sequences of integer indices
sequences = tokenizer.texts_to_sequences(texts)
# Pad the sequences to a maximum length of 6
padded_sequences = pad_sequences(sequences, maxlen=6, padding='post')
# Print the original and padded sequences
print("Original sequences:\n", sequences)
print("Padded sequences:\n", padded_sequences)
# Output
Original sequences:
[[2, 3, 4, 1], [2, 3, 5, 1], [4, 6, 1], [7, 5, 1]]
Padded sequences:
[[2 3 4 1 0 0]
[2 3 5 1 0 0]
[4 6 1 0 0 0]
[7 5 1 0 0 0]]
In this code, we first define some example text data as a list of strings. We then create a Tokenizer object and fit it on the text data using the fit_on_texts method, which updates the internal vocabulary of the tokenizer with the words in the text data.
Next, we convert the text data to sequences of integer indices using the texts_to_sequences method of the tokenizer object. This creates a list of sequences, where each sequence represents a text string as a list of integer indices corresponding to the words in the vocabulary.
Finally, we pad the sequences to a maximum length of 6 using the pad_sequences function of the Keras preprocessing module. The maxlen parameter specifies the maximum length of the sequences, and the padding parameter specifies whether to add padding at the beginning ('pre') or end ('post') of each sequence. The resulting padded_sequences object is a 2D NumPy array with the same number of rows as the original sequences, but with zeros padding the end of each sequence.
We then print the original and padded sequences using the print function. Note that pad_sequences can be used in combination with other Keras preprocessing utilities, such as Tokenizer, to preprocess the text data in different ways. Additionally, pad_sequences can be used together with a Keras deep learning model as part of a pipeline that preprocesses and pads the data before training the model.
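As a minimal sketch of that last point (the layer sizes are illustrative placeholders, reusing the padded_sequences array from above), an Embedding layer with mask_zero=True tells downstream layers to ignore the zero padding values:
from keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=8, mask_zero=True),
    layers.LSTM(16),
    layers.Dense(1, activation="sigmoid"),
])

# the padded integer sequences can be fed to the model directly
predictions = model.predict(padded_sequences)  # shape: (4, 1)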
skipgrams
skipgrams is a function in the Keras preprocessing module that generates skip-gram pairs from text data. Skip-grams are a popular method for training word embeddings, which are dense vector representations of words that can capture semantic and syntactic relationships between them. In Keras, skipgrams can be used to preprocess the text data before training a deep learning model.
Example Code
Here’s an example that shows how to use skipgrams in Keras to generate skip-gram pairs from text data:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import skipgrams

# Define some example text data
texts = ['this is a sentence',
         'this is another sentence',
         'a third sentence',
         'yet another sentence']

# Create a Tokenizer object
tokenizer = Tokenizer(num_words=1000)

# Fit the Tokenizer on the text data
tokenizer.fit_on_texts(texts)

# Convert the text data to sequences of integer indices
sequences = tokenizer.texts_to_sequences(texts)

# Generate skip-gram pairs; skipgrams operates on a single sequence
# of integer indices, so we call it once per sequence
pairs, labels = [], []
for seq in sequences:
    seq_pairs, seq_labels = skipgrams(seq, vocabulary_size=1000,
                                      window_size=2, negative_samples=5)
    pairs += seq_pairs
    labels += seq_labels

# Print the first few skip-gram pairs and their labels
print("Pairs:\n", pairs[:5])
print("Labels:\n", labels[:5])
# Output: a list of [target, context] index pairs and a parallel list of
# labels (1 for a positive pair, 0 for a negative sample); the exact
# values vary between runs due to random shuffling and negative sampling.
In this code, we first define some example text data as a list of strings. We then create a Tokenizer object and fit it on the text data using the fit_on_texts method, which updates the internal vocabulary of the tokenizer with the words in the text data.
Next, we convert the text data to sequences of integer indices using the texts_to_sequences method of the tokenizer object. This creates a list of sequences, where each sequence represents a text string as a list of integer indices corresponding to the words in the vocabulary.
Finally, we generate skip-gram pairs using the skipgrams function of the Keras preprocessing module. Note that skipgrams operates on a single sequence of integer indices, so we call it once per sequence and collect the results. The vocabulary_size parameter specifies the size of the vocabulary, the window_size parameter specifies the size of the skip-gram window, and the negative_samples parameter specifies the number of negative samples to generate for each positive pair.
The resulting pairs list contains [target, context] pairs of word indices, and the parallel labels list contains 1 for each positive pair drawn from the text and 0 for each randomly generated negative sample.
Finally, we print the first few skip-gram pairs and their labels using the print function.
Note that skipgrams can be used in combination with other Keras preprocessing utilities, such as Tokenizer and make_sampling_table, to preprocess the text data in different ways. Additionally, skipgrams can be used together with a Keras deep learning model as part of a pipeline that generates training pairs for a word embedding model.
Notice
Deprecated: the tf.keras.preprocessing APIs do not operate on tensors and are not recommended for new code. Prefer loading data with either tf.keras.utils.text_dataset_from_directory or tf.keras.utils.image_dataset_from_directory, and then transforming the output tf.data.Dataset with preprocessing layers. These approaches offer better performance and integration with the broader TensorFlow ecosystem. For more information, see the tutorials for loading text, loading images, and augmenting images, as well as the preprocessing layer guide.
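As a minimal sketch of that recommended approach (the directory path and augmentation values are illustrative placeholders), data augmentation can be expressed with preprocessing layers applied to a tf.data.Dataset:
from keras import layers
from keras.models import Sequential
from keras.utils import image_dataset_from_directory

# load images from a placeholder directory
dataset = image_dataset_from_directory(
    "path/to/input_directory", image_size=(224, 224), batch_size=32)

# preprocessing layers that roughly mirror the ImageDataGenerator
# parameters used earlier in this article
data_augmentation = Sequential([
    layers.RandomFlip("horizontal"),  # cf. horizontal_flip=True
    layers.RandomRotation(0.05),      # cf. rotation_range
    layers.RandomZoom(0.2),           # cf. zoom_range
])

# apply the augmentation on the fly; training=True enables the random transforms
augmented_ds = dataset.map(
    lambda x, y: (data_augmentation(x, training=True), y))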