Tutorial on Keras flow_from_dataframe

Vijayabhaskar J
7 min readSep 21, 2018

A detailed example article demonstrating the flow_from_dataframe function from Keras.

Note: This post assumes that you have at least some experience in using Keras.

Most of the Image datasets that I found online has 2 common formats, the first common format contains all the images within separate folders named after their respective class names, This is by far the most common format I always see online and Keras allows anyone to utilize the flow_from_directory function to easily the images read from the disc and perform powerful on the fly image augmentation with the ImageDataGenerator.

The second most common format I found online is, all the images are present inside a single directory and their respective classes are mapped in a CSV or JSON file, but Keras doesn’t support this earlier and one would have to move the images to separate directories with their respective classes names or write a custom generator to handle this case, So I have written a function flow_from_dataframe that recently got accepted to the official keras-preprocessing git repo, that allows you to input a Pandas dataframe which contains the filenames(with or without the extensions) column and a column which has the class names and directly read the images from the directory with their respective class names mapped.

To use the flow_from_dataframe function, you would need pandas installed.
You could do that by pip install pandas

Note: Make sure you’re using the latest keras-preprocessing library by installing it directly from the Github repo. It contains many changes from the one that resides under keras.preprocessing.

Make sure you uninstall the older keras-preprocessing that is installed when you’re installing keras by executing the command

pip uninstall keras-preprocessing

and install the keras preprocessing library by executing the command

pip install git+https://github.com/keras-team/keras-preprocessing.git

And finally, restart the kernel if needed.

Previously, One should have to write a custom generator if they have to perform regression or predict multiple columns and utilize the image augmentation capabilities of the ImageDataGenerator, now you can just have the target values as just another column/s (must be numerical datatype) in your dataframe, simply provide the column names to the flow_from_dataframe and that’s it! Now you can now use all the augmentations provided by the ImageDataGenerator.

First, download the dataset and save the image files under a single directory.
For example, I’m going to use the dataset https://www.kaggle.com/c/cifar-10/data
If you download and extract the train.7z and test.7z you would get two folders named “train” and “test” each contains all the images under these folders, and also you have to download the trainLabels.csv file which maps the filenames of the training images to their respective classes.

Let’s dive into the code!

Import all the stuff needed and read the CSV file with pandas.

from keras.models import Sequential#Import from keras_preprocessing not from keras.preprocessingfrom keras_preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers, optimizers
import pandas as pd
import numpy as np
def append_ext(fn):
return fn+".png"
traindf=pd.read_csv(“./trainLabels.csv”,dtype=str)
testdf=pd.read_csv("./sampleSubmission.csv",dtype=str)
traindf["id"]=traindf["id"].apply(append_ext)
testdf["id"]=testdf["id"].apply(append_ext)
datagen=ImageDataGenerator(rescale=1./255.,validation_split=0.25)

You would have noticed, I appended “.png” to all the filenames in the “id” column of the dataframe to convert the file ids to actual filenames(depending upon the dataset you might want to handle this accordingly), previously that was handled automatically by “has_ext” attribute which is now deprecated for various reasons.

First 5 rows of traindf

Notice below that I split the train set to 2 sets one for training and the other for validation just by specifying the argument validation_split=0.25 which splits the dataset into to 2 sets where the validation set will have 25% of the total images.
If you wish you can also split the dataframe into 2 explicitly and pass the dataframes to 2 different flow_from_dataframe functions.

train_generator=datagen.flow_from_dataframe(
dataframe=traindf,
directory="./train/",
x_col="id",
y_col="label",
subset="training",
batch_size=32,
seed=42,
shuffle=True,
class_mode="categorical",
target_size=(32,32))
valid_generator=datagen.flow_from_dataframe(
dataframe=traindf,
directory="./train/",
x_col="id",
y_col="label",
subset="validation",
batch_size=32,
seed=42,
shuffle=True,
class_mode="categorical",
target_size=(32,32))
test_datagen=ImageDataGenerator(rescale=1./255.)test_generator=test_datagen.flow_from_dataframe(
dataframe=testdf,
directory="./test/",
x_col="id",
y_col=None,
batch_size=32,
seed=42,
shuffle=False,
class_mode=None,
target_size=(32,32))

Since I used the validation_split to split the dataset I have to specify which set is to be used for which flow_from_dataframe function. So, we have this subset argument which takes either “training” or “validation”.

Arguments specific to flow_from_dataframe:

  • directory — (str)Path to the directory which contains all the images.
    set this to None if your x_col contains absolute_paths pointing to each image files instead of just filenames.
  • x_col — (str) The name of the column which contains the filenames of the images.
  • y_col — (str or list of str) If class_mode is not “raw” or not “input” you should pass the name of the column which contains the class names.
    None, if used for test_generator.
  • class_mode — (str) Similar to flow_from_directory, this accepts “categorical”(default), ”binary”, ”sparse”, ”input”, None and also an extra argument “raw”.
    If class_mode is set to “raw” it treats the data in the column or list of columns of the dataframe as raw target values(which means you should be sure that data in these columns must be of numerical datatypes), will be helpful if you’re building a model for regression task like predicting the angle from the images of steering wheel or building a model that needs to predict multiple values at the same time.
    For Test generator: Set this to None, to return only the images.
  • batch_size: For train and valid generator you can keep this according to your needs but for test generator:
    Set this to some number that divides your total number of images in your test set exactly.
    Why this only for test_generator?
    Actually, you should set the “batch_size” in both train and valid generators to some number that divides your total number of images in your train set and valid respectively, but this doesn’t matter before because even if batch_size doesn’t match the number of samples in the train or valid sets and some images gets missed out every time we yield the images from generator, but it would be sampled the very next epoch you train.
    But for the test set, you should sample the images exactly once, no less or no more. If Confusing, just set it to 1(but maybe a little bit slower).
  • shuffle: Set this to False(For Test generator only, for others set True), because you need to yield the images in “order”, to predict the outputs and match them with their unique ids or filenames.
  • drop_duplicates: If you’re for some reason don’t want duplicate entries in your dataframe’s x_col, set this to False, default is True.
  • validate_filenames: whether to validate image filenames in x_col. If True, invalid images will be ignored. Disabling this option can lead to speed-up in the instantiation of this class if you have a huge amount of files, default is True.

Build the model:

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
input_shape=(32,32,3)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(optimizers.rmsprop(lr=0.0001, decay=1e-6),loss="categorical_crossentropy",metrics=["accuracy"])

Fitting the Model:

STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size
STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
model.fit_generator(generator=train_generator,
steps_per_epoch=STEP_SIZE_TRAIN,
validation_data=valid_generator,
validation_steps=STEP_SIZE_VALID,
epochs=10
)

Evaluate the model

model.evaluate_generator(generator=valid_generator,
steps=STEP_SIZE_TEST)

Since we are evaluating the model, we should treat the validation set as if it was the test set. So we should sample the images in the validation set exactly once(if you are planning to evaluate, you need to change the batch size of the valid generator to 1 or something that exactly divides the total num of samples in validation set), but the order doesn’t matter so let “shuffle” be True as it was earlier.

For predicting the model you can use flow_from_directory because it doesn’t make sense to me to use a dataframe that has no class names, instead you can just find the images from the directory and predict from it.
If you want a tutorial to predict using flow_from_directory, just follow the last part of this tutorial where I discuss about predicting.
https://medium.com/@vijayabhaskar96/tutorial-image-classification-with-keras-flow-from-directory-and-generators-95f75ebe5720

If you insist you can use the flow_from_dataframe to predict too!

Predict the output

test_generator.reset()
pred=model.predict_generator(test_generator,
steps=STEP_SIZE_TEST,
verbose=1)
  • You need to reset the test_generator before whenever you call the predict_generator. This is important, if you forget to reset the test_generator you will get outputs in a weird order.
predicted_class_indices=np.argmax(pred,axis=1)

now predicted_class_indices has the predicted labels, but you can’t simply tell what the predictions are, because all you can see is numbers like 0,1,4,1,0,6…
and most importantly you need to map the predicted labels with their unique ids such as filenames to find out what you predicted for which image.

labels = (train_generator.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]

Finally, save the results to a CSV file.

filenames=test_generator.filenames
results=pd.DataFrame({"Filename":filenames,
"Predictions":predictions})
results.to_csv("results.csv",index=False)

--

--