
The power of machine learning is undeniable, but anyone who has attempted to train a model from scratch knows that data collection can be a tedious and time-consuming process. Taking tens of thousands of pictures for training data? Ain't nobody got time for that! But don't panic, we've got a solution to save you from drowning in data collection despair: let the AI do the heavy lifting for you! 

By using AI models to generate synthetic data, developers can save time, resources, and effort while still achieving high levels of accuracy and diversity in their training data. One such model is DALL-E, the little brother of the popular language model GPT. Like GPT, DALL-E was developed by OpenAI, but it is designed to generate images rather than text. We will demonstrate how DALL-E can be used to create a synthetic dataset for training a gesture recognition model, and discuss the advantages and drawbacks of this approach.

Summary

To create a synthetic dataset for our gesture recognition model, we turned to DALL-E to generate images of hand gestures showing a scissors (peace sign) and a rock (fist) gesture. Using DALL-E, we generated 350 images for each class with varying backgrounds, lighting conditions, and skin tones, resulting in a diverse and realistic dataset for the model to learn from.

To further enhance the dataset, we used a technique called data augmentation using the Keras ImageDataGenerator, which allowed us to generate additional images with variations in lighting, orientation, and other factors. This helped to make the dataset more robust and better able to generalize to new, unseen data.

We then fine-tuned a pre-trained MobileNet model for our gesture recognition task using transfer learning. With this approach, we achieved an accuracy of 96% on the validation data, demonstrating the effectiveness of our method. All of this was done with minimal effort and in less than half an hour. The combination of AI-generated data, data augmentation, and transfer learning allows for creating highly accurate and effective models, with potential applications in a wide range of fields beyond gesture recognition. But let us start at the beginning and explain everything step by step.

Data Collection

Collecting a large and diverse dataset is one of the most important ways to improve the accuracy of a machine learning model, but it is also a major challenge, particularly for image recognition tasks like gesture recognition. Covering a wide range of backgrounds, lighting conditions, and skin tones may require working with a large and diverse group of individuals.

Moreover, collecting real-world data can be challenging, as it may be difficult to capture a sufficient amount of data that covers all possible scenarios and variations. This can lead to a dataset that is biased or incomplete, which in turn affects the accuracy and effectiveness of the resulting model. In contrast, AI-generated data can help overcome many of these challenges. By using an AI model like DALL-E to generate synthetic data, developers can quickly and easily create a large and diverse dataset that covers a wide range of scenarios and variations. This can improve the accuracy and generalization of the model and reduce the time and resources required for data collection. However, there are also some challenges associated with AI-generated data. For example, the generated images may not always be entirely realistic, which could lead to overfitting and reduced accuracy. The DALL-E API makes generating data easy, but the quality of the synthetic data depends heavily on the prompt.

Creating a Synthetic Dataset With DALL-E for Gesture Recognition

To generate the images, we used the OpenAI API to interact with DALL-E. Specifically, we provided a prompt to DALL-E that described the type of image we wanted to generate. DALL-E then used a combination of neural networks and generative models to create highly realistic images that matched our prompts.

The first step to access DALL-E is to set up the OpenAI library and your API key. To generate your API key, take a look at the OpenAI documentation. Then we need to prepare the prompt. Here, it is advantageous to describe the desired images in as much detail as possible, while still keeping the prompt concise. We used prompts like “a gesture of a hand showing a peace sign in a random angle” and “a gesture of a hand showing a fist in a random angle”.

For our specific use case, these prompts are totally fine, but keep in mind: the more complex your problem, the more detailed the prompt must be. If you want a bit more guidance and inspiration for your prompt, we recommend this Editor Guide for DALL-E. It explains the nitty-gritty details!
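To give you an idea of what this looks like in code, here is a minimal sketch using the openai Python package (this assumes the classic Image endpoint of the pre-1.0 library; the folder and file names are just placeholders):

import openai
import requests

openai.api_key = "YOUR_API_KEY"  # see the OpenAI documentation on how to create a key

prompt = "a gesture of a hand showing a peace sign in a random angle"

# request one image at a time and save it to disk; repeat until enough images are collected
for i in range(350):
    response = openai.Image.create(prompt=prompt, n=1, size="512x512")
    image_url = response["data"][0]["url"]
    with open(f"peace/peace_{i:03d}.png", "wb") as f:
        f.write(requests.get(image_url).content)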

That way, we generated 350 images for each class, resulting in a total of 700 images. We then split the dataset into two parts: 250 images per class for training and 100 per class for validation.

Here are some examples of the images DALL-E created for us:

Hand gestures with different backgrounds

Many of the generated images could just as well have been taken by humans and are absolutely fine for our purpose. But not every image is flawless. Some images fall out of line and clearly show that they are AI-generated, like these:

Hand gestures with interlocked fingers

Such images can be, and usually are, removed from the dataset. But in our example, we keep the data exactly as DALL-E created it, to show that even with such a small and imperfect dataset, acceptable results for a prototype can be achieved.

Creating the synthetic dataset with DALL-E was an easy and fast process, taking only a few minutes to generate a sufficient amount of images. In addition, the synthetic data allowed us to create a dataset that was not only diverse but also highly controlled, with specific variations in background, lighting, and other factors that we wanted to include in our training data. 

In the next section, we'll discuss how we further enhanced the dataset using data augmentation, a technique that allowed us to generate additional images with variations in lighting, orientation, and other factors.

Data Augmentation

Data augmentation is a method to artificially increase the size of a given dataset. To do this, we use different operations to create new images from the given images. Examples are flipping the image, increasing the contrast, or rotating the image. By doing so, it is possible to generate 10 or more images from one given image. To apply data augmentation to our dataset, we can easily use the ImageDataGenerator from Keras. Simply by importing the Keras preprocessing library, you can create the ImageDataGenerator and add your preferred augmentations. In our project we chose a rotation of up to 30 degrees, a zoom range of up to 20%, and a horizontal flip.

from keras_preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rotation_range=30,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode='nearest')
val_datagen = ImageDataGenerator(rotation_range=30,
                                 zoom_range=0.2,
                                 horizontal_flip=True,
                                 fill_mode='nearest')

Now we can perform the augmentation. Note that flow() returns a generator that creates the images lazily, so we need to pull from it to actually create and save the augmented files:

next(train_datagen.flow(x, batch_size=1, save_to_dir=aug_path, save_prefix='aug_', save_format='jpg'))

next(val_datagen.flow(x, batch_size=1, save_to_dir=aug_path, save_prefix='aug_', save_format='jpg'))

The arguments are relatively clear: x is the image we want to augment, passed as a NumPy array with a batch dimension; batch_size is the number of images per batch, in our case we parse them one by one; and the remaining arguments specify where the created images are saved, with which prefix, and in which format, in our case as jpg.
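As a side note, here is a minimal sketch of how a single generated image could be loaded into the 4D array shape that flow() expects (the file name and the 224 x 224 target size are just placeholders):

import numpy as np
from keras_preprocessing.image import load_img, img_to_array

img = load_img('rock_001.jpg', target_size=(224, 224))  # placeholder file name
x = img_to_array(img)                                    # array of shape (224, 224, 3)
x = np.expand_dims(x, axis=0)                            # add batch dimension: (1, 224, 224, 3)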

In the next section, we will show you how we trained a model using transfer learning, without having to design the model architecture ourselves.

Creating a Model With Transfer Learning

Transfer learning is a powerful technique that has revolutionized the field of deep learning. It involves using pre-trained neural network models as a starting point for solving new, related problems. In other words, transfer learning allows us to leverage the knowledge learned by a model in one task and apply it to another, similar task.

One of the significant advantages of transfer learning is that it saves time and computational resources. Training a deep neural network from scratch requires a large amount of data and computing power, which may not always be available or feasible. However, with transfer learning, we can use pre-trained models that have already been trained on vast amounts of data, reducing the need for extensive training.

Another advantage of transfer learning is that it can improve the performance of a model on a new task. By using a pre-trained model's knowledge, we can obtain better results with less training data, leading to faster convergence and better generalization.

TensorFlow gives us easy access to many pre-trained models for transfer learning. The model that we use in this example is MobileNet. The MobileNet model is a deep learning architecture that is designed to run efficiently on mobile devices with limited computational resources, such as smartphones and embedded systems. It was developed by Google researchers in 2017 and has since become popular in the field of computer vision.

One of the key features of the MobileNet model is its use of depth-wise separable convolutions, which reduce the number of parameters and computations required while still maintaining high accuracy. This makes it an ideal model for applications that require real-time processing, such as image and video classification.

We can include the model like this:

import tensorflow as tf

input_shape = (224, 224, 3)  # e.g. MobileNet's default input size
base_model = tf.keras.applications.mobilenet.MobileNet(weights='imagenet',
                                                       include_top=False,
                                                       input_shape=input_shape)

That way, we can download the model and give it the input_shape of our data. Setting include_top=False removes the classification head of the pre-trained MobileNet model and returns the output of the last convolutional block instead. This output can then be used as input to another layer or model that we define ourselves, depending on the task at hand.

The next step is to specify that we don't want to train any layer of the model:

for layer in base_model.layers:
    layer.trainable = False

And then we can add our layers to the model, and also include the hyperparameters we think would do the best job.

inputs = tf.keras.Input(shape=input_shape)

x = base_model(inputs)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)

num_classes = 2  # rock and scissors
predictions = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

model = tf.keras.models.Model(inputs=inputs, outputs=predictions)

model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])

model.fit(train_generator,
          epochs=1,
          steps_per_epoch=len(train_generator),
          validation_data=val_generator,
          validation_steps=len(val_generator))

model.save('peace_sign_model.h5')
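Note that the train_generator and val_generator used in model.fit() are not shown above. A minimal sketch of how they could be built with flow_from_directory, assuming the augmented images were saved into one folder per class (the directory names are placeholders):

train_generator = train_datagen.flow_from_directory('data/train',
                                                    target_size=(224, 224),
                                                    batch_size=32,
                                                    class_mode='categorical')
val_generator = val_datagen.flow_from_directory('data/val',
                                                target_size=(224, 224),
                                                batch_size=32,
                                                class_mode='categorical')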

We will not go into the details here, as it would go beyond the scope of this article. After a successful training, we can test the model.
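To illustrate the testing step, here is a short sketch of how the saved model could be loaded and applied to a single image (the file name is a placeholder, and the preprocessing should match whatever was used during training):

import numpy as np
import tensorflow as tf
from keras_preprocessing.image import load_img, img_to_array

model = tf.keras.models.load_model('peace_sign_model.h5')

img = load_img('test_gesture.jpg', target_size=(224, 224))  # placeholder test image
x = np.expand_dims(img_to_array(img), axis=0)                # batch of one, preprocessed as in training
probabilities = model.predict(x)                             # one probability per class
print(probabilities)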


With our approach, we were able to achieve 96% accuracy even though our dataset was not cleaned. To have a comparative value, we took the trouble to clean the dataset: we replaced about 70 of the 700 images with newly generated DALL-E images that had no errors. This increased the accuracy to 97%. The takeaway: as long as DALL-E cannot generate flawless images for a specific use case, manual cleaning should be considered, especially in use cases where the highest accuracy is required, such as medical diagnosis. For our project, however, the uncleaned dataset is perfectly sufficient.

Conclusion

In this article, we have demonstrated the power of AI-generated data for deep learning, specifically in the domain of gesture recognition. With the dataset created by DALL-E, we were able to achieve an accuracy of 96%, and even 97% with a cleaned dataset. Thanks to the diverse and realistic dataset, we were able to efficiently train a gesture recognition model with minimal effort. The combination of AI-generated data, data augmentation, and transfer learning allows for creating highly accurate and effective models, with potential applications in a wide range of fields beyond gesture recognition.

The use of synthetic data generated by AI models like DALL-E can reduce the time, resources, and effort required for data collection, while still providing a diverse and representative dataset for training. Moreover, data augmentation techniques can further enhance the dataset and improve the model's generalization capabilities.

Transfer learning, on the other hand, enables us to leverage the knowledge of pre-trained models, leading to faster convergence and better performance on new tasks. This approach not only saves time and computational resources, but also improves the model's effectiveness. Overall, the combination of these techniques showcases the potential of AI-generated data.
