AI Beginners’ Playground: Why Shuffling Your Numeric Data Matters.

Prashant Dandriyal
5 min read · Jan 25, 2020

Objective:

  • To prove how important shuffling is for your data
  • To specify types of data and the shuffling processes needed for each
  • To focus mainly on numeric data and demonstrate one of the shuffling methods you can use

Wait, ask yourself “Why should I read further?”

If you are looking only for the method and the code, the Concept section below should be enough. But if you want to know why I thought of writing about it and how it preyed on my neural network’s predictions, read on.

Concept:

Shuffling is the icing on the cake called data augmentation. Here is how we do it: we take 20 numbers that will be our training data.

const int numTrainingSets = 20;

Now, let there be an array that stores the order of the indices of this data. Initially, it will naturally be 0, 1, 2, 3 … 19.

int trainingSetOrder[numTrainingSets];

Now, we will assign values to this array so that the i-th cell holds the value i:

for (int j = 0; j < numTrainingSets; ++j)
    trainingSetOrder[j] = j;

Now, comes the crucial part, defining the shuffle function.

void shuffle(int *array, size_t n)
{
    if (n > 1)
    {
        size_t i;
        for (i = 0; i < n - 1; i++)
        {
            // Pick a random index j in [i, n-1], then swap array[i] and array[j]
            // (a Fisher–Yates-style shuffle).
            size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
            int t = array[j];
            array[j] = array[i];
            array[i] = t;
        }
    }
}

Now let’s understand (if you didn’t by now) how simply we implemented this. Let’s run it on a data set with only 4 values, at indices 0, 1, 2, 3; we want to shuffle these indices. To do that, we walk through the array that stores the indices, stopping at index 2 (the loop runs while i < n-1, so the last position visited is n-2). The first value of i is obviously 0. We then randomly generate another index j within the range i to n-1, so 0–3 here, and swap the values at those two positions. For instance, let’s say we got 3 as our randomly generated index: we swap the 0 and the 3, and now our array of indices looks like:

3, 1, 2, 0

But this is only the first iteration (i = 0). We repeat the same swap 2 more times, for i = 1 and i = 2, and the shuffle is done. Hopefully now you’ve got it; a small trace of the same procedure is sketched below.
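If you want to watch the trace yourself, here is a minimal sketch (my own illustration, reusing the same swap logic as the shuffle function above) that prints the four indices after every iteration; the exact output depends on rand():

#include <bits/stdc++.h>
using namespace std;

// Minimal trace of the swap logic on 4 indices, printing the array
// after each iteration of the loop.
int main()
{
    int idx[4] = {0, 1, 2, 3};
    size_t n = 4;
    for (size_t i = 0; i < n - 1; i++)
    {
        size_t j = i + rand() / (RAND_MAX / (n - i) + 1); // random j in [i, n-1]
        swap(idx[i], idx[j]);                             // swap the two entries
        cout << "after i = " << i << " : ";
        for (auto v : idx)
            cout << v << " , ";
        cout << "\n";
    }
}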

So, that’s all about the simple and stupid method to shuffle your data. Here’s the entire code:

#include <bits/stdc++.h>
using namespace std;

void shuffle(int *array, size_t n)
{
    if (n > 1) // shuffle only if there is more than one training example
    {
        size_t i;
        for (i = 0; i < n - 1; i++)
        {
            // Pick a random index j in [i, n-1], then swap array[i] and array[j]
            size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
            int t = array[j];
            array[j] = array[i];
            array[i] = t;
        }
    }
}

int main()
{
    const int numTrainingSets = 20;
    int trainingSetOrder[numTrainingSets];
    for (int j = 0; j < numTrainingSets; ++j)
        trainingSetOrder[j] = j;

    for (int n = 0; n < 1; ++n) // n counts epochs
    {
        for (auto i : trainingSetOrder) // print the order before shuffling
            cout << i << " , ";
        shuffle(trainingSetOrder, numTrainingSets);
    }
    for (auto i : trainingSetOrder) // print the shuffled order
        cout << i << " , ";
}

The output for 20 indices is the original order, 0 through 19, followed by a shuffled permutation of the same indices (the exact permutation depends on rand()).

Now, let’s cover the remaining 2 objectives.

Why shuffle?

Speaking of the learning process shared by all machine-learning methods (at least in supervised learning): the model takes in the training data, develops knowledge, or rather a memory, and makes predictions using this memory when asked to.

If we draw a rough analogy, the memory can be visualized as a truth table, or simply a table, which is filled with random values in the beginning; as training progresses, the model keeps correcting this table. After enough repetitions/iterations/epochs, the table has been corrected to a certain extent and the model becomes capable of making predictions with an accuracy proportional to the correctness of this table. Now comes the caveat: models have an inherent tendency to memorize the data if it is not shuffled.

Photo by Lee Jiyong on Unsplash

I’ll explain that using an example derived from my own experiments. Remember how I mentioned “focus on numeric data” above? Now you ought to ask yourself:

Why Numeric Data?

First, recall the usual classification of machine-learning problems (supervised learning splits broadly into classification and regression) as a quick reference for where we are headed.

We will focus on Regression. Let’s say we build a model that learns some function (like Sine, Cosine, Addition or any random wiggly function): f(x). You will be astonished to know that neural networks are universal function approximators, which means that no matter what function you pick, there exists a neural network that can approximate it when trained on the (x, f(x)) values for enough iterations.

Note: The function should be continuous.

We will choose an addition function. (Also see the Sine function and XOR gate implementations.) So, our function is

f(x,y) = x + y
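Before looking at what went wrong, here is a minimal sketch (my own illustration, not the article’s exact code) of how the training pairs for this function could be built. The names trainX, trainY and buildDataset are placeholders I made up; the x = i, y = 2i pattern mirrors the loop used below:

const int numExamples = 9;    // i runs from 1 to 9 in the loop below
float trainX[numExamples][2]; // inputs (x, y)
float trainY[numExamples];    // targets f(x, y) = x + y

void buildDataset()
{
    int k = 0;
    for (float i = 1; i < 10; ++i, ++k)
    {
        trainX[k][0] = i;                        // x = i
        trainX[k][1] = 2 * i;                    // y = 2i
        trainY[k] = trainX[k][0] + trainX[k][1]; // target = x + y
    }
}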

So, we trained the model on data like ((x=2, y=3), f=5). As this is our first attempt, we commit some mistake (a logical one, so no compiler complaining!) and watch the error, expecting it to drop. We test our model by asking it to guess the output when we feed x=i and y=2i, where i is the iterator of the loop for(float i=1; i<10; ++i){}
To our surprise, we get the predicted vs actual output plot as follows:

Blue: actual result, Red: predicted result

WOW, got it right in the first attempt! But the error doesn’t seem to be minimizing much. OOPS!

Photo by Jelleke Vanooteghem on Unsplash

Now let’s understand what happened. Remember the caveat I mentioned above (with the joker rabbit!)? Have you ever had a student in your class who was asked to learn some True/False answers but instead chose to memorize the order of the answers (True-True-False-True-False)? When the teacher asked the questions in a random order, you know what happened next. ;)

This problem arose because our model memorized the pattern in which the data changed. Look at the for loop above (we used the same loop for training too): our data keeps increasing, and that monotonic pattern is exactly what our model memorized. Clever boy!

Hopefully, now you see the need for Shuffling.
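To make the remedy concrete, here is a minimal sketch of a training loop that re-shuffles the order of examples at the start of every epoch. It assumes the shuffle() function shown earlier and the hypothetical trainX/trainY arrays from the sketch above; trainOneExample() is a made-up placeholder for whatever weight update your network performs:

void train(int numEpochs)
{
    int order[numExamples]; // indices into the training data
    for (int j = 0; j < numExamples; ++j)
        order[j] = j;

    for (int epoch = 0; epoch < numEpochs; ++epoch)
    {
        shuffle(order, numExamples); // break the monotonically increasing order
        for (int k = 0; k < numExamples; ++k)
        {
            int idx = order[k];
            // trainOneExample(trainX[idx], trainY[idx]); // hypothetical update step
        }
    }
}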

Reference: https://towardsdatascience.com/simple-neural-network-implementation-in-c-663f51447547
