Journey LLM 5: Feed Forward Neural Networks from Scratch

Akshay Jain
Published in Stackademic · May 1, 2024

Feedforward neural networks are inspired by biological neural networks, which contain enormous numbers of interconnected neurons that pass messages to one another. Generally, to solve a regression or classification problem, the last layer of the network contains only one unit. A single unit on its own is not expressive enough for complex problems, so feedforward neural networks with hidden layers, loosely modeled on the architecture of human nerves, were developed.

An example of a Feed-Forward Neural Network with one hidden layer (with 3 neurons)

To understand the image above, consider a simple example. Suppose we are walking along a road and suddenly encounter a cow in front of us; the cow's image becomes the input layer. Our brain then processes this information and comes up with possible responses: change our path, threaten the cow so it moves away, or simply wait and do nothing until the cow moves on. This processing happens in the hidden layers. From all the possible outcomes, we select the one that best fits the situation; this is the output layer.

History and Evolution:

The origins of Feedforward Neural Networks (FNNs) can be traced back to the work of Warren McCulloch and Walter Pitts in 1943. Their model, known as the McCulloch-Pitts neuron, laid the foundation for understanding how simple processing units could perform logical operations.

The concept of FNNs gained significant traction when Frank Rosenblatt introduced the Perceptron in 1957. The Perceptron, a single-layer network with binary inputs and outputs, could learn simple linear relationships. The Perceptron, while sometimes described as an extreme learning machine, was not yet a deep learning network.

The Group Method of Data Handling (GMDH), developed by Alexey G. Ivakhnenko in 1965, was based on polynomial regression and aimed at identifying non-linear relationships between input and output variables. The Multilayer GMDH variant explicitly builds a layered architecture with hidden layers to learn complex relationships between inputs and outputs, while the Polynomial GMDH variant uses polynomial activation functions in its hidden layers to achieve non-linearity. This structure closely resembles a feedforward neural network with hidden layers.

Data and Task: Let's work through an example using the moons dataset from the scikit-learn library. We will try to correctly classify the two interleaving half-moons using a feedforward neural network.

import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Red/green colormap used for all the plots below
my_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red", "green"])

# Generate the two-moons dataset
data, labels = make_moons(n_samples=1000, noise=0.2, random_state=0)
print(data.shape, labels.shape)
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=my_cmap)
plt.show()

# Split into training and validation sets, stratified by class
X_train, X_val, Y_train, Y_val = train_test_split(data, labels, stratify=labels, random_state=0)
print(X_train.shape, X_val.shape)
print(Y_train.shape, Y_val.shape)

plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_train, cmap=my_cmap)
plt.show()

Model: We will now walk through the different functions and terminology used in a feedforward neural network.

Terminology:

W: Weights, indexed as W{layer number}{receiving neuron}{input neuron}. For example, W213 refers to the weight connecting the 3rd input neuron to the 1st neuron of the 2nd hidden layer.

B: Biases

a: Pre-activation, a(x) = W·h(x) + b, i.e. a weighted sum of the previous layer's activations plus the bias

h: Activation, h(x) = g(a(x)), where g is called the activation function (generally sigmoid for the hidden layers and softmax for the output layer)

The pre-activation outputs of the first layer, a11, a12 and a13, are calculated using a matrix-vector multiplication: a1 = W1·x + b1.

The activation values are then obtained by applying the activation function element-wise: h1i = sigmoid(a1i).
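As a quick numeric illustration (a minimal NumPy sketch with arbitrary weights, not values from the trained model), the first layer's pre-activations and activations for a two-dimensional input and three hidden neurons can be computed like this:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([[0.5, -1.0]])   # one input with 2 features, shape (1, 2)
W1 = np.random.randn(2, 3)    # weights from the 2 inputs to the 3 hidden neurons
b1 = np.zeros((1, 3))         # biases of the 3 hidden neurons

a1 = np.matmul(x, W1) + b1    # pre-activations a11, a12, a13
h1 = sigmoid(a1)              # activations h11, h12, h13
print(a1, h1)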

How do we select the output layer for multi-class classification and regression?

This image depicts the output layer for a classification problem (left) and a regression problem (right)

For a multi-class problem, we use softmax to compute the final probabilities: every softmax output is greater than 0 and the outputs sum to 1. We do not use a linear output here because, for negative inputs, a linear layer can produce negative values, which cannot be interpreted as probabilities.
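For intuition, here is a small worked example with made-up scores: softmax exponentiates each score and then normalizes, so every output is positive and the outputs sum to 1 even when some of the inputs are negative.

import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

scores = np.array([2.0, -1.0, 0.5])
print(softmax(scores))          # roughly [0.79, 0.04, 0.18], all positive
print(np.sum(softmax(scores)))  # 1.0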


import itertools

import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm as tqdm_notebook


class FFSN_MultiClass:

    def __init__(self, n_inputs, n_outputs, hidden_sizes=[3]):
        # Initialize the layer sizes, weights and biases
        self.nx = n_inputs
        self.ny = n_outputs
        self.nh = len(hidden_sizes)
        self.sizes = [self.nx] + hidden_sizes + [self.ny]
        self.W = {}
        self.B = {}
        for i in range(self.nh + 1):
            self.W[i + 1] = np.random.randn(self.sizes[i], self.sizes[i + 1])
            self.B[i + 1] = np.zeros((1, self.sizes[i + 1]))

    # Sigmoid activation for the hidden layers
    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    # Softmax activation for the output layer
    def softmax(self, x):
        exps = np.exp(x)
        return exps / np.sum(exps)

    # Forward pass: compute pre-activations A and activations H layer by layer
    def forward_pass(self, x):
        self.A = {}
        self.H = {}
        self.H[0] = x.reshape(1, -1)
        for i in range(self.nh):
            self.A[i + 1] = np.matmul(self.H[i], self.W[i + 1]) + self.B[i + 1]
            self.H[i + 1] = self.sigmoid(self.A[i + 1])
        self.A[self.nh + 1] = np.matmul(self.H[self.nh], self.W[self.nh + 1]) + self.B[self.nh + 1]
        self.H[self.nh + 1] = self.softmax(self.A[self.nh + 1])
        return self.H[self.nh + 1]

    def predict(self, X):
        Y_pred = []
        for x in X:
            y_pred = self.forward_pass(x)
            Y_pred.append(y_pred)
        return np.array(Y_pred).squeeze()

    # Activation of a single neuron in a given layer, for every input in X
    def predict_H(self, X, layer, neuron):
        assert layer < len(self.sizes), "Layer can't be greater than maximum layers"
        assert neuron < self.sizes[layer], "Exceeds the total neurons in layer {}".format(layer)
        Y_pred = []
        for x in X:
            self.forward_pass(x)
            Y_pred.append([item[neuron] for item in self.H[layer]])
        return np.array(Y_pred)

    # Derivative of the sigmoid, written in terms of its output s: s * (1 - s)
    def grad_sigmoid(self, x):
        return x * (1 - x)

    # Cross-entropy loss for one-hot encoded labels
    def cross_entropy(self, label, pred):
        y1 = np.multiply(pred, label)
        y1 = y1[y1 != 0]
        y1 = -np.log(y1)
        return np.mean(y1)

    # Backpropagation: gradients of the loss w.r.t. all weights and biases
    def grad(self, x, y):
        self.forward_pass(x)
        self.dW = {}
        self.dB = {}
        self.dH = {}
        self.dA = {}
        L = self.nh + 1
        # For softmax + cross-entropy, the output-layer gradient is (prediction - one-hot label)
        self.dA[L] = (self.H[L] - y)
        for k in range(L, 0, -1):
            self.dW[k] = np.matmul(self.H[k - 1].T, self.dA[k])
            self.dB[k] = self.dA[k]
            self.dH[k - 1] = np.matmul(self.dA[k], self.W[k].T)
            self.dA[k - 1] = np.multiply(self.dH[k - 1], self.grad_sigmoid(self.H[k - 1]))

    def fit(self, X, Y, epochs=100, initialize=True, learning_rate=1,
            display_loss=False, display_weights=False):
        if display_loss:
            loss = {}
        if display_weights:
            self.weights = []

        if initialize:
            for i in range(self.nh + 1):
                self.W[i + 1] = np.random.randn(self.sizes[i], self.sizes[i + 1])
                self.B[i + 1] = np.zeros((1, self.sizes[i + 1]))

        for epoch in tqdm_notebook(range(epochs), total=epochs, unit="epochs"):
            # Accumulate gradients over the full training set (batch gradient descent)
            dW = {}
            dB = {}
            for i in range(self.nh + 1):
                dW[i + 1] = np.zeros((self.sizes[i], self.sizes[i + 1]))
                dB[i + 1] = np.zeros((1, self.sizes[i + 1]))
            for x, y in zip(X, Y):
                self.grad(x, y)
                for i in range(self.nh + 1):
                    dW[i + 1] += self.dW[i + 1]
                    dB[i + 1] += self.dB[i + 1]

            # Gradient-descent update on the scaled, accumulated gradients
            m = X.shape[1]
            for i in range(self.nh + 1):
                self.W[i + 1] -= learning_rate * (dW[i + 1] / m)
                self.B[i + 1] -= learning_rate * (dB[i + 1] / m)

            if display_loss:
                Y_pred = self.predict(X)
                loss[epoch] = self.cross_entropy(Y, Y_pred)

            if display_weights:
                # Flatten the weight matrices into one row per epoch,
                # padding layers that are narrower than the widest layer
                weights = []
                max_neuron_layer = max(self.sizes)
                for i in range(self.nh + 1):
                    if max_neuron_layer == len(self.W[i + 1][0]):
                        weights.append(list(itertools.chain(*self.W[i + 1].tolist())))
                    else:
                        for j in self.W[i + 1].tolist():
                            j.append(0.0)
                            weights.append(j)
                self.weights.append(list(itertools.chain(*weights)))

        if display_loss:
            plt.plot(list(loss.values()))
            plt.xlabel("Epochs")
            plt.ylabel("CE")
            plt.show()

Binary Classification Loss:

Let's compute the loss for binary classification.

Here, the true label is either 0 or 1, so we don't use the softmax function to calculate the final probability; a single sigmoid output already gives the probability of the positive class, and the binary cross-entropy is computed from it.
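As a rough sketch (the helper name and the example numbers are just for illustration and are not part of the class above), binary cross-entropy compares the sigmoid output, i.e. the predicted probability of class 1, against the 0/1 label:

import numpy as np

def binary_cross_entropy(label, pred, eps=1e-12):
    # Clip predictions so log(0) never occurs
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(label * np.log(pred) + (1 - label) * np.log(1 - pred))

# True labels [1, 0] with predicted probabilities [0.9, 0.2]
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ~0.164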

Multi-Class Classification Loss:

In this example, the softmax function is used at the output layer because we are predicting one of several classes; the loss is the cross-entropy between the one-hot encoded label and the predicted probabilities (the cross_entropy method in the class above).
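To tie everything together, here is a minimal training sketch (the hidden layer size, epoch count and learning rate are illustrative assumptions, not values from the article). The labels are one-hot encoded because the cross_entropy method and the output-layer gradient above expect one-hot targets.

# One-hot encode the two-class labels: shape (n_samples, 2)
Y_OH_train = np.zeros((Y_train.shape[0], 2))
Y_OH_train[np.arange(Y_train.shape[0]), Y_train] = 1

# Build and train the network: 2 inputs, one hidden layer of 3 neurons, 2 outputs
ffsn_multi = FFSN_MultiClass(n_inputs=2, n_outputs=2, hidden_sizes=[3])
ffsn_multi.fit(X_train, Y_OH_train, epochs=2000, learning_rate=0.005, display_loss=True)

# Evaluate on the validation set
Y_pred_val = np.argmax(ffsn_multi.predict(X_val), axis=1)
print("Validation accuracy:", np.mean(Y_pred_val == Y_val))

With display_loss=True, the cross-entropy curve plotted by fit should decrease over the epochs if the learning rate is reasonable.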

Issues:

  1. Loss of neighbourhood information
  2. More parameters to optimize
  3. Translation variance: the network may produce different outputs when the input is shifted.

Understanding Neuron Internals:

The Python code below will help you understand how each neuron learns from the inputs and the learning rate; the weights and biases are adjusted based on the loss function. For every layer and neuron, we plot that neuron's activation over the input space and save it as a short animation.



from matplotlib import animation

def save_animation(l, n):
    # Plot the activation of neuron n in layer l over the input space
    # and save the contour plot as a short animation
    frames = 20
    fig, ax = plt.subplots(figsize=(10, 5))

    def make_meshgrid(x, y, h=0.2):
        x_min, x_max = x.min() - 0.5, x.max() + 0.5
        y_min, y_max = y.min() - 0.5, y.max() + 0.5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        return xx, yy

    xx, yy = make_meshgrid(X_train[:, 0], X_train[:, 1])
    z = ffsn_multi.predict_H(np.c_[xx.ravel(), yy.ravel()], l, n).reshape(xx.shape)

    def plot_contour(i):
        # Contour of the neuron's activation, with the training points on top
        out = ax.contourf(xx, yy, z, alpha=0.2, cmap=my_cmap)
        ax.scatter(X_train[:, 0], X_train[:, 1], c=Y_train, cmap=my_cmap)
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xlabel("X1")
        ax.set_ylabel("X2")
        ax.set_title("Layer {} and neuron {}".format(l, n))
        return out

    animation.FuncAnimation(fig, plot_contour, frames=frames, repeat=True).save("{}_{}.mp4".format(l, n))

sizes = ffsn_multi.sizes
for i in range(len(sizes)):
    for j in range(sizes[i]):
        save_animation(i, j)
        print("Animation Saved for layer {} and neuron {}".format(i, j))

from moviepy.editor import VideoFileClip, concatenate_videoclips

# Stitch the per-neuron animations into a single video
clips = []
for i in range(len(sizes)):
    for j in range(sizes[i]):
        clips.append(VideoFileClip("{}_{}.mp4".format(i, j)))

final_clip = concatenate_videoclips(clips)
final_clip.write_videofile("my_concatenation.mp4")

Let's watch this animation to understand the visuals.

Code:

Conclusion: Feedforward neural networks (FNNs), also known as multilayer perceptrons (MLPs), are a fundamental type of artificial neural network used in machine learning and deep learning. While FNNs have many advantages and are capable of approximating complex functions, they also come with several disadvantages and limitations:

a. Lack of Memory: FNNs do not possess memory of past inputs or outputs. They treat each input independently and do not consider the sequence or history of data. This limitation makes FNNs less suitable for tasks that require modeling sequential or time-dependent data, such as natural language processing and speech recognition.
b. Fixed Architecture: The architecture of an FNN is fixed in advance, with a predefined number of layers and neurons. This rigid structure can make it challenging to adapt the network to the complexity of a given problem. More complex problems may require larger and deeper networks, while simpler problems may not need as many layers or neurons.
c. Overfitting: FNNs are prone to overfitting, especially when dealing with small datasets. Overfitting occurs when the model learns noise in the data rather than the underlying patterns, leading to poor generalization to unseen data. Regularization techniques are often needed to mitigate this issue.
d. Choice of Hyperparameters: Setting hyperparameters such as the learning rate, batch size, and the number of hidden layers and neurons can be challenging and may require extensive experimentation. The wrong choice of hyperparameters can lead to slow convergence or suboptimal model performance.
e. Vanishing and Exploding Gradients: FNNs can suffer from the vanishing gradient problem and the exploding gradient problem. The former occurs when gradients become too small during training, making it difficult for the network to learn deep representations. The latter occurs when gradients become too large, leading to instability during training. Techniques like batch normalization and careful weight initialization help mitigate these issues.
f. Limited Interpretability: FNNs are often considered “black-box” models because it can be challenging to interpret how they arrive at their predictions. Understanding the inner workings of the model and explaining its decisions to stakeholders can be difficult, especially for complex networks.
g. Data Dependency: FNNs require large amounts of labeled data for training, and their performance heavily depends on the quality and quantity of the training data. In situations where labeled data is scarce or expensive to obtain, training an FNN can be challenging.
h. Computation Intensive: Training deep FNNs with many layers and neurons can be computationally intensive and may require specialized hardware, such as GPUs or TPUs, to achieve reasonable training times.
i. Local Minima: FNNs are susceptible to getting stuck in local minima during the optimization process. While techniques like stochastic gradient descent help escape local minima to some extent, it’s still possible for the model to converge to suboptimal solutions.

Thank you for reading until the end.