What is synthetic data?
Data generated by a computer, intended to replicate or augment existing data.
Why is it useful?
We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society and have triggered many claims that we are rapidly approaching Artificial General Intelligence: AI capable of replicating any human function.
Before getting too excited, or scared, depending on your perspective, note that we are also rapidly approaching a hurdle to the advancement of these language models. According to a paper published by a group from the research institute Epoch [1], we are running out of data. They estimate that by 2028 we will have reached the upper limit of possible data upon which to train language models.

What happens if we run out of data?
Well, if we run out of data then we aren't going to have anything new with which to train our language models. These models will then stop improving. If we want to pursue Artificial General Intelligence then we are going to have to come up with new ways of improving AI without just increasing the volume of real-world training data.
One potential saviour is synthetic data, which can be generated to mimic existing data and has already been used to improve the performance of models like Gemini and DBRX.
Synthetic data beyond LLMs
Beyond overcoming data scarcity for large language models, synthetic data can be used in the following situations:
- Sensitive data: if we don't want to share or use sensitive attributes, synthetic data can be generated which mimics the properties of these features while maintaining anonymity.
- Expensive data: if collecting data is expensive we can generate a large volume of synthetic data from a small amount of real-world data.
- Lack of data: datasets are biased when there is a disproportionately low number of individual data points from a particular group. Synthetic data can be used to balance a dataset.
Imbalanced datasets
Imbalanced datasets can (*but not always*) be problematic as they may not contain enough information to effectively train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as men.
In this article we show the imbalance in the popular UCI Adult dataset [2], and how we can use a variational autoencoder to generate synthetic data to improve classification on this example.
We first download the Adult dataset. This dataset contains features such as age, education and occupation which can be used to predict the target outcome 'income'.
import pandas as pd
import matplotlib.pyplot as plt

# Download dataset into a dataframe
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain",
    "capital-loss", "hours-per-week", "native-country", "income"
]
# Note: skipinitialspace=True strips the leading space, so the "?" markers
# appear not to match na_values=" ?"; use na_values="?" if rows with
# missing values should really be dropped.
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
# Drop rows with missing values
data = data.dropna()
# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0}).values
# Plot distribution of income
plt.figure(figsize=(8, 6))
plt.hist(data['income'], bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
In the Adult dataset, income is a binary variable, representing individuals who earn above, and below, $50,000. We plot the distribution of income over the entire dataset below. We can see that the dataset is heavily imbalanced, with a far larger number of individuals who earn less than $50,000.
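To put a number on the imbalance rather than reading it off a histogram, the class shares can be computed directly. The snippet below is a minimal sketch using only the standard library; `class_proportions` and the toy labels are hypothetical names and values made up for illustration, not the real Adult counts.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label as a dict of label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels, made up for illustration (not the real Adult counts)
toy_labels = ["<=50K"] * 3 + [">50K"] * 1
proportions = class_proportions(toy_labels)
print(proportions)
```

On a pandas Series such as the income column loaded above, `value_counts(normalize=True)` should give the same fractions in one call.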

Despite this imbalance we can still train a machine learning classifier on the Adult dataset, which we can use to determine whether unseen, or test, individuals should be classified as earning above, or below, 50k.
import numpy as np
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing: One-hot encode categorical features, scale numerical features
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
    "workclass", "education", "marital-status", "occupation", "relationship",
    "race", "sex", "native-country"
]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(), categorical_features)
    ]
)
X_processed = preprocessor.fit_transform(X)
# Convert to numpy array for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.astype(np.float32)
# Split dataset into train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(
    X_processed, y_processed, test_size=0.2, random_state=42
)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
# Compute and display confusion matrix
cm = confusion_matrix(y_model_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Printing out the confusion matrix of our classifier shows that our model performs fairly well despite the imbalance. Our model has an overall error rate of 16%, but the error rate for the positive class (income > 50k) is 36% whereas the error rate for the negative class (income <= 50k) is 8%.
This discrepancy shows that the model is indeed biased towards the negative class. The model is frequently incorrectly classifying individuals who earn more than 50k as earning less than 50k.
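For readers who want to see how such per-class error rates fall out of a confusion matrix, here is a minimal sketch. It assumes the scikit-learn layout (rows are actual classes, columns are predicted classes, ordered negative then positive); `error_rates` is a hypothetical helper and the toy matrix is made up, so the numbers are not the ones reported above.

```python
import numpy as np


def error_rates(cm):
    """Overall and per-class error rates from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes,
    both ordered [negative, positive].
    """
    cm = np.asarray(cm, dtype=float)
    tn, fp = cm[0]  # actual negatives: correctly / wrongly classified
    fn, tp = cm[1]  # actual positives: wrongly / correctly classified
    return {
        "overall": (fp + fn) / cm.sum(),
        "negative": fp / (tn + fp),
        "positive": fn / (fn + tp),
    }


# Toy confusion matrix, made up for illustration
toy_cm = [[90, 10], [20, 30]]
rates = error_rates(toy_cm)
print(rates)
```

The per-class rates divide by the number of actual samples in each class, which is why a minority class can have a high error rate while the overall rate still looks healthy.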
Below we show how we can use a Variational Autoencoder to generate synthetic data of the positive class to balance this dataset. We then train the same model using the synthetically balanced dataset and reduce model errors on the test set.

How can we generate synthetic data?
There are lots of different methods for generating synthetic data. These include more traditional methods such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as Variational Autoencoders or Generative Adversarial Networks are predisposed to generate new data, as their architectures learn the distribution of real data and use it to generate synthetic samples.
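As a point of comparison with the generative approach, the Gaussian noise idea can be sketched in a few lines of NumPy: pick existing minority rows at random and jitter them. This is only an illustrative sketch; `gaussian_noise_oversample`, its `noise_scale` parameter and the toy array are assumptions of this example, not part of any library, and SMOTE differs in that it interpolates between neighbouring minority samples instead of adding noise.

```python
import numpy as np


def gaussian_noise_oversample(X_minority, n_new, noise_scale=0.05, seed=0):
    """Create n_new synthetic rows by adding Gaussian noise to minority rows."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pick existing minority rows at random, with replacement
    idx = rng.integers(0, len(X_minority), size=n_new)
    # Scale the noise by each feature's standard deviation
    feature_std = X_minority.std(axis=0)
    noise = rng.standard_normal((n_new, X_minority.shape[1])) * noise_scale * feature_std
    return X_minority[idx] + noise


# Toy minority-class features, made up for illustration
X_min = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = gaussian_noise_oversample(X_min, n_new=5)
print(X_new.shape)
```

Methods like this only produce points close to samples that already exist, which is the main contrast with models that learn the data distribution.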
In this tutorial we use a variational autoencoder to generate synthetic data.
Variational Autoencoders
Variational Autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data which closely resembles existing data. The continuity of this space is one of their big selling points, as it means the model generalises well and doesn't just memorise the latent space of specific inputs.
A VAE consists of an encoder, which maps input data into a probability distribution (mean and variance), and a decoder, which reconstructs the data from the latent space.
For that continuous latent space, VAEs use a reparameterization trick, where a random noise vector is scaled and shifted using the learned mean and variance, ensuring smooth and continuous representations in the latent space.
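The scale-and-shift step can be checked numerically outside of any neural network. The sketch below uses NumPy as a stand-in for PyTorch; the `mu` and `logvar` values are made up, whereas in a VAE they would come from the encoder. Sampling z = mu + eps * std with eps drawn from a standard normal should give samples whose mean and standard deviation match the chosen parameters.

```python
import numpy as np


def reparameterize(mu, logvar, n_samples, rng):
    """Draw samples z = mu + eps * std, with eps ~ N(0, I)."""
    std = np.exp(0.5 * logvar)  # log-variance -> standard deviation
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + eps * std


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
logvar = np.log(np.array([0.25, 4.0]))  # variances 0.25 and 4, i.e. std 0.5 and 2
z = reparameterize(mu, logvar, n_samples=20000, rng=rng)
print(z.mean(axis=0), z.std(axis=0))
```

Because the randomness lives entirely in `eps`, the sample is a deterministic function of the mean and log variance, which is what lets gradients flow back to the encoder during training.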
Below we construct a BasicVAE class which implements this process with a simple architecture.
- The encoder compresses the input into a smaller, hidden representation, producing both a mean and log variance that define a Gaussian distribution, aka creating our magic sampling bucket. Instead of directly sampling, the model applies the reparameterization trick to generate latent variables, which are then passed to the decoder.
- The decoder reconstructs the original data from these latent variables, ensuring the generated data maintains characteristics of the original dataset.
import torch
import torch.nn as nn


class BasicVAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder: Single small layer
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU()
        )
        # Two heads: mean and log variance of the latent Gaussian
        self.fc_mu = nn.Linear(8, latent_dim)
        self.fc_logvar = nn.Linear(8, latent_dim)
        # Decoder: Single small layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 8),
            nn.ReLU(),
            nn.Linear(8, input_dim),
            nn.Sigmoid()  # Outputs values in range [0, 1]
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        # Scale and shift standard normal noise by the learned std and mean
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
Given our BasicVAE architecture we construct our loss function and model training below.
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def vae_loss(recon_x, x, mu, logvar, tau=0.5, c=1.0):
    # Reconstruction loss (tau and c are unused in this basic version)
    recon_loss = nn.MSELoss()(recon_x, x)
    # KL Divergence Loss
    kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld_loss / x.size(0)


def train_vae(model, data_loader, epochs, learning_rate):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    model.train()
    losses = []
    reconstruction_mse = []
    for epoch in range(epochs):
        total_loss = 0
        total_mse = 0
        for batch in data_loader:
            batch_data = batch[0]
            optimizer.zero_grad()
            reconstructed, mu, logvar = model(batch_data)
            loss = vae_loss(reconstructed, batch_data, mu, logvar)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # Compute batch-wise MSE for comparison
            mse = nn.MSELoss()(reconstructed, batch_data).item()
            total_mse += mse
        # Store and report the per-batch averages for this epoch
        losses.append(total_loss / len(data_loader))
        reconstruction_mse.append(total_mse / len(data_loader))
        print(f"Epoch {epoch+1}/{epochs}, Loss: {losses[-1]:.4f}, MSE: {reconstruction_mse[-1]:.4f}")
    return losses, reconstruction_mse


# Append the income label as the last column so the VAE models features and label jointly
combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)
# Train-test split
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)
batch_size = 128
# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)
basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)
basic_losses, basic_mse = train_vae(
    basic_vae, train_loader, epochs=50, learning_rate=0.001,
)
# Visualize results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()
vae_loss consists of two components: reconstruction loss, which measures how well the generated data matches the original input using Mean Squared Error (MSE), and KL divergence loss, which pushes the learned latent distribution towards a standard normal distribution.
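Written out, the two components of vae_loss correspond to the expressions below. The notation is an assumption of this note rather than something defined in the code: d is the latent dimension, B is the batch size, and the encoder's logvar output is read as the log of the variance, so its exponential is the variance itself.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, I)\big)
  &= -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right), \\
\mathcal{L}
  &= \mathrm{MSE}(\hat{x}, x) + \frac{1}{B} \sum_{i=1}^{B} D_{\mathrm{KL}}^{(i)} .
\end{aligned}
```

The first line is the closed-form KL divergence between a diagonal Gaussian and a standard normal. The second line mirrors how the code sums the KL term over the whole batch and then divides by the batch size.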
train_vae optimises the VAE using the Adam optimizer over multiple epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using vae_loss. These errors are then corrected via backpropagation, where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases over training.
We can see that our model quickly learns how to reconstruct our data, evidencing efficient learning.

Now that we have trained our BasicVAE to accurately reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more samples of the positive class (individuals who earn over 50k) in order to balance out the classes and remove the bias from our model.
To do this we select all the samples from our VAE dataset where income is the positive class (earn more than 50k). We then encode these samples into the latent space. As we have only selected samples of the positive class to encode, this latent space will reflect properties of the positive class, which we can sample from to create synthetic data.
We sample 15000 new samples from this latent space and decode these latent vectors back into the input data space as our synthetic data points.
# Assumption: sample_df is the VAE training data (features plus income label as the last column)
sample_df = pd.DataFrame(X_train)
# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names
# Define the feature value to filter
feature_value = 1.0  # Specify the feature value - here we set the income to 1
# Keep only the rows where income is 1 : Over 50k
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)
basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
    mu, logvar = basic_vae.encode(selected_samples_tensor)
    latent_vectors = basic_vae.reparameterize(mu, logvar)
# Compute the mean latent vector for this feature
mean_latent_vector = latent_vectors.mean(dim=0)
num_samples = 15000  # Number of new samples
latent_dim = 8
# Sample around the positive-class mean with a small amount of Gaussian noise
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)
with torch.no_grad():
    generated_samples = basic_vae.decode(latent_samples)
Now that we have generated synthetic data of the positive class, we can combine this with the original training data to generate a balanced synthetic dataset.
new_data = pd.DataFrame(generated_samples.numpy())
# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names
# Drop the decoded income column and label every synthetic sample as positive
X_synthetic = new_data.drop(col_names[-1], axis=1)
y_synthetic = np.ones(X_synthetic.shape[0], dtype=np.float32)
X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)
mapping = {1: '>50K', 0: '<=50K'}
map_function = np.vectorize(lambda x: mapping[x])
# Apply mapping
y_mapped = map_function(y_synthetic_train)
plt.figure(figsize=(8, 6))
plt.hist(y_mapped, bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

We can now use our balanced synthetic training dataset to retrain our random forest classifier. We can then evaluate this new model on the original test data to see how effective our synthetic data is at reducing the model bias.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
cm = confusion_matrix(y_model_test, y_pred)
# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Our new classifier, trained on the balanced synthetic dataset, makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset, and our error rate is now reduced to 14%.

However, we have not been able to reduce the discrepancy in errors by a significant amount; our error rate for the positive class is still 36%. This could be due to the following reasons:
- We have discussed how one of the benefits of VAEs is the learning of a continuous latent space. However, if the majority class dominates, the latent space might skew towards the majority class.
- The model may not have properly learned a distinct representation for the minority class due to the lack of data, making it hard to sample from that region accurately.
In this tutorial we have introduced and built a BasicVAE architecture which can be used to generate synthetic data which improves the classification accuracy on an imbalanced dataset.
Follow for future articles where I'll show how we can build more sophisticated VAE architectures which address the above problems with imbalanced sampling and more.
[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325.
[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.