Saturday, January 18, 2025

Understanding Overfitting and Underfitting in Machine Learning

In the realm of machine learning, overfitting and underfitting are common challenges that impede the performance of models. These issues are central to the capacity of a model to generalize well, ultimately affecting its usefulness in providing accurate and reliable predictions.

 

What Are Overfitting and Underfitting?

Before delving deep into the implications of overfitting and underfitting, it's crucial to comprehend several fundamental concepts that underpin these phenomena. The terms "signal" and "noise" are pivotal in understanding the behaviour of machine learning models. Signal refers to the true underlying pattern of data that facilitates learning, while noise encompasses irrelevant and extraneous data that diminishes performance.

Similarly, bias and variance play crucial roles in model evaluation. Bias is the prediction error that comes from oversimplifying the learning algorithm, whereas variance measures how sensitive the model is to the particular training data it saw; a high-variance model performs well on the training data but struggles with the test data.

 

Overfitting: An In-Depth Analysis

Overfitting occurs when a machine learning model tries to fit every data point in the dataset, capturing more detail than the underlying pattern actually warrants. As a result, the model absorbs noise and inaccuracies from the data, undermining its efficiency and accuracy. Overfitted models typically show low bias and high variance, so their predictions can deviate markedly from the expected outcome on new data.

A classic example of overfitting is a regression model whose fitted curve bends to pass through every training point: it matches the training data almost perfectly but produces poor predictions on new data.

 

Mitigating Overfitting: Techniques and Strategies

Several techniques can be used to reduce overfitting, including cross-validation, enlarging the training dataset, feature selection, early stopping, regularization, and ensembling. These strategies push the model toward balance and generalization instead of memorizing the training data. The sketch below illustrates two of them.
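As a rough illustration (not from the original post), here is a minimal Python sketch using scikit-learn: it fits the same high-degree polynomial twice, once without regularization and once with ridge regularization, and compares the two with cross-validation. The synthetic data, the degree of 15, and alpha=1.0 are arbitrary choices for demonstration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)   # signal + noise

# An unregularized high-degree polynomial tends to chase the noise.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())

# The same features with ridge regularization, which penalizes extreme coefficients.
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))

for name, model in [("no regularization", overfit), ("ridge", regularized)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean CV R^2:", round(scores.mean(), 3))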

Understanding Underfitting and Counteracting It

Conversely, underfitting occurs when a machine learning model fails to capture the underlying trend in the data. This can happen when training is stopped too early or the model is too simple, so it never learns the patterns and relationships in the data. Underfitted models show high bias and low variance, which leads to unreliable and inaccurate predictions.

A linear regression model fitted to clearly non-linear data illustrates underfitting: the straight line cannot follow the data points, reflecting how little the model has learned from the dataset.

 

Strategies to Combat Underfitting

To avoid underfitting, measures such as training for longer and adding more features can help. Both give the model more capacity to learn from the training data and to capture the dominant trend in the dataset, as the sketch below shows.
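Here is a minimal Python sketch (again with made-up synthetic data) showing how adding a feature can fix an underfit model: a plain linear regression versus the same regression with a squared feature added.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)   # quadratic trend plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

underfit = LinearRegression().fit(X_train, y_train)                 # a straight line
richer = make_pipeline(PolynomialFeatures(degree=2),
                       LinearRegression()).fit(X_train, y_train)    # adds an x^2 feature

print("plain linear R^2:", round(underfit.score(X_test, y_test), 3))
print("with x^2 feature:", round(richer.score(X_test, y_test), 3))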

 

Striving for Goodness of Fit

The ultimate goal of a machine learning model is a good fit: the point where it strikes a balance between underfitting and overfitting. At that point the model makes predictions with minimal error and generalizes well to unseen data.

There are several ways to detect and reach a good fit, including resampling techniques to estimate model accuracy and the use of a held-out validation dataset, as in the sketch below.
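A simple way to see this in code is to compare training and validation scores: a large gap suggests overfitting, while two low scores suggest underfitting. The sketch below does this for a few polynomial degrees on synthetic data; the specific degrees and dataset are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=150)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print("degree", degree,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "validation R^2:", round(model.score(X_val, y_val), 3))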

 

Final Thoughts

Overfitting and underfitting come up constantly in machine learning, which is why robust strategies to mitigate them matter. By combining careful model evaluation, feature engineering, and regularization, practitioners can navigate these challenges and build models that are resilient, precise, and reliable.

Sunday, December 01, 2024

How to find where a specific table or view is used in a SQL Server database

Here is how we can find where a table or view is used in a SQL Server database. The query below lists the table or view together with every object that references it.

select schema_name(o.schema_id) + '.' + o.name as [table],
       'is used by' as ref,
       schema_name(ref_o.schema_id) + '.' + ref_o.name as [object],
       ref_o.type_desc as object_type
from sys.objects o
join sys.sql_expression_dependencies dep
     on o.object_id = dep.referenced_id
join sys.objects ref_o
     on dep.referencing_id = ref_o.object_id
where o.type in ('V', 'U')
      --and schema_name(o.schema_id) = 'dbo'  -- put schema name here
      and o.name = 'AI_Asset_Info'   -- put table/view name here
order by [object]
  

Hope this helps!

Tuesday, November 26, 2024

How to Copy Git Repository Without History

There are several ways to do this, using git clone, git push, or git archive, but I personally prefer the approach based on git clone.

The objective is to copy the source repository (SourceRepo) to a new repository (NewRemote) without its commit history.

Precautions before you proceed with this:

  1. Ensure you have write access to the repository.
  2. Backup any important local changes before proceeding.
  3. This will permanently remove the old commit history.
  4. Collaborators will need to re-clone the repository.

Here are the step-by-step git commands for this specific repo:

# 1. Clone the source repository
git clone https://github.com/inagasai/SourceRepo.App.git

# 2. Enter the cloned repository directory
cd SourceRepo.App

# 3. Verify current branches
git branch -a

# 4. Checkout master branch
git checkout master

# 5. Create a new branch without history
git checkout --orphan clean-main

# 6. Add all files to the new branch
git add .

# 7. Commit the files with a new initial commit
git commit -m "Initial commit - reset repository history"

# 8. Delete the old main branch (if it exists)
git branch -D main 2>/dev/null

# 9. Rename current branch to main
git branch -m main

# 10. Remove the original remote
git remote remove origin

# 11. Add the new repository as the remote
git remote add origin https://github.com/inagasai/NewRemote.App.git

# 12. Force push to overwrite the remote repository
git push -f origin main
  

Detailed Breakdown of the outcome:

  1. This process creates a new branch with no commit history.
  2. It adds all existing files to a new initial commit.
  3. Force pushes to overwrite the remote repository.
  4. Removes all previous commit history.

Hope this helps.

Monday, November 04, 2024

Using multiple environments in ASP.NET Core

ASP.NET Core configures app behavior based on the runtime environment using an environment variable.

IHostEnvironment.EnvironmentName can be set to any value, but the following values are provided by the framework:

  • Development : The launchSettings.json file sets ASPNETCORE_ENVIRONMENT to Development on the local machine.
  • Staging
  • Production : The default if DOTNET_ENVIRONMENT and ASPNETCORE_ENVIRONMENT have not been set.

When comparing appsettings.Development.json and appsettings.json, the key difference lies in when each file is applied. appsettings.json is the base configuration that is always loaded, while appsettings.Development.json overrides it when the app runs in the Development environment; matching files can exist for Staging and Production.

The .Development.json file often holds settings that only make sense locally, such as development database connection strings or test API keys, and teams typically keep such secrets out of source control. In contrast, appsettings.json contains the non-sensitive, shared configuration that is committed to source control and used as the base in every environment, including production.

Here is how this can be done in the Program.cs file:

public class Program
{
    public static void Main(string[] args)
    {
        BuildWebHost(args).Run();
    }

    public static IWebHost BuildWebHost(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .ConfigureAppConfiguration((context, config) =>
            {
                var env = context.HostingEnvironment;
                config.AddJsonFile("appsettings.json", optional: false, reloadOnChange: true)
                      .AddJsonFile($"appsettings.{env.EnvironmentName}.json", optional: true, reloadOnChange: true);
            })
            .Build();
}
  

Here is a sample appsettings.json file:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "ApiSettings": {
  },
  "AllowedHosts": "*",
  "isLocal": "1",
  "Email": {
  },
  "LanguageService": {
  }
}
  

Which of appsettings.json, appsettings.Staging.json, or appsettings.Production.json gets applied is controlled by the environment set in launchSettings.json.

Here is how it looks

{
  "iisSettings": {
    "windowsAuthentication": false,
    "anonymousAuthentication": true,
    "iisExpress": {
      "applicationUrl": "http://localhost:44459",
      "sslPort": 44393
    }
  },
  "profiles": {
    "Client.PWA": {
      "commandName": "Project",
      "dotnetRunMessages": true,
      "launchBrowser": true,
      "applicationUrl": "http://localhost:5198",
      "environmentVariables": {
        "ASPNETCORE_ENVIRONMENT": "Development" //Development//Staging//Production
      }
    }
  }
}
  

The ASPNETCORE_ENVIRONMENT value in the launchSettings.json above determines which configuration file gets picked up. In this case it is looking for the Development settings; I have Staging and Production configured the same way.

This approach helps maintain secure practices while allowing for different configuration settings between environments.

Thursday, October 03, 2024

What is Similarity Search?

Have you ever wondered how systems find things that are similar to what you're looking for, especially when the search terms are vague or have multiple variations? This is where similarity search comes into play, making it possible to find similar items efficiently.

Similarity search is a method for finding data that is similar to a query based on the data's intrinsic characteristics. It's used in many applications, including search engines, recommendation systems, and databases. The search process can be based on various techniques, including Boolean algebra, cosine similarity, and edit distances.
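As a small illustration of one of those techniques, here is a toy Python implementation of edit (Levenshtein) distance: the number of single-character inserts, deletes, and substitutions needed to turn one string into another. It is only a sketch for intuition, not a production implementation.

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(edit_distance("similarity", "similarly"))  # 2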

 

Vector Representations: In technology, we represent real-world items and concepts as sets of continuous numbers called vector embeddings. These embeddings help us understand the closeness of objects in a mathematical space, capturing their deeper meanings.

 

Calculating Distances: To gauge similarity, we measure the distance between these vector representations. There are different ways to do this, such as Euclidean, Manhattan, Cosine, and Chebyshev metrics. Each method helps us understand the similarity between objects based on their vector representations.
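Here is a quick Python sketch that computes all four of those distances between two made-up vectors, just to make the definitions concrete.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

euclidean = np.linalg.norm(a - b)      # straight-line distance
manhattan = np.abs(a - b).sum()        # sum of absolute differences
chebyshev = np.abs(a - b).max()        # largest single-axis difference
cosine_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 - cosine similarity

print(euclidean, manhattan, chebyshev, cosine_dist)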

 

Performing the Search: Once we have the vector representations and understand the distances between them, it's time to perform the search. This is where the concept of similarity search comes in. Given a set of vectors and a query vector, the task is to find the most similar items in the set for the query. This is known as nearest neighbour search.
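A minimal sketch of exact nearest neighbour search in Python: score a query vector against a toy set of random embeddings by cosine similarity and take the top results. The data here is random and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))   # toy "database" of 1000 embeddings
query = rng.normal(size=64)

# Normalize so that a dot product equals cosine similarity.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = vectors @ query                # cosine similarity against every item
top_k = np.argsort(-scores)[:5]         # indices of the 5 most similar items
print(top_k, scores[top_k])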

 

Challenges and Solutions: Searching through millions of vectors can be very inefficient, which is where approximate nearest neighbour search comes into play. It provides a close approximation of the nearest neighbours, allowing searches to scale efficiently, especially on massive datasets. Techniques like indexing, clustering, hashing, and quantization significantly improve computation and storage at the cost of some loss in accuracy.
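To give a flavour of the hashing idea, here is a toy Python sketch of random-projection locality-sensitive hashing: each vector gets a short binary code from a handful of random hyperplanes, and the query is compared only against vectors that landed in the same bucket. Real systems use multiple hash tables and tuned parameters; everything here (the 8 hyperplanes, the random data) is an assumption for illustration.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))
query = rng.normal(size=64)

planes = rng.normal(size=(8, 64))              # 8 random hyperplanes -> 8-bit codes

def code(v):
    return tuple((planes @ v > 0).astype(int)) # which side of each hyperplane

# Index every vector into a bucket keyed by its binary code.
buckets = defaultdict(list)
for idx, v in enumerate(vectors):
    buckets[code(v)].append(idx)

# Only score the candidates that share the query's bucket.
candidates = buckets.get(code(query), [])
if candidates:
    sims = vectors[candidates] @ query
    best = candidates[int(np.argmax(sims))]
    print("approximate nearest neighbour:", best)
else:
    print("empty bucket - in practice, use several hash tables or fewer bits")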

 

Conclusion: Similarity search is a powerful tool for finding similar items in vast datasets. By understanding the basics of this concept, we can make search systems more efficient and effective, providing valuable insights into the world of technology.

 

In summary, similarity search simplifies the process of finding similar items and is an essential tool in our technology-driven world.