Curse of Dimensionality and Handling It

  • Definition: As the number of features grows, the data becomes increasingly sparse in the high-dimensional feature space, so models need far more samples to learn patterns reliably.
  • Handling Techniques:
    1. Dimensionality Reduction: Use PCA or autoencoders to project features into a lower-dimensional space (t-SNE is mainly for visualization, not for producing model features); see the PCA sketch after this list.
    2. Feature Selection: Remove irrelevant or redundant features based on correlation, feature importance, or statistical tests.
    3. Regularization: Apply L1 (Lasso) to enforce sparsity in the model.
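
A minimal PCA sketch for technique 1, assuming scikit-learn is installed; the data is synthetic and deliberately low-rank so a few components suffice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))                      # 5 underlying factors
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

X_scaled = StandardScaler().fit_transform(X)            # PCA is scale-sensitive

pca = PCA(n_components=0.95)        # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)              # far fewer columns than the original 50
```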

What is Hypothesis Testing?

  • Definition: A statistical method to test an assumption (null hypothesis) about a population parameter.
  • Steps:
    1. State the null (H0) and alternative (H1) hypotheses.
    2. Choose a significance level (α, e.g., 0.05).
    3. Calculate the test statistic.
    4. Compare the p-value with α or check the critical region.
    5. Reject or fail to reject the null hypothesis.
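
A minimal sketch of these steps with a two-sample t-test, assuming SciPy is available; the samples are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

alpha = 0.05                                          # step 2: significance level
t_stat, p_value = stats.ttest_ind(group_a, group_b)   # step 3: test statistic

# steps 4-5: compare the p-value with alpha and decide
if p_value < alpha:
    print(f"p={p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p={p_value:.4f} >= {alpha}: fail to reject H0")
```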

What is Data Imbalance?

  • Definition: When the classes in a dataset are not equally represented (e.g., 90% Class A, 10% Class B).
  • Solutions:
    1. Oversampling: Use techniques like SMOTE to balance the classes.
    2. Undersampling: Reduce the majority class.
    3. Class Weights: Modify model loss to penalize misclassifications of the minority class.
    4. Ensemble Methods: Use algorithms like Random Forest or XGBoost with balanced sampling.
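
A minimal sketch of solution 3 (class weights), assuming scikit-learn; the 90/10 dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" reweights the loss inversely to class frequency,
# so minority-class errors cost more during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```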

Explain XGBoost Algorithm

  • Definition: A gradient boosting algorithm optimized for speed and performance.
  • Features:
    1. Tree Pruning: Grows trees to max_depth, then prunes splits backward when their gain falls below the gamma (min_split_loss) threshold, preventing overfitting.
    2. Regularization: L1 and L2 are applied for better generalization.
    3. Handling Missing Values: Automatically learns the best split for missing values.
    4. Parallel Processing: Speeds up computation.
  • Use Case: Widely used in competitions due to its high performance on structured/tabular data.
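
A minimal usage sketch, assuming the xgboost package is installed; the hyperparameter values are illustrative, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    max_depth=4,          # limits tree depth (pruning-related)
    reg_alpha=0.1,        # L1 regularization
    reg_lambda=1.0,       # L2 regularization
    n_estimators=200,
)
model.fit(X_train, y_train)     # missing values (NaN) are handled natively
print(model.score(X_test, y_test))
```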

Correlation vs. Covariance

  • Correlation: Measures the strength and direction of a linear relationship between two variables (−1 to 1).
  • Covariance: Indicates the direction of the linear relationship but not its strength; its magnitude depends on the variables’ units, so it is unbounded and not comparable across variable pairs.
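
A quick NumPy illustration of the difference: rescaling a variable inflates covariance but leaves correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

print(np.cov(x, y)[0, 1])           # covariance: unbounded, unit-dependent
print(np.corrcoef(x, y)[0, 1])      # correlation: always in [-1, 1]

# multiplying x by 1000 blows up covariance; correlation is unchanged
print(np.cov(1000 * x, y)[0, 1])
print(np.corrcoef(1000 * x, y)[0, 1])
```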

Detecting Multicollinearity in a Dataset

  1. Correlation Matrix: Flag feature pairs with high absolute correlation (commonly |r| > 0.8).
  2. Variance Inflation Factor (VIF): Check VIF values (≥ 10 indicates multicollinearity). Both checks are sketched below.
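
A minimal sketch of both checks, assuming pandas and statsmodels; the DataFrame and its columns are synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.1, size=100),  # near-duplicate of a
                   "c": rng.normal(size=100)})

print(df.corr())                     # 1. correlation matrix: look for |r| > 0.8

# 2. VIF per feature (an intercept column is added first, as VIF assumes one)
X = sm.add_constant(df)
for i in range(1, X.shape[1]):       # skip the constant itself
    print(X.columns[i], variance_inflation_factor(X.values, i))
```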

Ways to Treat Multicollinearity

  1. Remove Variables: Drop highly correlated features.
  2. Combine Variables: Use PCA or dimensionality reduction.
  3. Regularization: Apply Ridge (L2) regression, which handles correlated predictors well, or Lasso (L1), which can drop redundant ones.

Deciding Features to Keep or Eliminate After Multicollinearity Test

  • Drop features with high VIF values or strong correlations.
  • Keep features with higher importance based on domain knowledge or model interpretability.

Explain Logistic Regression

  • Definition: A linear classification model for binary outcomes that predicts class probabilities.
  • Mechanism: Applies a sigmoid function to predict probabilities.
  • Output: Classifies based on a probability threshold (e.g., 0.5).
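
A minimal NumPy sketch of the mechanism; the weights and sample are illustrative, not learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])            # coefficients (assumed, for illustration)
b = 0.1                              # intercept (assumed)
x = np.array([1.5, 2.0])             # one sample

prob = sigmoid(w @ x + b)            # predicted P(y = 1 | x)
label = int(prob >= 0.5)             # classify with a 0.5 threshold
print(prob, label)
```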

Why Use Log Loss in Logistic Regression?

  • Log Loss (the negative log-likelihood of the model) heavily penalizes confident wrong predictions and, unlike squared error, gives logistic regression a convex optimization problem, encouraging well-calibrated probabilities; see the numeric sketch below.
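
A quick numeric check of this behavior, using the per-sample log loss formula in NumPy:

```python
import numpy as np

def log_loss(y, p):
    # binary cross-entropy for one sample: -(y*log(p) + (1-y)*log(1-p))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss(1, 0.9))   # confident and right: small loss (~0.105)
print(log_loss(1, 0.1))   # confident and wrong: large loss (~2.303)
```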

P-value and Its Significance

  • Definition: The probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true.
  • Significance: If p < α (e.g., p < 0.05), reject the null hypothesis.

Splitting Time Series Data and Evaluation Metrics

  1. Split: Use time-based splits (e.g., train-test split by dates).
  2. Metrics: Use MAE, RMSE, or MAPE for evaluation.
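
A minimal sketch of both points, assuming scikit-learn; the series and predictions are toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error

y = np.arange(100, dtype=float)       # toy series, ordered in time
X = y.reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)    # each test fold comes after its train fold
for train_idx, test_idx in tscv.split(X):
    print(train_idx[-1], "<", test_idx[0])   # no future data leaks into training

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])
print(mean_absolute_error(y_true, y_pred))            # MAE
print(np.sqrt(mean_squared_error(y_true, y_pred)))    # RMSE
```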

Model Deployment in Production

  1. Use tools like Flask, FastAPI, or cloud services (AWS, GCP).
  2. Retraining: Retrain based on data drift or periodic intervals.
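
A minimal FastAPI serving sketch, assuming fastapi and uvicorn are installed; model.pkl is a hypothetical pre-trained scikit-learn artifact:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:        # hypothetical artifact path
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]                   # one sample's feature vector

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}

# run with: uvicorn main:app --reload   (assuming this file is main.py)
```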

XGBoost Splits Decision

  • Splits are decided by maximizing a gain score computed from the first- and second-order gradients of the loss function at each node (not Gini index or entropy, which plain decision trees use); candidate splits whose gain does not exceed the gamma penalty are pruned. The gain formula is shown below.
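
For reference, the split-gain formula from the XGBoost paper, where G_L, G_R (H_L, H_R) are the sums of first-order (second-order) gradients in the left and right children, λ is the L2 regularization term, and γ is the per-split penalty:

```latex
\mathrm{Gain} = \frac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```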

Explain Precision and Recall

  • Precision: TP / (TP + FP). Focuses on reducing false positives.
  • Recall: TP / (TP + FN). Focuses on reducing false negatives.
  • Usage:
    • Use precision for scenarios where false positives are costly (e.g., spam detection).
    • Use recall for scenarios where false negatives are costly (e.g., medical diagnoses).
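
A quick check of the formulas with scikit-learn on toy labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]       # TP=2, FN=1, FP=1, TN=4

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.667
```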

What is the Activation Function?

  • Definition: Non-linear functions applied to each neuron’s weighted input, enabling the network to model non-linear relationships (e.g., ReLU, Sigmoid).
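
Two common activations in NumPy, for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)            # zero for negatives, identity otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes any input into (0, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z))
```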

Explain Naive Bayes

  • Definition: Probabilistic classifier based on Bayes’ theorem.
  • Assumption: Features are conditionally independent.
  • Usage: Text classification (e.g., spam detection).
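
A minimal text classification sketch, assuming scikit-learn; the four-document corpus is illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at noon",
         "free cash offer", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

# bag-of-words counts feed a multinomial Naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free prize meeting"]))
```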

Confusion Matrix

  • Matrix showing TP, FP, FN, and TN.
  • Used for evaluating classification models.
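
A minimal sketch with scikit-learn; for binary labels 0/1 the layout is [[TN, FP], [FN, TP]]:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_matrix(y_true, y_pred))   # [[4 1]
                                          #  [1 2]]
```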

Hyperparameters in Deep Learning

  • Examples: Learning rate, batch size, number of layers, dropout rate.

What is SMOTE?

  • Definition: Synthetic Minority Over-sampling Technique; it generates synthetic minority-class samples by interpolating between existing minority points and their nearest minority neighbors, balancing the classes.
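
A minimal sketch, assuming the imbalanced-learn package is installed; the 90/10 dataset is synthetic:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                     # imbalanced before

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                 # roughly balanced after
```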

Do you have experience working on Time Series?

  • Time series analysis typically involves models such as ARIMA, SARIMA, or Prophet, with chronological train-test splits (no shuffling, so future data never leaks into training).
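
A minimal ARIMA sketch, assuming statsmodels; the series is a synthetic random walk and the (1, 1, 1) order is illustrative, not tuned:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))      # toy random-walk series

train = series[:180]                          # chronological split, no shuffling
model = ARIMA(train, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=20)           # predict the held-out horizon
print(forecast[:5])
```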

Code Analysis of Global Variable

  • Global variables can lead to unintended side effects and should be minimized or managed carefully.
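
A small example of the risk, with hypothetical names:

```python
counter = 0                           # global (module-level) variable

def increment():
    global counter                    # rebinding requires the global keyword
    counter += 1
    return counter

print(increment())                    # 1
print(increment())                    # 2 -- same call, different result

# safer alternative: pass state explicitly and return the new value
def increment_pure(value):
    return value + 1
```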

What is L1 and L2 Regularization?

  • L1 Regularization: Penalizes the absolute values of coefficients, leading to sparsity (feature selection).
  • L2 Regularization: Penalizes the squared values of coefficients, shrinking them toward zero (but rarely exactly to zero), which reduces overfitting; see the sketch below.
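
A minimal sketch contrasting the two, assuming scikit-learn; the data is synthetic and the alpha values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features but only 5 actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))   # L1: many coefficients driven exactly to zero
print(np.sum(ridge.coef_ == 0))   # L2: typically none -- shrinks, doesn't select
```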

What is Multicollinearity?

  • High correlation between independent variables, making it hard to interpret model coefficients.

What is Entropy and Information Gain?

  • Entropy: Measures randomness or impurity in data.
  • Information Gain: Reduction in entropy after a split in decision trees.
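
These definitions in NumPy, applied to a toy perfect split:

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 50/50 classes: entropy = 1.0 bit
left, right = parent[:4], parent[4:]          # a perfect split into pure halves

gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(gain)                                   # 1.0: the split removes all impurity
```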

Steps in Making a Decision Tree

  1. Calculate entropy for the dataset.
  2. Compute information gain for all features.
  3. Split the data on the feature with the highest information gain.
  4. Repeat recursively until leaf nodes are pure or a stopping criterion is met.
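
A minimal scikit-learn sketch; criterion="entropy" makes the tree choose splits by information gain, as in the steps above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy",  # split by information gain
                              max_depth=3,          # stopping criterion
                              random_state=0)
tree.fit(X, y)
print(tree.score(X, y))
```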

Techniques Used for Sampling

  1. Random Sampling
  2. Stratified Sampling
  3. Systematic Sampling
  4. Cluster Sampling
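
Minimal sketches of techniques 1 and 2, assuming pandas ≥ 1.1 (for GroupBy.sample); the DataFrame and its "label" column are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(100),
                   "label": ["A"] * 80 + ["B"] * 20})

random_sample = df.sample(n=10, random_state=0)            # 1. random sampling

# 2. stratified sampling: draw the same fraction from each label group
stratified = df.groupby("label", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["label"].value_counts())                  # 8 A, 2 B
```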