Curse of Dimensionality and Handling It
- Definition: When the number of features increases, the data becomes sparse, making it harder for models to learn patterns effectively.
- Handling Techniques:
- Dimensionality Reduction: Use PCA, t-SNE, or Autoencoders to reduce features (see the PCA sketch after this list).
- Feature Selection: Remove irrelevant or redundant features based on correlation, feature importance, or statistical tests.
- Regularization: Apply L1 (Lasso) to enforce sparsity in the model.
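A minimal sketch of the dimensionality-reduction option above, using scikit-learn's PCA; the data is a random low-rank placeholder and the 95% variance threshold is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder data: 100 samples whose 50 features are driven by 5 latent factors
latent = rng.normal(size=(100, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(100, 50))

pca = PCA(n_components=0.95)      # keep enough components to explain 95% of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)            # close to 5 columns, not 50
```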
What is Hypothesis Testing?
- Definition: A statistical method to test an assumption (null hypothesis) about a population parameter.
- Steps:
- State the null (H0) and alternative (H1) hypotheses.
- Choose a significance level (α, e.g., 0.05).
- Calculate the test statistic.
- Compare the p-value with α or check the critical region.
- Reject or fail to reject the null hypothesis.
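As a concrete illustration of these steps, a one-sample t-test with SciPy on made-up data, testing H0: population mean = 50:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=30)   # placeholder sample

alpha = 0.05                                    # significance level
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```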
What is Data Imbalance?
- Definition: When the classes in a dataset are not equally represented (e.g., 90% Class A, 10% Class B).
- Solutions:
- Oversampling: Use techniques like SMOTE to balance the classes.
- Undersampling: Reduce the majority class.
- Class Weights: Modify the model's loss to penalize misclassifications of the minority class more heavily (see the sketch after this list).
- Ensemble Methods: Use algorithms like Random Forest or XGBoost with balanced sampling.
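A minimal sketch of the class-weights option with scikit-learn: class_weight="balanced" reweights the loss inversely to class frequency, shown here on a placeholder 90/10 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Errors on the minority class now cost proportionally more in the loss
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```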
Explain XGBoost Algorithm
- Definition: A gradient boosting algorithm optimized for speed and performance.
- Features:
- Tree Pruning: Grows trees to a maximum depth, then prunes back splits whose gain falls below the gamma threshold to prevent overfitting.
- Regularization: L1 and L2 penalties on leaf weights improve generalization.
- Handling Missing Values: Automatically learns the best split for missing values.
- Parallel Processing: Speeds up computation.
- Use Case: Widely used in competitions due to its high performance on structured/tabular data.
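A minimal training sketch, assuming the xgboost package is installed and using random placeholder data; the parameter values are arbitrary:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

model = XGBClassifier(
    n_estimators=100,   # number of boosting rounds
    max_depth=4,        # limits tree depth before pruning
    reg_alpha=0.1,      # L1 penalty on leaf weights
    reg_lambda=1.0,     # L2 penalty on leaf weights
)
model.fit(X, y)         # NaN inputs are routed automatically at split time
```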
Correlation vs. Covariance
- Correlation: Measures the strength and direction of a linear relationship between two variables (−1 to 1).
- Covariance: Indicates the direction of the linear relationship, but its magnitude depends on the variables’ units, so it is unbounded and not comparable across datasets.
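A quick numeric illustration with NumPy: rescaling a variable changes the covariance but leaves the correlation untouched:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.3, -0.1, 0.0])

print(np.cov(x, y)[0, 1])             # covariance, in the variables' units
print(np.corrcoef(x, y)[0, 1])        # correlation, always within [-1, 1]
print(np.cov(100 * x, y)[0, 1])       # 100x the covariance after rescaling x
print(np.corrcoef(100 * x, y)[0, 1])  # same correlation as before
```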
Detecting Multicollinearity in a Dataset
- Correlation Matrix: Flag feature pairs with high absolute correlation (e.g., |r| > 0.8).
- Variance Inflation Factor (VIF): Check VIF values (≥10 indicates multicollinearity).
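A minimal VIF sketch with statsmodels on a placeholder DataFrame containing a nearly collinear column; adding a constant first is the standard recipe:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
X = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
X["c"] = 0.9 * X["a"] + rng.normal(scale=0.1, size=100)  # nearly collinear with "a"

Xc = add_constant(X)  # intercept column, so VIFs are computed correctly
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)  # "a" and "c" should show high VIF; "b" stays near 1
```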
Ways to Treat Multicollinearity
- Remove Variables: Drop highly correlated features.
- Combine Variables: Use PCA or dimensionality reduction.
- Regularization: Apply L1 (Lasso) or L2 (Ridge) regression.
Deciding Features to Keep or Eliminate After Multicollinearity Test
- Drop features with high VIF values or strong correlations.
- Keep features with higher importance based on domain knowledge or model interpretability.
Explain Logistic Regression
- Definition: A linear model for binary classification (a classification algorithm, despite the name).
- Mechanism: Applies the sigmoid function to a linear combination of the features to produce a probability.
- Output: Classifies based on a probability threshold (e.g., 0.5).
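A minimal sketch of the mechanism with NumPy; the weights, bias, and input below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1  # hypothetical learned weights and bias
x = np.array([2.0, 1.0])           # one input example

p = sigmoid(w @ x + b)             # predicted probability of class 1
label = int(p >= 0.5)              # threshold at 0.5
print(p, label)                    # ~0.786 -> class 1
```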
Why Use Log Loss in Logistic Regression?
- Log Loss penalizes confident wrong predictions heavily, so minimizing it pushes the model toward well-calibrated probabilities; it is also the negative log-likelihood of the Bernoulli model, giving logistic regression a convex optimization problem.
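A worked example of the penalty, using the single-sample formula −[y·log(p) + (1−y)·log(1−p)]:

```python
import numpy as np

def log_loss_single(y, p):
    # y is the true label (0 or 1), p the predicted probability of class 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss_single(1, 0.9))   # ~0.105: confident and correct
print(log_loss_single(1, 0.5))   # ~0.693: uncertain
print(log_loss_single(1, 0.01))  # ~4.605: confident and wrong
```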
P-value and Its Significance
- Definition: The probability of observing results at least as extreme as those measured, assuming the null hypothesis is true.
- Significance: If p-value < α, reject the null hypothesis.
Splitting Time Series Data and Evaluation Metrics
- Split: Use time-based splits (e.g., train on earlier dates, test on later ones); never shuffle, since that leaks future information into training.
- Metrics: Use MAE, RMSE, or MAPE for evaluation.
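A minimal sketch of a time-based split using scikit-learn's TimeSeriesSplit, where each fold trains on the past and validates on the immediate future:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # placeholder: 20 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # The test window always comes strictly after the training window
    print("train ends at", train_idx[-1], "| test:", test_idx)
```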
Model Deployment in Production
- Use tools like Flask, FastAPI, or cloud services (AWS, GCP).
- Retraining: Retrain based on data drift or periodic intervals.
XGBoost Splits Decision
- Splits are chosen by maximizing the gain in XGBoost’s regularized objective, computed from the first- and second-order gradients of the loss at each node, not by Gini Index or Entropy.
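For reference, the split gain from the XGBoost paper, where G and H are the sums of first and second derivatives of the loss over the instances in the left/right child, λ is the L2 term, and γ the split penalty:
- Gain = 1/2 × [G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ)] − γ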
Explain Precision and Recall
- Precision: TP / (TP + FP). Focuses on reducing false positives.
- Recall: TP / (TP + FN). Focuses on reducing false negatives.
- Usage:
- Use precision for scenarios where false positives are costly (e.g., spam detection).
- Use recall for scenarios where false negatives are costly (e.g., medical diagnoses).
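A worked example of both formulas with scikit-learn on a small hypothetical label set:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # TP=2, FN=1, FP=1, TN=4

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.667
```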
What is the Activation Function?
- Definition: Non-linear functions applied to a neuron’s weighted input, letting the network model non-linear relationships (e.g., ReLU, Sigmoid).
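A minimal NumPy sketch of the two examples named above:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)          # clips negative inputs to zero

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes inputs into (0, 1)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z))
```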
Explain Naive Bayes
- Definition: Probabilistic classifier based on Bayes’ theorem.
- Assumption: Features are conditionally independent.
- Usage: Text classification (e.g., spam detection).
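A minimal text-classification sketch with scikit-learn's MultinomialNB; the corpus and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win cash now", "meeting at noon", "cash prize win", "lunch meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (hypothetical)

vec = CountVectorizer()
X = vec.fit_transform(texts)            # word-count features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free cash win"])))  # -> [1] (spam)
```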
Confusion Matrix
- Matrix showing TP, FP, FN, and TN.
- Used for evaluating classification models.
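A quick sketch with scikit-learn, reusing the hypothetical labels from the precision/recall example above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))
# Rows are true labels, columns predicted:
# [[TN FP]    [[4 1]
#  [FN TP]] =  [1 2]]
```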
Hyperparameters in Deep Learning
- Examples: Learning rate, batch size, number of layers, dropout rate.
What is SMOTE?
- Definition: Synthetic Minority Over-sampling Technique generates synthetic samples to balance classes.
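A minimal sketch, assuming the imbalanced-learn package is installed; the dataset is a random 90/10 placeholder:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))                  # imbalanced: roughly 900 vs 100

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))              # balanced after synthetic minority samples
```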
Do you have experience working on Time Series?
- Time series analysis involves ARIMA, SARIMA, or Prophet models, using train-test splits without shuffling.
Code Analysis of Global Variable
- Global variables create hidden shared state: any function can read or mutate them, which leads to unintended side effects and order-dependent bugs, so they should be minimized or passed explicitly (see the sketch below).
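A hypothetical Python illustration of the hazard: two functions share one mutable global, so correctness depends entirely on call order:

```python
counter = 0                        # module-level global state

def process_item():
    global counter                 # rebinding a global requires this keyword
    counter += 1                   # hidden side effect on shared state

def reset_and_count(items):
    global counter
    counter = 0                    # must remember to reset, or counts accumulate
    for _ in items:
        process_item()
    return counter

print(reset_and_count([1, 2, 3]))  # 3, but only because the reset ran first
print(counter)                     # leftover state (3) leaks to later callers
```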
What is L1 and L2 Regularization?
- L1 Regularization: Penalizes the absolute values of coefficients, leading to sparsity (feature selection).
- L2 Regularization: Penalizes the squared values of coefficients, reducing overfitting.
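A minimal sketch of both penalties with scikit-learn on placeholder regression data: Lasso (L1) zeroes out some coefficients, while Ridge (L2) only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Placeholder data: only 3 of 10 features actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # several exact zeros (sparsity / feature selection)
print(np.sum(ridge.coef_ == 0))  # typically none; coefficients just shrink
```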
What is Multicollinearity?
- High correlation between independent variables, making it hard to interpret model coefficients.
What is Entropy and Information Gain?
- Entropy: Measures randomness or impurity in data.
- Information Gain: Reduction in entropy after a split in decision trees.
Steps in Making a Decision Tree
- Calculate entropy for the dataset.
- Compute information gain for all features.
- Split the data on the feature with the highest information gain.
- Repeat recursively until leaf nodes are pure or a stopping criterion is met.
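A worked entropy / information-gain computation for a toy split, matching the definitions and steps above:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 4/4 split: H = 1.0 bit
left, right = parent[:3], parent[3:]         # candidate split: [1,1,1] vs [1,0,0,0,0]

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent) - weighted)            # information gain (~0.55 bits)
```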
Techniques Used for Sampling
- Random Sampling
- Stratified Sampling
- Systematic Sampling
- Cluster Sampling
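A minimal sketch of stratified sampling with scikit-learn's train_test_split, which preserves the class proportions of y in both partitions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder 80/20 dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_tr.mean(), y_te.mean())  # roughly equal minority fractions in both splits
```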