1. Difference between WHERE and HAVING in SQL

  • WHERE: Filters rows before grouping.
  • HAVING: Filters groups after grouping (used with aggregate functions).
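
A quick way to see the difference is to run both clauses on a toy table. Below is a minimal sketch using Python's built-in sqlite3; the orders table and its values are made up for illustration:

```python
import sqlite3

# In-memory database with a tiny (hypothetical) orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 100), ("East", 250), ("West", 80), ("West", 40)],
)

# WHERE filters individual rows *before* grouping;
# HAVING filters the aggregated groups *after* grouping.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount > 50          -- drops the 40 row before grouping
    GROUP BY region
    HAVING SUM(amount) > 200   -- then keeps only high-total groups
""").fetchall()
print(rows)  # [('East', 350.0)]
```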

2. Basics of Logistic Regression

Logistic Regression predicts probabilities for a binary outcome using the sigmoid function. It models the log-odds (the logit) as a linear combination of the features and fits the coefficients by maximum likelihood estimation.
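
A minimal NumPy sketch of the idea, with made-up coefficients standing in for a fitted model:

```python
import numpy as np

def sigmoid(z):
    # Maps log-odds (any real number) to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients: log-odds = b0 + b1 * x.
b0, b1 = -1.0, 0.8
x = 2.0
log_odds = b0 + b1 * x   # linear in the features
p = sigmoid(log_odds)    # predicted probability of the positive class
print(round(p, 3))       # 0.646
```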


3. How do you treat outliers?

  • Identify using methods like IQR or Z-scores.
  • Handle by:
    • Removing (if they are errors or noise).
    • Transforming (e.g., log transformation).
    • Capping values at a threshold.
    • Using robust models.
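
For example, the IQR rule from the first bullet, sketched in Python on made-up data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10])  # 95 looks suspicious

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
capped = np.clip(data, lower, upper)  # the "capping" option from above
print(outliers)  # [95]
```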

4. Explain Confusion Matrix

A confusion matrix summarizes prediction results:

  • True Positive (TP): Correct positive prediction.
  • False Positive (FP): Incorrect positive prediction.
  • True Negative (TN): Correct negative prediction.
  • False Negative (FN): Incorrect negative prediction.
Metrics derived from these four counts: Accuracy, Precision, Recall, F1-score.
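
A quick illustration with scikit-learn (the y_true and y_pred vectors are made up):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# scikit-learn's convention: rows = actual, columns = predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```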

5. Explain PCA

Principal Component Analysis reduces dimensionality by transforming data into a new coordinate system:

  • Covariance Matrix: Measures variance and relationships between features.
  • Eigenvalues: Magnitudes of variance captured by principal components.
  • Eigenvectors: Directions of principal components.
  • Steps: Compute covariance matrix, find eigenvalues/eigenvectors, sort by eigenvalue, and transform data.
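
The steps above map directly onto a few lines of NumPy (random data used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features

# Step 1: center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: eigen-decompose (eigh, since covariance matrices are symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by descending eigenvalue (variance captured).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top-2 principal components.
X_reduced = Xc @ eigvecs[:, :2]
print(X_reduced.shape)  # (100, 2)
```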

6. Cut a Cake into 8 Equal Parts Using 3 Cuts

  1. First cut horizontally, slicing the cake into two layers.
  2. Make a vertical cut through the center, dividing it into four pieces.
  3. A perpendicular vertical cut divides it into eight equal pieces.

7. Explain k-means Clustering

  • Unsupervised algorithm to group data into k clusters based on similarity.
  • Steps:
    1. Initialize k centroids.
    2. Assign points to nearest centroid.
    3. Recalculate centroids.
    4. Repeat until convergence.
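
A minimal NumPy sketch of these four steps; it assumes every cluster keeps at least one point, which a real implementation (e.g., scikit-learn's KMeans) handles more robustly:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids from randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its cluster
        # (assumes no cluster goes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the centroids stop moving (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```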

8. Difference Between KNN and k-means Clustering

  • KNN (K-Nearest Neighbors): Supervised; classifies a point by majority vote among its k nearest labeled neighbors.
  • k-means: Unsupervised, clusters data based on similarity.
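
The contrast shows up directly in the scikit-learn API: KNN needs labels to fit, k-means does not (the toy points below are made up):

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 1], [8, 9], [9, 8]]
y = [0, 0, 1, 1]  # labels exist only for the supervised method

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # supervised: needs y
print(knn.predict([[8, 8]]))                         # [1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: no y
print(km.labels_)
```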

9. Handle Imbalanced Dataset

  • Resampling techniques:
    • Oversampling (e.g., SMOTE).
    • Undersampling.
  • Use metrics like Precision, Recall, F1-score.
  • Algorithm adjustments: Use weighted loss functions.
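
As one example of an algorithm adjustment, scikit-learn estimators accept class_weight='balanced', which reweights the loss inversely to class frequency (the tiny dataset below is made up; SMOTE-style oversampling would come from a separate library such as imbalanced-learn):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy imbalanced data: positives are the minority class.
X = [[0], [1], [0], [0], [1], [0], [0], [0], [0], [1]]
y = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1]

# Misclassifying a rare positive now costs more than a common negative.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(classification_report(y, clf.predict(X)))
```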

10. Stock Market Prediction: Classification or Regression?

  • Classification: used when predicting a discrete outcome, e.g., whether bankruptcy will occur (Yes/No).
  • Regression: used when predicting a continuous value, e.g., the stock price itself.

11. Key Performance Indicators for a Product

  • Customer Satisfaction Score (CSAT).
  • Net Promoter Score (NPS).
  • Conversion Rate.
  • Retention Rate.
  • Revenue Growth.

12. Technique for Predicting Categorical Responses

  • Logistic Regression.
  • Decision Trees.
  • Naive Bayes.

13. What is Logistic Regression?

Logistic Regression predicts the probability of a categorical outcome using the sigmoid function.
Example: Predicting if a customer will churn or not.


14. Importance of Data Cleaning

  • Removes inconsistencies, duplicates, and errors.
  • Improves data quality for better model performance.
  • Reduces bias and ensures accuracy.
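
A typical cleaning pass in pandas might look like this (the DataFrame and the 0–120 age rule are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 25, None, 40, 200],  # duplicate, missing, and implausible values
    "city": ["NY", "NY", "LA", "SF", "LA"],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
df = df[df["age"].between(0, 120)]                # drop implausible ages
print(df)
```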

15. Normal Distribution

  • A symmetric, bell-shaped curve.
  • Mean = Median = Mode.
  • Defined by mean (μ) and standard deviation (σ).
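
Its probability density function is the standard formula:

    f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}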

16. Cross-Validation

  • Technique to assess model performance by splitting data into training and validation sets.
  • Popular method: K-Fold Cross-Validation.
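
For example, 5-fold cross-validation in scikit-learn (using the bundled iris dataset for convenience):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train on 4 folds, validate on the held-out fold, rotate 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```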

17. Variants of Back Propagation

  • Stochastic Gradient Descent (SGD).
  • Mini-batch Gradient Descent.
  • Adaptive methods (e.g., Adam, RMSProp).
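
A bare-bones mini-batch SGD loop on a least-squares problem shows the core update that the adaptive methods build on (synthetic data; Adam and RMSProp would additionally adapt the step size per parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=256)

w, lr, batch = np.zeros(3), 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))           # reshuffle each epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        # Gradient of mean squared error, computed on the mini-batch only.
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad                      # the SGD update
print(w)  # close to [1.0, -2.0, 0.5]
```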

18. What is a Random Forest?

  • Ensemble learning method using multiple decision trees.
  • Aggregates results via majority voting (classification) or averaging (regression).
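
A minimal scikit-learn example (iris again, for convenience):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each grown on a bootstrap sample with random feature subsets;
# the final class is the majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:3]))
```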

19. Collaborative Filtering

  • Recommender system technique.
  • Types:
    • User-based: Finds similar users.
    • Item-based: Finds similar items.
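
A sketch of the item-based variant using cosine similarity in NumPy; the ratings matrix is made up, and treating unrated items as zero is a simplification real systems avoid:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated" (hypothetical ratings).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Item-based: cosine similarity between item (column) vectors.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Predict user 0's rating for item 2 as a similarity-weighted average
# of the items that user has actually rated.
user = R[0]
rated = user > 0
pred = sim[2, rated] @ user[rated] / sim[2, rated].sum()
print(round(pred, 2))
```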

20. Interpolation and Extrapolation

  • Interpolation: Estimating within the range of known data points.
  • Extrapolation: Predicting outside the known range.
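
np.interp makes the distinction concrete: inside the known range it estimates between points, while outside it simply clamps to the endpoint value, a reminder that extrapolation requires an explicit model of the trend:

```python
import numpy as np

x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([0.0, 2.0, 4.0, 6.0])  # underlying trend: y = 2x

# Interpolation: 1.5 lies inside [0, 3], so the estimate is well supported.
print(np.interp(1.5, x_known, y_known))   # 3.0

# Extrapolation: 5.0 lies outside the range; np.interp just clamps to 6.0,
# and any trend-based guess (here, 10.0) would carry extra risk.
print(np.interp(5.0, x_known, y_known))   # 6.0
```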

21. Power Analysis

  • Determines sample size needed for detecting an effect.
  • Factors: Effect size, significance level (α), and power (1 − β).
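
If statsmodels is available, its power module solves for the sample size directly (the medium effect size of 0.5 and the conventional 0.8 power target are example values):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group for a two-sample t-test detecting Cohen's d = 0.5
# at significance level 0.05 with power 0.8.
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n))  # roughly 64 per group
```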

22. Difference Between Cluster and Systematic Sampling

  • Cluster Sampling: Randomly selects groups, then samples within them.
  • Systematic Sampling: Selects every n-th item in an ordered list.
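
A toy NumPy sketch of both schemes on a population of 100 items (the cluster layout and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(100)

# Systematic: every n-th item after a random start.
n = 10
start = rng.integers(n)
systematic = population[start::n]

# Cluster: split into 10 groups of 10, then sample 3 whole groups.
clusters = population.reshape(10, 10)
chosen = rng.choice(10, size=3, replace=False)
cluster_sample = clusters[chosen].ravel()

print(systematic)
print(cluster_sample)
```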

23. Are Expected Value and Mean Value Different?

  • Same for a probability distribution.
  • Mean refers to empirical data; Expected Value is theoretical.

24. Box-Cox Transformation for Normality

  • Applies a power transformation:
    y' = \frac{y^\lambda - 1}{\lambda} \quad (\lambda \neq 0), \qquad y' = \ln y \quad (\lambda = 0)
  • Stabilizes variance and normalizes data.
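
SciPy implements this directly, estimating the best λ by maximum likelihood (the exponential sample below is just a convenient right-skewed example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)  # strictly positive, right-skewed

# boxcox returns the transformed data and the fitted lambda.
transformed, lam = stats.boxcox(skewed)
print(lam)
print(stats.skew(skewed), "->", stats.skew(transformed))  # skew shrinks toward 0
```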

25. Eigenvalue and Eigenvector

  • Eigenvalue: Magnitude of a transformation.
  • Eigenvector: Direction unaffected by transformation.
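
The defining relation A v = λ v is easy to verify with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)
lam, v = eigvals[0], eigvecs[:, 0]

# A stretches v by the factor lam without changing its direction.
print(np.allclose(A @ v, lam * v))  # True
```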

26. Do Gradient Descent Methods Always Converge?

No, convergence depends on factors like:

  • Learning rate: too large can oscillate or diverge; too small converges slowly.
  • Loss surface: non-convex losses have local minima and saddle points where descent can stall.
  • Initialization: a poor starting point can slow or derail training.
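
A two-line experiment on f(x) = x², whose gradient is 2x, makes the learning-rate point concrete (for this function the update contracts only when 0 < lr < 1):

```python
# Gradient descent on f(x) = x^2: x <- x - lr * 2x = x * (1 - 2*lr).
def run(lr, steps=20, x=1.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(run(lr=0.1))  # shrinks toward 0 (converges)
print(run(lr=1.1))  # grows without bound (diverges)
```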