1. What is a decision tree?

A decision tree is a supervised learning algorithm used for classification and regression tasks. 

It works by splitting data into branches based on feature values, forming a tree-like structure. 

Each internal node represents a test on a feature (attribute), each branch represents an outcome of that test, and leaf nodes hold the final predictions. 

Decision trees are easy to interpret but can overfit, which can be mitigated using techniques like pruning.
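
As a minimal sketch, assuming scikit-learn is available (the iris dataset and the max_depth value are illustrative choices, not part of the definition):

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  # max_depth caps tree growth -- a simple form of pre-pruning against overfitting
  clf = DecisionTreeClassifier(max_depth=3, random_state=0)
  clf.fit(X, y)
  print(clf.predict(X[:5]))  # predicted class labels for the first five rows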


2. What is NLP?

Natural Language Processing (NLP) is a field of artificial intelligence focused on the interaction between computers and human language. 

Its goal is to enable machines to understand, interpret, and respond to text or speech data effectively. 

Common NLP tasks include sentiment analysis, machine translation, named entity recognition, and text summarization.
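
As a toy illustration in plain Python (the word lists are invented, and real NLP systems learn from data rather than relying on fixed lexicons), a naive sentiment scorer might look like this:

  # Toy lexicon-based sentiment scoring -- a drastic simplification of real NLP
  POSITIVE = {"good", "great", "excellent"}
  NEGATIVE = {"bad", "poor", "terrible"}

  def sentiment(text: str) -> str:
      tokens = text.lower().split()  # naive whitespace tokenization
      score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
      return "positive" if score > 0 else "negative" if score < 0 else "neutral"

  print(sentiment("The movie was great"))  # positive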


3. What is cross-validation?

Cross-validation is a statistical method used to assess the performance of a machine learning model. 

It involves splitting the dataset into training and validation sets multiple times to evaluate how well the model generalizes to unseen data. 

The most common type is k-fold cross-validation, where the data is divided into k subsets, and the model is trained and tested k times, each time using a different subset as the validation set.
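
A minimal 5-fold sketch, assuming scikit-learn (the model and dataset are placeholders):

  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = load_iris(return_X_y=True)
  # cv=5: five train/validation splits, one accuracy score per held-out fold
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
  print(scores.mean(), scores.std())  # average performance and its spread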


4. What is statistical power?

Statistical power is the probability that a statistical test will correctly reject the null hypothesis when the alternative hypothesis is true. 

It indicates the likelihood of detecting an effect if there is one. Power is influenced by the sample size, effect size, significance level, and variance in the data. 

High power reduces the risk of Type II errors (false negatives).
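
A hedged sketch using statsmodels (the effect size, alpha, and group size are illustrative numbers):

  from statsmodels.stats.power import TTestIndPower

  analysis = TTestIndPower()
  # Power of a two-sample t-test: Cohen's d = 0.5, alpha = 0.05, n = 64 per group
  print(analysis.power(effect_size=0.5, nobs1=64, alpha=0.05))  # ~0.80
  # Or invert the question: sample size per group needed to reach 80% power
  print(analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05))  # ~64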


5. What is selection bias?

Selection bias occurs when the sample used in a study is not representative of the population being analyzed. 

This bias can distort the results and lead to invalid conclusions. It can arise due to non-random sampling, dropouts, or data collection methods. 

To minimize selection bias, random sampling or stratified sampling methods are often used.
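
One small library-level illustration, assuming scikit-learn: stratify=y keeps class proportions identical across splits, guarding against one form of selection bias when partitioning data:

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split

  X, y = load_iris(return_X_y=True)
  # Without stratify, a random split can over- or under-represent a class
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=0)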


6. What is logistic regression?

Logistic regression is a statistical method used for binary classification problems. 

It predicts the probability that an outcome belongs to one of two classes by using the logistic (sigmoid) function to model the relationship between the independent variables and the dependent variable. 

The output is a value between 0 and 1, which can be thresholded to assign a class label.
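
A minimal sketch, assuming scikit-learn and a synthetic dataset (the 0.5 threshold is the common default, not a requirement):

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=200, random_state=0)
  model = LogisticRegression().fit(X, y)

  proba = model.predict_proba(X[:5])[:, 1]   # P(class = 1), values in (0, 1)
  labels = (proba >= 0.5).astype(int)        # threshold to assign class labels
  print(proba, labels)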


7. Define the term deep learning.

Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (deep architectures) to model complex patterns in data. 

It excels in tasks like image recognition, natural language processing, and speech recognition. 

Deep learning models learn hierarchical feature representations, making them highly effective for unstructured data.
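
A minimal sketch of a deep architecture in PyTorch (the layer sizes are arbitrary; 784 matches a flattened 28x28 image):

  import torch
  import torch.nn as nn

  # Stacked layers: each hidden layer learns a higher-level representation
  model = nn.Sequential(
      nn.Linear(784, 128), nn.ReLU(),
      nn.Linear(128, 64), nn.ReLU(),
      nn.Linear(64, 10),   # e.g., ten digit classes
  )
  x = torch.randn(32, 784)   # a batch of 32 flattened images
  print(model(x).shape)      # torch.Size([32, 10])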


8. What is ensemble learning?

Ensemble learning is a technique that combines the predictions of multiple machine learning models to improve overall performance. 

The idea is to leverage the strengths of individual models to reduce errors and variability. Common ensemble methods include bagging (e.g., Random Forest) and boosting (e.g., Gradient Boosting).
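
A quick bagging sketch, assuming scikit-learn (Random Forest is the bagging example named above):

  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_iris(return_X_y=True)
  # Bagging: 100 trees fit on bootstrap samples; their votes are combined
  forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
  print(forest.predict(X[:5]))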


9. What is boosting?

Boosting is an ensemble learning technique that sequentially trains weak learners (often decision trees) and combines their predictions to create a strong learner. 

Each weak model focuses on correcting the errors of its predecessors, improving accuracy over iterations. 

Popular boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
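
A minimal gradient-boosting sketch, assuming scikit-learn (the dataset and hyperparameters are illustrative):

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import GradientBoostingClassifier

  X, y = load_breast_cancer(return_X_y=True)
  # Each new tree is fit to the errors of the ensemble built so far
  gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  random_state=0).fit(X, y)
  print(gb.score(X, y))  # accuracy of the combined (strong) learner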


10. What is multicollinearity, and how do you handle it?

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to determine the individual effect of each variable. 

It can inflate standard errors and reduce model interpretability. To address multicollinearity (a detection sketch follows the list), you can:

  • Remove one of the correlated variables.
  • Use dimensionality reduction techniques like PCA.
  • Apply regularization methods like Ridge Regression.
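
Before choosing a remedy, it helps to detect the problem. A common diagnostic is the variance inflation factor (VIF); here is a sketch assuming statsmodels and pandas, with synthetic data deliberately built to be nearly collinear:

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  rng = np.random.default_rng(0)
  x1 = rng.normal(size=100)
  x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
  X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2,
                                    "x3": rng.normal(size=100)}))

  # Rule of thumb: VIF above ~5-10 signals problematic multicollinearity
  for i, col in enumerate(X.columns):
      if col != "const":
          print(col, variance_inflation_factor(X.values, i))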

11. What are recommender systems?

Recommender systems are algorithms designed to suggest relevant items to users based on their preferences or behavior. 

They are widely used in e-commerce, streaming services, and social media. There are two main types (the content-based variant is sketched after the list):

  • Collaborative filtering: Based on user-item interactions.
  • Content-based filtering: Based on item features.
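
A toy content-based sketch in plain NumPy (the item matrix and its binary "genre" features are invented for illustration):

  import numpy as np

  # Rows are items; columns are hypothetical binary features (e.g., genres)
  items = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 1, 1]], dtype=float)
  liked = items[0]  # an item the user already liked

  def cosine(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  scores = np.array([cosine(liked, item) for item in items])
  print(np.argsort(scores)[::-1])  # items ranked by similarity to the liked one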

12. What are feature vectors?

Feature vectors are numerical representations of data points in a multidimensional space, used as input to machine learning models. 

Each element of the vector represents a feature or attribute of the data. Converting raw inputs, whether structured records or unstructured text and images, into feature vectors is what lets models process them effectively.
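
A small illustration (the house record and its fields are invented):

  import numpy as np

  # A raw record encoded as [square_metres, bedrooms, has_garden]
  house = {"square_metres": 85.0, "bedrooms": 3, "has_garden": True}
  x = np.array([house["square_metres"],
                house["bedrooms"],
                float(house["has_garden"])])
  print(x)  # [85.  3.  1.] -- one point in a 3-dimensional feature space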


13. What is collaborative filtering?

Collaborative filtering is a recommendation technique that relies on user-item interactions to suggest items. 

It assumes that users with similar preferences will like similar items. Collaborative filtering can be (the item-based variant is sketched after the list):

  • User-based: Finds users with similar behavior.
  • Item-based: Finds items that are often consumed together.
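
A toy item-based sketch in NumPy (the ratings are invented; 0 means unrated, and real systems handle missing ratings more carefully):

  import numpy as np

  # User-item rating matrix: rows are users, columns are items, 0 = unrated
  R = np.array([[5, 4, 0, 1],
                [4, 5, 1, 0],
                [1, 0, 5, 4]], dtype=float)

  def cosine(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

  n_items = R.shape[1]
  # Item-based similarity: compare item columns across all users
  sim = np.array([[cosine(R[:, i], R[:, j]) for j in range(n_items)]
                  for i in range(n_items)])
  print(np.round(sim, 2))  # items 0 and 1 score as similar: rated together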

14. What is the goal of A/B testing?

The goal of A/B testing is to compare two or more variants of a webpage, feature, or process to determine which performs better. 

It involves splitting users into groups and analyzing metrics like click-through rates, conversions, or engagement to make data-driven decisions.
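
A minimal significance check, assuming statsmodels (the conversion counts are invented): a two-proportion z-test compares conversion rates between the variants:

  from statsmodels.stats.proportion import proportions_ztest

  # Variant A: 120 conversions / 2400 users; variant B: 150 / 2400
  stat, p_value = proportions_ztest(count=[120, 150], nobs=[2400, 2400])
  print(p_value)  # a small p-value suggests the rates genuinely differ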


15. What is recall?

Recall is a performance metric in classification that measures the proportion of actual positive cases correctly identified by the model. It is calculated as:

  Recall = True Positives / (True Positives + False Negatives)

High recall indicates that the model is effective at identifying positive cases.
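
Worked on a tiny invented example, assuming scikit-learn (recall_score applies the same formula):

  from sklearn.metrics import recall_score

  y_true = [1, 1, 1, 0, 0, 1]
  y_pred = [1, 0, 1, 0, 1, 1]
  # 3 true positives, 1 false negative -> recall = 3 / (3 + 1) = 0.75
  print(recall_score(y_true, y_pred))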


16. What are data science tools?

Data science tools are software and frameworks used to analyze, process, and visualize data. Common tools include:

  • Programming languages: Python, R
  • Libraries: Pandas, NumPy, TensorFlow, PyTorch
  • Platforms: Jupyter, Google Colab
  • Visualization tools: Matplotlib, Tableau, Power BI
  • Big data tools: Hadoop, Spark

17. What is survivorship bias?

Survivorship bias is a type of sampling bias where only the "survivors" or successful cases are analyzed, ignoring those that did not make it. 

This can lead to overly optimistic conclusions. For example, studying only successful startups without considering failed ones can distort the understanding of success factors.