1. Formula representation of the problem in which 5 pens and 2 pencils together cost 20 units:
Let $x$ be the price of one pen and $y$ be the price of one pencil.
The equation can be represented as: $5x + 2y = 20$
2. What does a linear equation having 3 variables represent?
A linear equation in three variables represents a plane in a 3D space.
The solution to this equation corresponds to all points that lie on this plane.
3. How can you calculate the probability of seeing a shooting star within a given timeframe?
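If shooting stars are observed at an average rate of $\lambda$ stars per unit time, the probability of seeing exactly $k$ of them in time $t$ can be modelled using the Poisson distribution: $P(k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$
For at least one shooting star in time $t$: $P(k \ge 1) = 1 - P(0) = 1 - e^{-\lambda t}$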
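A minimal Python sketch of the Poisson calculation above. Here `rate` is the average number of shooting stars per unit time and `duration` is the length of the observation window; the values 0.2 per hour and 5 hours are made up for illustration.

```python
import math

def prob_k_events(rate: float, duration: float, k: int) -> float:
    """Poisson probability of exactly k events within `duration`."""
    mean = rate * duration  # expected number of events in the window
    return mean ** k * math.exp(-mean) / math.factorial(k)

def prob_at_least_one(rate: float, duration: float) -> float:
    """Probability of one or more events: 1 - P(0 events)."""
    return 1.0 - math.exp(-rate * duration)

# Hypothetical numbers: 0.2 stars per hour on average, watching for 5 hours
p = prob_at_least_one(rate=0.2, duration=5.0)
print(round(p, 3))
```

With these assumed numbers the expected count is 1, so the result is $1 - e^{-1} \approx 0.632$.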
4. Which method depicts hierarchical data in a nested format?
Dendrograms are used in hierarchical clustering to represent hierarchical data visually.
JSON and XML formats are common for nested hierarchical data structures in computer science.
5. How do you estimate the probability of getting a head in a coin toss after observing 10 heads?
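Using a Bayesian approach, start with a prior belief about the fairness of the coin (e.g., $\text{Beta}(\alpha, \beta)$).
After observing 10 heads (and no tails), update the belief using the Beta distribution: $\text{Beta}(\alpha + 10, \beta + 0)$
With a uniform prior $\text{Beta}(1, 1)$, the posterior becomes $\text{Beta}(11, 1)$, and the mean probability is: $\frac{11}{11 + 1} = \frac{11}{12} \approx 0.917$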
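The Beta update can be sketched in a few lines of Python. This assumes all 10 observed tosses were heads (0 tails) and a uniform Beta(1, 1) prior; the function names are illustrative, not from any library.

```python
from typing import Tuple

def beta_posterior(alpha: float, beta: float, heads: int, tails: int) -> Tuple[float, float]:
    """Conjugate update: add observed heads to alpha and tails to beta."""
    return alpha + heads, beta + tails

def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Assumption: uniform prior Beta(1, 1), 10 heads and 0 tails observed
a, b = beta_posterior(1, 1, heads=10, tails=0)
posterior_mean = beta_mean(a, b)  # 11 / 12
```

A stronger prior toward fairness (say Beta(50, 50), also an assumption) would pull the estimate much closer to 0.5 after the same 10 tosses.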
6. What does it mean when p-values are high and low?
High p-value (e.g., $p > 0.05$): Weak evidence against the null hypothesis; fail to reject $H_0$.
Low p-value (e.g., $p \le 0.05$): Strong evidence against $H_0$; reject $H_0$.
7. How would you handle training a model on data that exceeds available RAM capacity?
Use techniques like:
Data sampling: Train on a representative subset.
Batch processing: Load and process data in chunks.
Distributed computing: Use tools like Apache Spark or Dask.
Out-of-core learning: Algorithms like those in scikit-learn that train incrementally on batches.
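As a rough illustration of batch processing and incremental training, the sketch below streams data in chunks and takes one gradient step per chunk, so only one chunk is in memory at a time. The generator `stream_chunks` is a hypothetical stand-in for reading a large file piece by piece, and the hand-rolled `partial_fit` only mirrors the idea behind incremental estimators; it is not scikit-learn's API.

```python
import random
from typing import Iterator, List, Tuple

Row = Tuple[float, float]  # (feature x, target y)

def stream_chunks(n_rows: int, chunk_size: int, seed: int = 0) -> Iterator[List[Row]]:
    """Yield synthetic rows chunk by chunk (stand-in for chunked file reads)."""
    rng = random.Random(seed)
    chunk: List[Row] = []
    for _ in range(n_rows):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + 2.0 + rng.gauss(0.0, 0.1)  # assumed ground truth: y = 3x + 2
        chunk.append((x, y))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller chunk

def partial_fit(w: float, b: float, chunk: List[Row], lr: float = 0.1) -> Tuple[float, float]:
    """One gradient step on mean squared error using only the current chunk."""
    n = len(chunk)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in chunk) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in chunk) / n
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for chunk in stream_chunks(n_rows=20_000, chunk_size=100):
    w, b = partial_fit(w, b, chunk)  # the model never sees the full dataset at once
```

After the loop, `w` and `b` should sit close to the assumed slope 3 and intercept 2. The same pattern applies when the chunks come from disk or a database cursor.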
8. When is a false positive more important than a false negative?
When the cost of acting unnecessarily is higher than the cost of not acting:
Example: A spam filter that marks a critical email as spam (FP) does more harm than one that lets some spam emails through (FN).
9. What defines the analysis of data objects not complying with general data behaviour?
Anomaly detection or outlier analysis identifies data points that deviate significantly from the norm.
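One simple way to flag such points is the interquartile-range (IQR) rule, sketched below with Python's standard library. The 1.5 multiplier is the common convention, and the readings are made-up sample data.

```python
import statistics
from typing import List

def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sensor readings with one obvious anomaly
readings = [10, 12, 11, 13, 12, 11, 10, 95, 12, 11]
outliers = iqr_outliers(readings)
```

Here only the reading 95 falls outside the fences. The IQR rule is one of several options; z-scores or model-based detectors are alternatives when the data are not roughly unimodal.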
10. What is the difference between data analytics and data science?
Data Analytics: Focuses on analyzing existing data to find patterns, trends, and actionable insights.
Data Science: Encompasses data analytics but also includes data engineering, machine learning, and predictive modelling.
11. How should one approach solving a data analytics-based project?
Steps:
Understand the problem and define objectives.
Collect and preprocess data.
Perform exploratory data analysis (EDA).
Choose appropriate analytical methods or models.
Evaluate results and refine.
Communicate insights effectively.
12. How should one handle a dataset with variables having more than 30% missing values?
Techniques:
Drop variables if they lack significance or have excessive missing data.
Impute values using mean, median, mode, or predictive models like k-NN or MICE.
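Use models robust to missing data (e.g., tree-based implementations that handle missing values natively, such as XGBoost or scikit-learn's HistGradientBoosting estimators).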
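A minimal sketch of the first two techniques: drop any column whose share of missing values exceeds a threshold, and fill the remaining gaps with the column median. The 30% cut-off, the `None` marker for missing values, and the toy table are assumptions for illustration; in practice the decision to drop should also weigh how important the variable is.

```python
from statistics import median
from typing import Dict, List, Optional

Column = List[Optional[float]]

def missing_fraction(col: Column) -> float:
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in col) / len(col)

def drop_or_impute(table: Dict[str, Column], max_missing: float = 0.30) -> Dict[str, List[float]]:
    """Drop columns that are too sparse; median-impute the rest."""
    cleaned: Dict[str, List[float]] = {}
    for name, col in table.items():
        if missing_fraction(col) > max_missing:
            continue  # too many gaps: drop the column
        fill = median(v for v in col if v is not None)
        cleaned[name] = [fill if v is None else v for v in col]
    return cleaned

# Hypothetical table: "age" is 20% missing, "income" is 60% missing
table = {
    "age": [25.0, None, 31.0, 40.0, 28.0],
    "income": [None, None, 52.0, None, 61.0],
}
cleaned = drop_or_impute(table)
```

In this example `income` is dropped and the missing `age` is filled with 29.5, the median of the observed ages.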
13. What does the ROC curve represent, and how is it created?
ROC (Receiver Operating Characteristic) Curve:
Plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
AUC (Area Under Curve) indicates model performance:
$\text{AUC} = 1$: Perfect classifier.
$\text{AUC} = 0.5$: Random guessing.
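To make the construction concrete, the sketch below builds the curve by sweeping the decision threshold over every distinct score and computes the area with the trapezoidal rule. It assumes binary labels (1 = positive) with both classes present; the labels and scores are made up.

```python
from typing import List, Tuple

def roc_points(y_true: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs, one per threshold, starting from (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels and classifier scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

For this toy data the area is 8/9, which matches the fraction of positive/negative pairs that the scores rank correctly (8 of 9).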
14. How can you avoid overfitting your model?
Strategies:
Use cross-validation.
Limit model complexity (e.g., pruning trees, regularization).
Increase training data.
Use dropout (neural networks).
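Cross-validation, the first strategy listed, helps by scoring the model only on data it was not trained on. The sketch below shows just the splitting step for k-fold cross-validation; shuffling is left out for brevity, and `k_fold_indices` is an illustrative helper, not a library function.

```python
from typing import Iterator, List, Tuple

def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    # Fit on train_idx, evaluate on val_idx, then average the k scores
    print(len(train_idx), len(val_idx))
```

A large gap between training and validation scores across the folds is the usual sign of overfitting.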
15. Is a random forest better than a decision tree?
Generally, yes:
Random Forest reduces overfitting by aggregating results from multiple trees.
It increases stability and accuracy compared to a single decision tree.
16. How to check if the regression model fits the data well?
Metrics:
$R^2$ or Adjusted $R^2$.
Residual plots for randomness.
Compare predictions with actual data.
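The coefficient of determination and its adjusted form can be computed directly from predictions, as in this sketch. The four data points are made up, and `n_features` is the number of predictors in the model.

```python
from typing import List

def r_squared(y_true: List[float], y_pred: List[float]) -> float:
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2: float, n_samples: int, n_features: int) -> float:
    """Penalise R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical actual values and model predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=4, n_features=1)
```

Here the residual sum of squares is 0.10 against a total of 20, giving an R-squared of 0.995 and an adjusted value of 0.9925.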
17. What is the Central Limit Theorem?
Definition: For a large enough sample size, the sampling distribution of the mean of independent samples approaches a normal distribution, regardless of the shape of the population’s distribution, provided the population has finite variance.
Importance: It underpins hypothesis testing and confidence intervals.
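A quick simulation makes the theorem tangible. The sketch draws many samples from a strongly skewed exponential population (mean 1, standard deviation 1) and looks at the distribution of their means; the sample size of 50, the 2000 repetitions, and the seed are arbitrary choices.

```python
import random
import statistics
from typing import List

def sample_means(n_samples: int, sample_size: int, seed: int = 0) -> List[float]:
    """Mean of each of n_samples independent draws from an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=2000, sample_size=50)
centre = statistics.fmean(means)   # should be near the population mean, 1
spread = statistics.stdev(means)   # should be near 1 / sqrt(50), about 0.141
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is far from normal.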
18. Why is Naive Bayes so bad?
It assumes that features are conditionally independent given the class, which rarely holds true in practice.
To improve:
Use feature engineering to reduce dependencies.
Employ models like logistic regression or SVM.
19. How can you select $k$ for k-means?
Use the Elbow Method: Plot the sum of squared distances (inertia) against $k$. The "elbow" indicates the optimal $k$.
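The elbow method can be sketched with a small one-dimensional k-means written from scratch. The data are three made-up, well-separated groups, and the deterministic initialisation (centroids picked at evenly spaced positions in the sorted data) is a simplification chosen for reproducibility, not a standard seeding scheme.

```python
from typing import Dict, List

def kmeans_inertia(points: List[float], k: int, n_iter: int = 50) -> float:
    """Run Lloyd's algorithm in 1D and return the sum of squared distances."""
    ordered = sorted(points)
    # Deterministic initialisation: evenly spaced positions in the sorted data
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters: List[List[float]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

# Hypothetical data: three groups around 1, 5 and 9
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1, 9.0, 9.1, 8.8, 9.2]
inertias: Dict[int, float] = {k: kmeans_inertia(data, k) for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

For this data the inertia falls steeply up to three clusters and barely improves afterwards, so the elbow sits at three. On real data the bend is often less sharp, and the plot is usually read alongside other criteria such as the silhouette score.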