Week-6 (10/20) Working on Project.

Today I worked on my project's outline and mapped out the structure of the whole project.

Label Encoding: In label encoding, each category is assigned a unique numerical value. However, because it implies an ordinal relationship between the categories, it may not be appropriate for all algorithms. Scikit-learn provides LabelEncoder for this purpose.
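
A minimal sketch with scikit-learn's LabelEncoder; the DataFrame and the gender column here are hypothetical stand-ins for the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical example column; the real dataset would be loaded instead
df = pd.DataFrame({"gender": ["M", "F", "F", "M", "F"]})

le = LabelEncoder()
# categories are mapped to integers alphabetically: F -> 0, M -> 1
df["gender_code"] = le.fit_transform(df["gender"])
print(df)
```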

One-Hot Encoding: For each category, this approach generates a binary column. Each value is converted into a binary vector, with a 1 in the matching category's column and 0s everywhere else. This is more appropriate when there is no intrinsic order in the categories. One-hot encoding is made simple by libraries like pandas.
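
A minimal sketch of one-hot encoding with pandas; the color column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# one binary column per category: color_blue, color_green, color_red
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)
```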

Dummy Variables: One-hot encoding can introduce multicollinearity problems, because any one column can be predicted from the others. To avoid this, you can keep n−1 columns and drop one category as the reference, as sketched below.
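
The same pandas call with drop_first=True keeps n−1 columns:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# drop_first=True drops one category ("blue" here) as the reference level
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)  # only color_green and color_red remain
```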

Frequency Counts: Features can be created based on the frequency of each category. This is useful when how often each category occurs is meaningful information for your analysis.
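
A quick sketch of frequency encoding in pandas, again with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "NYC", "LA", "Boston"]})

# map each category to how often it occurs
freq = df["city"].value_counts()
df["city_freq"] = df["city"].map(freq)
print(df)  # NYC -> 3, LA -> 2, Boston -> 1
```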

Target Encoding (Mean Encoding): In some circumstances, each category can be replaced by the mean of the target variable for that category. This can help in regression problems; however, concerns such as target leakage and overfitting must be addressed.
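
A minimal sketch of mean encoding; the city and target columns are hypothetical, and the category means are computed on training data only to limit leakage:

```python
import pandas as pd

train = pd.DataFrame({"city": ["NYC", "LA", "NYC", "LA", "Boston"],
                      "target": [10, 20, 14, 18, 25]})

# compute the mean target per category on the TRAINING data only,
# then apply those means everywhere -- this limits target leakage
means = train.groupby("city")["target"].mean()
train["city_te"] = train["city"].map(means)
print(means)  # Boston 25.0, LA 19.0, NYC 12.0
```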

Missing Data: Determine how to handle missing categories. You can construct a separate "missing" category or impute missing values with the mode, the median (for ordinal data), or a specific value.
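
Both options in pandas, with a hypothetical race column:

```python
import pandas as pd

df = pd.DataFrame({"race": ["W", None, "B", None, "A"]})

# option 1: treat missing values as their own category
df["race_cat"] = df["race"].fillna("Missing")

# option 2: impute with the most frequent value (the mode)
df["race_mode"] = df["race"].fillna(df["race"].mode()[0])
print(df)
```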

Visualization: To show the distribution of categorical data, use plots such as bar charts, count plots, and pie charts. This can help you better understand the data.
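
For example, a bar chart of category counts with pandas and matplotlib (hypothetical column):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"gender": ["M", "F", "F", "M", "F"]})

# bar chart of how often each category occurs
df["gender"].value_counts().plot(kind="bar")
plt.title("Distribution of gender")
plt.show()
```
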
Statistical Tests: To test for independence between two categorical variables, use statistical tests such as the Chi-Square test or Fisher's exact test.
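
A sketch of a Chi-Square independence test with SciPy on a made-up contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender": ["M", "F", "F", "M", "F", "M"],
                   "armed":  ["yes", "no", "yes", "yes", "no", "no"]})

# build a contingency table, then test for independence
table = pd.crosstab(df["gender"], df["armed"])
chi2, p, dof, expected = chi2_contingency(table)
print(p)  # small p-value -> evidence the variables are not independent
```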

Feature Selection: In machine learning, feature-selection strategies may be needed to identify the most important categorical variables for your model.
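
One possible approach, sketched with scikit-learn's SelectKBest and the chi2 score (which requires non-negative features, satisfied by one-hot columns); the columns are hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# hypothetical one-hot-encoded features and a binary target
X = pd.DataFrame({"city_NYC": [1, 0, 1, 0], "city_LA": [0, 1, 0, 1],
                  "gender_M": [1, 1, 0, 0]})
y = [1, 0, 1, 0]

# keep the 2 columns most associated with the target
selector = SelectKBest(chi2, k=2)
X_best = selector.fit_transform(X, y)
print(selector.get_support())  # mask of the selected columns
```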

Tree-Based Models: Decision trees, random forests, and gradient-boosting models can often handle categorical data without one-hot encoding, partitioning nodes directly on categorical features. Support depends on the implementation: libraries like LightGBM and CatBoost handle categories natively, while scikit-learn's HistGradientBoosting models accept a categorical_features argument.
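
For instance, a sketch with scikit-learn's HistGradientBoostingClassifier on toy integer-encoded data (the data and column layout are made up):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# toy data: column 0 is an integer-encoded category, column 1 is numeric
X = np.array([[0, 1.5], [1, 2.0], [0, 0.5], [2, 3.0], [1, 1.0], [2, 2.5]])
y = np.array([0, 1, 0, 1, 0, 1])

# tell the model which columns are categorical so it splits on them natively
clf = HistGradientBoostingClassifier(categorical_features=[0],
                                     min_samples_leaf=1)
clf.fit(X, y)
print(clf.predict(X))
```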

Cross-Validation: Use suitable cross-validation approaches when working with categorical data in machine-learning models to avoid data leakage and overfitting.
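
A sketch with scikit-learn's StratifiedKFold; the key point is that any encoder is fitted inside the training fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# stratified folds preserve the class balance in each split;
# fit encoders (e.g., target encoding) on the training fold ONLY
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ...encode on X_train, transform X_test, then fit and score the model
```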

Handling High Cardinality: High cardinality refers to categorical variables with a large number of distinct categories. Techniques such as target encoding, grouping similar or rare categories, and employing embeddings can be useful in such cases.
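
One simple grouping strategy in pandas, lumping rare categories into an "Other" bucket (threshold and data are made up):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "NYC", "NYC", "LA", "LA", "Boston", "Tulsa"]})

# lump categories that appear fewer than 2 times into "Other"
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "Other")
print(df["city_grouped"].value_counts())  # NYC 3, LA 2, Other 2
```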

Week-6 (10/18) Working on numerical data.

Stats

Minimum = 13
Maximum = 88
Mean = 32.6684
Median = 31
Stdev = 11.377
Skewness = 0.994064
Kurtosis = 3.91139

The statistics mentioned in class (mean, median, standard deviation, skewness, and kurtosis) provide valuable insight into the data's central tendency, spread, and the shape of its distribution.

  • Mean: The mean measures the central tendency of the data; the mean age reveals the average age of the individuals in our sample.
  • Median: When the data is sorted in ascending order, the median is the middle value. The median is a measure of central tendency that is less sensitive to outliers or extreme values than the mean.
  • Standard Deviation (Stdev): The standard deviation is a measure of the spread or dispersion of the data. A higher standard deviation indicates that ages are more variable, whereas a smaller standard deviation indicates that ages are closer to the mean. It indicates how far the data points differ from the mean age.
  • Skewness: Skewness measures the asymmetry of the data distribution. A positive skew indicates that the data is skewed to the right, with a longer tail on the right side of the distribution. In the context of an age column, a positive skew suggests that the sample contains mostly younger people, with a tail of older individuals.
  • Kurtosis: This measures the “tailedness” of the data distribution. High kurtosis indicates heavy tails (more extreme values in the data), whereas low kurtosis indicates lighter tails. Kurtosis above 3, the value for a normal distribution, implies more outliers or extreme values, whereas kurtosis below 3 suggests fewer outliers.

These statistics can help you understand the age distribution in your dataset.

  • The mean (32.67) is greater than the median (31), so the distribution of ages is skewed to the right.
  • The skewness is approximately 1 (0.994), which confirms a moderate right skew.
  • The kurtosis (3.91) is slightly greater than 3, the value for a normal distribution.
  • Overall, the age distribution roughly resembles a normal distribution with a right skew; the sketch below shows how these statistics can be computed.
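
A minimal sketch of computing these statistics with pandas and SciPy; the ages series is a made-up stand-in for the real age column. Note that fisher=False gives the Pearson kurtosis, for which a normal distribution scores 3 (matching the 3.91 reported above):

```python
import pandas as pd
from scipy import stats

# hypothetical stand-in for the dataset's age column
ages = pd.Series([19, 22, 25, 28, 31, 33, 35, 41, 52, 67])

print("min:", ages.min(), "max:", ages.max())
print("mean:", ages.mean(), "median:", ages.median())
print("stdev:", ages.std())                       # sample standard deviation
print("skewness:", stats.skew(ages))
print("kurtosis:", stats.kurtosis(ages, fisher=False))  # normal dist = 3
```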

T-Test

A t-test is a statistical hypothesis test that is used to assess whether or not there is a significant difference in the means of two groups or conditions. It’s a common approach for comparing the means of two samples to determine whether the observed differences are due to a true effect or just random variation.

There are two common types of t-test:

  • Independent Two-Sample T-Test: This test is used when comparing the means of two independent groups or samples.
  • Paired T-Test: This test is used when comparing the means of two related groups, such as before and after measurements on the same subjects or matched pairs of observations.

The decision to reject the null hypothesis in both t-tests is determined by the p-value and the significance level (usually 0.05). If the p-value is below the significance level, the null hypothesis is rejected, indicating a significant difference. If the p-value is greater than the significance level, you retain the null hypothesis because there is insufficient evidence of a meaningful difference.
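
A sketch of both tests with SciPy on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group1 = rng.normal(30, 10, size=50)   # hypothetical sample 1
group2 = rng.normal(35, 10, size=50)   # hypothetical sample 2

# independent two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

# a paired t-test would use stats.ttest_rel(before, after) instead
if p_value < 0.05:
    print("Reject the null hypothesis: the means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```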

Cohen’s d

Cohen’s d is a popular measure of effect size that has been elaborated upon by various scholars, including Sawilowsky.
Sawilowsky’s expansion contains extra criteria and interpretations for Cohen’s d values, offering more context for interpreting effect sizes.

Sawilowsky Method

    • Calculate the means first.
    • Calculate the pooled standard deviation.
    • Calculate Cohen’s d and interpret it with the rules of thumb:
      • Very small: d ≈ 0.01
      • Small: d ≈ 0.2
      • Medium: d ≈ 0.5
      • Large: d ≈ 0.8
      • Very large: d ≈ 1.2
      • Huge: d ≈ 2.0

(Small, medium, and large are Cohen’s original benchmarks; very small, very large, and huge are Sawilowsky’s additions.)

How to calculate?

  • Calculate the means of the two groups (M1 and M2).
  • Calculate the standard deviations of the two groups (SD1 and SD2).
  • Calculate the pooled standard deviation: pooled SD = sqrt(((n1 − 1)·SD1² + (n2 − 1)·SD2²) / (n1 + n2 − 2)), where n1 and n2 are the sample sizes.
  • Calculate Cohen’s d: d = (M1 − M2) / pooled SD.
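
These steps translate directly into Python; the two samples below are hypothetical:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    nx, ny = len(x), len(y)
    sd1, sd2 = np.std(x, ddof=1), np.std(y, ddof=1)
    pooled_sd = np.sqrt(((nx - 1) * sd1**2 + (ny - 1) * sd2**2) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# hypothetical samples
a = [30, 32, 35, 29, 31]
b = [25, 27, 26, 28, 24]
print(cohens_d(a, b))  # ~2.7: "huge" on Sawilowsky's scale
```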

What do we get from the above?

  • Interpretation of Effect Size: Sawilowsky stressed that the magnitude of an effect should be interpreted in its specific setting. An effect size that is considered minor in one domain may be significant in another, so context must be taken into account.
  • Direction of Effect: As a measure of magnitude, Cohen’s d is often reported without information about the direction of the effect. Sawilowsky suggests including direction information, such as “positive Cohen’s d” or “negative Cohen’s d,” to indicate whether the effect corresponds to an increase or decrease in the outcome.

In conclusion, Sawilowsky’s extension of Cohen’s d provides context-specific guidance that helps us understand effect sizes. It recognizes that the relevance of an effect varies depending on the problem.

Week-6 (10/16) GeoPy.

GeoPy is a third-party Python library used to locate the coordinates of addresses, cities, and countries across the globe. It can be installed with the pip command “pip install geopy”; more information is available on the project’s website.

Tools in this space include Geopositioning, GeoListPlot, GeoHistogram, Geodistance, and more. (Note that GeoListPlot and GeoHistogram are Wolfram Language functions rather than part of GeoPy; GeoPy itself focuses on geocoding and distance calculations.)

    • Geopositioning: This is used to estimate a location’s position or geographic coordinates (such as latitude and longitude). In short, it provides the location of a particular point or place that the data describes.
    • GeoListPlot: This is used to create geographic plots of data points, which helps us visualize all the points on a map. We can customize the points with colors or labels as convenient. This can be done with simple code: GeoListPlot[data], where data holds all the (longitude, latitude) pairs.
    • GeoHistogram: This is used to visualize geographic data as a histogram of where data points fall within a specific region. It helps us visualize the distribution of values related to geographic features.
    • Geodistance: This is used to calculate the geographic distance from one point to another, expressed in units such as miles or kilometers. It typically uses a geodesic formula for the distance between the two points.
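
A minimal sketch of GeoPy’s actual API, geocoding two addresses and measuring the geodesic distance between them; the user_agent string is a placeholder, and Nominatim needs a network connection:

```python
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

# Nominatim asks for an identifying user_agent (placeholder here)
geolocator = Nominatim(user_agent="my_student_project")

boston = geolocator.geocode("Boston, MA")
nyc = geolocator.geocode("New York, NY")
print(boston.latitude, boston.longitude)

# geodesic distance between the two (lat, lon) points
dist = geodesic((boston.latitude, boston.longitude),
                (nyc.latitude, nyc.longitude))
print(round(dist.miles, 1), "miles")  # roughly 190 miles
```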

Clustering

Why do we need clustering?

Clustering is a technique used to analyze data, and it appears in machine-learning algorithms across different domains:

    • Data Reduction: This helps us simplify extensive datasets by reducing a large number of data points to a smaller collection of cluster centers or centroids.
    • Pattern Discovery: This enables the discovery of concealed patterns and structures present in data. By categorizing similar data points into groups, it uncovers valuable insights that may not be evident when analyzing the data as a whole.
    • Segmentation and Targeting: Clustering divides the data into groups or segments with similar behaviors or characteristics. This allows us to target specific groups or subsets of the data (see the sketch below).
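
A minimal clustering sketch with scikit-learn’s KMeans on made-up (latitude, longitude) points:

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical (latitude, longitude) points: two loose geographic groups
coords = np.array([[42.36, -71.06], [42.34, -71.10], [42.40, -71.00],
                   [40.71, -74.01], [40.73, -73.99], [40.68, -74.05]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(coords)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two centroids
```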

Week-5 (10/11) New dataset, new project.

In today’s class, the professor introduced a new dataset about police shootings. I tried to get some initial insights from it and found that the whole dataset is categorical and needs to be cleaned, since there are many empty cells that could skew the model’s results.

Categorical data comes in two types: nominal data and ordinal data. In the given data, fields such as name and gender are nominal.

Ordinal data is categorical data whose rank order is meaningful; I couldn’t find any ordinal variables in this dataset.

In the upcoming weeks, I’ll load the data into Python, research how to work with categorical data, clean the data, and start drawing insights from the categorical variables. A first step might look like the sketch below.
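
A minimal sketch, assuming the data is a CSV file (the filename is hypothetical):

```python
import pandas as pd

# hypothetical filename for the police-shooting dataset
df = pd.read_csv("police_shootings.csv")

print(df.dtypes)        # confirm which columns are categorical
print(df.isna().sum())  # count the empty cells per column
```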