Week-8 (10/30) Age and Race.

While exploring the data, I found two columns called Age and Race.
Age is quantitative data and Race is qualitative data. Checking the data types, age is float64 and race is an object.
Race has 6 different categories: Asian, Black, Hispanic, Native American, Other, and White.
For the “age” variable, descriptive statistics will usually include measures summarizing the distribution and central tendency of the ages in the sample. Common descriptive statistics for a numerical variable such as “age” are:

  1. Mean: This is the average age, calculated by summing up all the ages and dividing by the total number of observations.
  2. Median: The middle value of the ages when they are arranged in ascending order. It’s less affected by extreme values (outliers) than the mean.
  3. Mode: The age that appears most frequently in the dataset.
  4. Range: The difference between the maximum and minimum ages in the dataset, providing an idea of the spread of ages.
  5. Standard Deviation: A measure of the dispersion or variability of ages. It tells you how spread out the ages are from the mean.
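
As a minimal sketch, assuming the data is loaded into a pandas DataFrame called df with an age column (the file name here is a placeholder):

```python
import pandas as pd

df = pd.read_csv("police_shootings.csv")   # placeholder file name

age = df["age"].dropna()
print(age.mean())                 # 1. mean
print(age.median())               # 2. median
print(age.mode()[0])              # 3. mode (most frequent age)
print(age.max() - age.min())      # 4. range
print(age.std())                  # 5. standard deviation
```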

Race (Categorical Variable – object):
The descriptive statistics for the “race” variable mainly summarize the distribution of the various race categories. Since it is a categorical variable with distinct categories (Asian, Black, Hispanic, Native American, Other, and White), common descriptive statistics for “race” include the following:

  1. Frequency Table: This table will show the count of each race category in the dataset, giving you an idea of how many individuals belong to each racial group.
  2. Percentage Distribution: You can calculate the percentage of each race category by dividing the count of each category by the total number of observations. This helps you understand the relative proportions of each racial group in the dataset.
  3. Mode: In this context, the mode represents the most common race category in the dataset.
  4. Bar Chart: A visual representation of the frequency or percentage distribution of different race categories using bar charts can provide a more intuitive view of the data.
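
A small sketch of these summaries, assuming the same DataFrame df with a race column:

```python
import matplotlib.pyplot as plt

counts = df["race"].value_counts()                             # 1. frequency table
percentages = df["race"].value_counts(normalize=True) * 100    # 2. percentage distribution
print(counts, percentages, df["race"].mode()[0])               # 3. mode (most common race)

counts.plot(kind="bar")                                        # 4. bar chart of the counts
plt.xlabel("Race")
plt.ylabel("Count")
plt.show()
```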

Once the data is cleaned, the age column is completely numeric, so I plotted a Q-Q plot.
In a Q-Q plot, the data are compared to an expected theoretical distribution, typically the normal distribution. Deviations from the reference line suggest departures from that distribution, whereas points falling along a straight line show close agreement. In data analysis and statistics, it is a tool for checking model fit, identifying outliers, and verifying the normality of the data.

Our Q-Q plot looks like this

The Quantile-Quantile plot, or Q-Q plot for short, shows us how closely the quantiles (percentiles) in a dataset match the predicted values from a theoretical distribution, most commonly the normal distribution. A straight line connecting all of the dots in the Q-Q plot indicates that the dataset closely resembles the theoretical distribution.
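
A sketch of how a Q-Q plot like this could be produced with SciPy, assuming the cleaned df from earlier:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Compare the age quantiles against a theoretical normal distribution
stats.probplot(df["age"].dropna(), dist="norm", plot=plt)
plt.title("Q-Q plot of age")
plt.show()
```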

Once we read the data, we moved on to Race: we split the data into the six race categories and, for each group, calculated the mean, median, standard deviation, variance, skewness, and kurtosis of age.
Using all of these statistics, we get a clearer picture of, and better insights into, the Washington police shooting data.
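
A minimal sketch of those per-race statistics with pandas (note that pandas reports excess kurtosis, where a normal distribution is about 0):

```python
import pandas as pd

group_stats = df.groupby("race")["age"].agg(
    mean="mean", median="median", stdev="std", variance="var", skewness="skew"
)
group_stats["kurtosis"] = df.groupby("race")["age"].apply(pd.Series.kurt)
print(group_stats)
```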

We also used a t-test and an ANOVA test.
An analysis of variance, or ANOVA, is a statistical test that compares the means of three or more groups to see if there are any noteworthy differences between them. It evaluates whether the variation between group means is greater than would be expected by chance. ANOVA provides an overall significance level, which indicates whether at least one group differs considerably from the others. If the result is significant, post-hoc tests can determine which particular groups differ. In experimental research, ANOVA is frequently used to examine how various factors or treatments affect a dependent variable and to test the null hypothesis.
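
A minimal sketch of a one-way ANOVA on age across the race groups, using SciPy and the df from before:

```python
from scipy import stats

groups = [g["age"].dropna() for _, g in df.groupby("race")]   # one age sample per race
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs
```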

Week-7 (10/28) K-4.

K-4 Algorithm

Data clustering is a vital tool in unsupervised machine learning, and one way to accomplish it is the K-4 algorithm, sometimes referred to as K-4 clustering. Clustering involves grouping comparable data items according to their shared attributes. Specifically, K-4 is a variant of the widely recognized K-Means method.
“K” in the K-4 method stands for the number of clusters to be formed. In contrast to general K-Means, where K can be set to any number of clusters, K-4 explicitly seeks to separate the data into four different clusters. When you have prior knowledge or a particular application that needs precisely four clusters, this can be helpful.

The K-4 algorithm works as follows: data points are first assigned to clusters, and these assignments are then iteratively refined to reduce the within-cluster variance. This refinement process continues until convergence, leaving the data divided into four clusters.
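
A minimal NumPy sketch of that assign-and-refine loop with K = 4; the toy data and all names here are illustrative, not from the actual project:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 2-D data with four loose groups along the diagonal
X = rng.normal(size=(200, 2)) + 3 * rng.integers(0, 4, size=(200, 1))

K = 4
centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers

for _ in range(100):
    # assignment step: each point joins its nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each center becomes the mean of its assigned points
    new_centers = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
        for k in range(K)
    ])
    if np.allclose(new_centers, centers):                # converged
        break
    centers = new_centers

print(centers)   # final centers of the four clusters
```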

Similar to K-Means, K-4 is a flexible method that has uses in data analysis, image segmentation, and consumer segmentation, among other domains. You must select a suitable value for K (four in this case) depending on the particular issue and dataset that you are dealing with. When used appropriately, the K–4 algorithm can be a potent tool for pattern recognition and data management.

Week-7 (10/23) K-Means.

Lemniscate

In quantitative analysis, a lemniscate is a mathematical curve that resembles a figure-eight (∞).

It denotes a distinct link between two variables that exhibits a balanced, symmetric correlation. This form denotes a complex, entangled relationship between the variables, which frequently necessitates the use of specialized approaches for proper modeling and interpretation. It’s an important idea to understand when dealing with non-linear connections and improving the accuracy of statistical models in a variety of analytical domains.

DBSCAN

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a popular unsupervised clustering algorithm in data analysis.

It classifies data points according to their density, defining clusters as locations with many close points and labeling solitary points as noise. It does not require a prior specification of the number of clusters, making it suitable for a wide range of data types and sizes. DBSCAN distinguishes between core points (in dense areas), border points (on cluster edges), and noise points. It handles irregularly shaped data effectively and is excellent for locating clusters in spatial or density-related datasets, such as identifying hotspots in geographic data.
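
A small scikit-learn sketch of DBSCAN on synthetic 2-D points; the eps and min_samples values are illustrative and would need tuning for real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),   # dense cluster 1
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),   # dense cluster 2
    rng.uniform(-2, 7, size=(20, 2)),                       # scattered noise points
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))   # cluster labels; -1 marks points classified as noise
```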

Clustering

Clustering is a technique that groups data points based on their similarity. It aims to discover patterns, structures, or correlations in data, helping us gain insights, spot trends, and simplify complex datasets and decision-making by combining related data points into clusters.

    1. K-Means: Divides data into ‘k’ clusters.
    2. Hierarchical: Organizes data into a tree-like structure.
    3. DBSCAN: Identifies clusters based on data point density.
    4. Mean Shift: Finds density peaks in the data. 

K-means

K-means is a popular clustering technique in machine learning; when K (the number of clusters) is set to 2, it partitions the data into two distinct groups.

The technique begins by selecting two starting cluster centers at random and then allocates each data point to the nearest center. It computes the (typically Euclidean) distance between each point and the cluster centers, then assigns each point to the cluster with the closest center. This process is repeated until the assignment of points to clusters barely changes.

The approach attempts to minimize within-cluster variation by bringing data points inside the same cluster as close together as possible. It optimizes by recalculating cluster centers as the mean of each cluster’s data points. Because K-means can converge to a local minimum, resulting in different results on various runs, it’s typical to run it numerous times and choose the best result.

The K=2 case is handy for binary clustering (0 and 1), such as categorizing email as spam or not spam, in short, yes or no. It’s also used in market segmentation, detecting customer preferences, and any other situation where data can be divided into two distinct groups, making it a key tool.
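
A minimal scikit-learn sketch of K-means with K = 2, running several random initializations (n_init) and keeping the best result, as described above; the data here is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),     # group near the origin
               rng.normal(6, 1, size=(100, 2))])    # group near (6, 6)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])   # binary cluster assignments (0 or 1)
print(km.inertia_)       # within-cluster variation being minimized
```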

K-means with K=4 is a machine learning algorithm that separates a dataset into four independent clusters. It works by assigning data points to the nearest cluster center based on similarity and then recalculating the centers as the mean of the data points in each cluster. This method is effective for categorizing data into four separate groups, which can help with customer segmentation, image compression, or any other situation where dividing data into four relevant categories is critical for analysis and decision-making.

Week-6 (10/20) Working on Project.

Today I was working on the outline of my project and figured out the outline of the whole project.

Label Encoding: In label encoding, each category is assigned a unique numerical value. However, because it assumes an ordinal relationship between the categories, this may not be appropriate for all algorithms. For this reason, Scikit-learn provides the LabelEncoder.
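
A quick sketch with scikit-learn's LabelEncoder, assuming the police-shooting DataFrame df with a race column and no missing values after cleaning:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["race_encoded"] = le.fit_transform(df["race"])   # each category gets an integer code
print(list(le.classes_))                            # the category-to-code ordering
```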

One-Hot Encoding: For each category, this approach generates binary columns. Each category is converted into a binary vector, with 1s in the respective category column and 0s everywhere else. This is more appropriate when there is no intrinsic order in the categories. One-hot encoding is made simple by libraries like pandas.

Dummy Variables: When using one-hot encoding, you may encounter multicollinearity problems, in which one column may be predicted from the others. You can use n-1 columns in this situation and drop one category as a reference.
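
A sketch of both ideas with pandas; drop_first=True keeps n-1 dummy columns to sidestep the multicollinearity issue just described:

```python
import pandas as pd

one_hot = pd.get_dummies(df["race"], prefix="race")                    # one binary column per category
dummies = pd.get_dummies(df["race"], prefix="race", drop_first=True)   # n-1 columns, one dropped as reference
df = pd.concat([df, dummies], axis=1)
```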

Frequency Counts: Features can be created based on the frequency of each category. This is beneficial if the number of categories is important information for your analysis.

Target Encoding (Mean Encoding): In some circumstances, the mean of the target variable for that category might be used to substitute categories. This can help with regression difficulties. However, concerns such as target leakage and overfitting must be addressed.
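
A sketch of the frequency-count and target-encoding ideas above; the target column name is a placeholder, and in practice the category means should be computed on the training split only to avoid target leakage:

```python
# frequency counts: replace each category with how often it occurs
freq = df["race"].value_counts()
df["race_freq"] = df["race"].map(freq)

# target (mean) encoding: replace each category with the mean of a numeric target
means = df.groupby("race")["target"].mean()   # "target" is a placeholder column
df["race_target_enc"] = df["race"].map(means)
```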

Missing Data: Determine how to handle missing categories. You can construct a separate “missing” category or use methods like mode, median, or a specific value to impute missing values.
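
For example, with pandas either option might look like this:

```python
df["race"] = df["race"].fillna("missing")             # keep a separate "missing" category
# or impute with the most frequent category (the mode) instead
df["race"] = df["race"].fillna(df["race"].mode()[0])
```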

Visualization: To show the distribution of categorical data, use plots such as bar charts, histograms, and pie charts. This can help you better grasp the facts.
Statistical Tests: To test for independence between two categorical variables, use statistical tests such as Chi-Square or Fisher’s exact test.
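
A sketch of a Chi-Square test of independence between two categorical columns with SciPy; the gender column is assumed from the week-5 notes:

```python
import pandas as pd
from scipy.stats import chi2_contingency

contingency = pd.crosstab(df["race"], df["gender"])          # observed counts
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(chi2, p_value)   # a small p-value suggests race and gender are not independent
```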

Feature Selection: In machine learning, feature selection strategies may be required to determine the most essential categorical variables for your model.

Decision trees, random forests, and gradient-boosting models can handle categorical data without requiring one-hot encoding. They automatically partition nodes based on categorical features.

Cross-Validation: Use suitable cross-validation approaches when working with categorical data in machine-learning models to avoid data leaking and overfitting.

Handling High Cardinality: High cardinality refers to categorical variables with a large number of distinct categories. Techniques such as target encoding, grouping comparable categories, and employing embeddings may be useful in such cases.

Week-6 (10/18) Working on numerical data.

Stats

Minimum = 13.
Maximum = 88.
Mean = 32.6684
Median = 31.
stdev = 11.377
Skewness = 0.994064
Kurtosis = 3.91139

The statistics mentioned in class, mean, median, standard deviation, skewness, and kurtosis, provide valuable insights into the distribution of data like spread, central tendency, and shape of the data distribution.

  • Mean: It indicates the central tendency of the data; the mean age reveals the average age of the individuals in our sample.
  • Median: When the data is sorted in ascending order, the median is the middle value. The median is a measure of central tendency that is less sensitive to outliers or extreme values than the mean.
  • Standard Deviation (Stdev): The standard deviation is a measure of the spread or dispersion of the data. A higher standard deviation indicates that ages are more variable, whereas a smaller standard deviation indicates that ages are closer to the mean. It indicates how far the data points differ from the mean age.
  • Skewness: Skewness measures the asymmetry of the data distribution. A positive skew indicates that the data is skewed to the right, with the right side of the distribution having a longer tail. In the context of an age column, a positive skew may suggest that the dataset contains mostly younger persons, with a longer tail of older ages.
  • Kurtosis: This measures the “tailedness” of the data distribution. High kurtosis shows heavy tails, or more extreme values in the data, whereas low kurtosis indicates lighter tails. Positive kurtosis implies that the distribution has more outliers or extreme values, whereas negative kurtosis suggests that the distribution contains fewer outliers.

These statistics can help you understand the age distribution in your dataset.

  • The mean (32.67) is greater than the median (31), so the distribution of ages is skewed to the right.
  • The skewness is approximately 1, which also indicates a right skew.
  • The kurtosis (3.91) is slightly greater than 3, the value for a normal distribution, so the tails are slightly heavier than normal.
  • Overall, the age distribution roughly resembles a normal distribution, but with a right skew and slightly heavier tails.
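
A sketch of how these summary statistics could be reproduced; SciPy's kurtosis with fisher=False reports plain (not excess) kurtosis, where a normal distribution equals 3, matching the value above:

```python
from scipy import stats

age = df["age"].dropna()
print(age.min(), age.max(), age.mean(), age.median(), age.std())
print(stats.skew(age))                     # skewness
print(stats.kurtosis(age, fisher=False))   # kurtosis (normal distribution = 3)
```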

T-Test

A t-test is a statistical hypothesis test that is used to assess whether or not there is a significant difference in the means of two groups or conditions. It’s a common approach for comparing the means of two samples to determine whether the observed differences are due to a true effect or just random variation.

There are two common types of t-test:

  • Independent Two-Sample T-Test: This test is used when comparing the means of two independent groups or samples.
  • Paired T-Test: This test is used when comparing the means of two related groups, such as before and after measurements on the same subjects or matched pairs of observations.

The choice to reject the null hypothesis in both t-tests is determined by the p-value and the significance threshold (usually 0.05). If the p-value is below the significance level, the null hypothesis is rejected, indicating a significant difference. If the p-value is greater than the significance level, you keep the null hypothesis because there is insufficient evidence of a meaningful difference.
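
A sketch of an independent two-sample t-test with SciPy, comparing ages between two race groups; the choice of groups is illustrative:

```python
from scipy import stats

group_a = df.loc[df["race"] == "Black", "age"].dropna()
group_b = df.loc[df["race"] == "White", "age"].dropna()

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
print(t_stat, p_value)   # reject the null hypothesis if p_value < 0.05
```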

Cohen’s d

Cohen’s d is a popular measure of effect size that has been elaborated upon by various scholars, including Sawilowsky.
Sawilowsky’s expansion contains extra criteria and interpretations for Cohen’s d values, offering more context for interpreting effect sizes.

Sawilowsky Method

    • Calculate the means first.
    • Calculate the pooled standard deviation.
    • Calculate Cohen’s d.
      • A small Cohen’s d value (approx. 0.2) indicates a small effect.
      • A medium Cohen’s d value (approx. 0.5) indicates a moderate effect.
      • A large Cohen’s d value (approx. 0.8 or higher) indicates a large effect.

How to calculate?

  • Calculate the means of the two groups (M1 and M2).
  • Calculate the standard deviations of the two groups (SD1 and SD2).
  • Calculate the pooled standard deviation: pooled SD = sqrt(((n1 - 1) * SD1^2 + (n2 - 1) * SD2^2) / (n1 + n2 - 2)).
  • Calculate Cohen’s d: d = (M1 - M2) / pooled SD.
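
A small sketch of that calculation, reusing group_a and group_b from the t-test sketch above:

```python
import numpy as np

n1, n2 = len(group_a), len(group_b)
m1, m2 = group_a.mean(), group_b.mean()
sd1, sd2 = group_a.std(ddof=1), group_b.std(ddof=1)

pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
cohens_d = (m1 - m2) / pooled_sd
print(cohens_d)   # roughly 0.2 = small, 0.5 = medium, 0.8+ = large
```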

What do we get from the above?

  • Interpretation of Effect Size: Sawilowsky stressed that the magnitude of an effect should be interpreted in the context of the specific setting. An effect size that is considered minor in one domain may be significant in another, so we need to take that context into account.
  • Direction of Effect: As a measure of effect size, Cohen’s d does not by itself include information regarding the direction of the effect. Sawilowsky suggests including direction information, such as “positive Cohen’s d” or “negative Cohen’s d,” to indicate whether the effect corresponds to an increase or decrease in the outcome.

In conclusion, Sawilowsky’s extension of Cohen’s d provides context-specific guidance that helps us interpret effect sizes. It recognizes that the relevance of an effect varies depending on the problem.

Week-6 (10/16) GeoPy.

GeoPy is a Python library used to locate the coordinates of addresses, cities, and countries across the globe. It can be installed using the pip command “pip install geopy”, and more information is available on the project’s website.

This library, GeoPy, includes Geopositioning, GeoListPlot, GeoHistogram, Geodistance, and many more:

    • Geopositioning: This is used to estimate a given location’s position or geographic coordinates (such as latitude and longitude). In short, it provides the location of the particular point or place that the data describes.
    • GeoListPlot: This is used to create geographical plots of data points, which helps us visualize all the points or data on a map. We can customize the points with colors or labels as convenient. This can be done with simple code such as GeoListPlot[data], where data holds all of our (longitude, latitude) pairs.
    • GeoHistogram: This is used to visualize geographic data as a histogram of where data points fall within a specific range or region. It helps us visualize the distribution of values related to geographic features.
    • Geodistance: This is used to calculate the geographic distance from one point to another, in units such as miles or kilometers. It typically uses a geodesic (great-circle) formula to find the distance between the two points.
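
A minimal GeoPy sketch covering geocoding and geodesic distance (it needs an internet connection, and the address and user_agent strings are placeholders):

```python
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

geolocator = Nominatim(user_agent="example-app")     # placeholder user agent
boston = geolocator.geocode("Boston, MA")
seattle = geolocator.geocode("Seattle, WA")

dist = geodesic((boston.latitude, boston.longitude),
                (seattle.latitude, seattle.longitude))
print(dist.miles, dist.km)   # distance in miles and kilometers
```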

Clustering

Why do we need clustering?

Clustering is a technique used in data analysis and in machine learning algorithms across different domains:

    • Data Reduction: This helps us simplify extensive datasets by reducing the number of data points to a smaller collection of cluster centers or centroids.
    • Pattern Discovery: This enables the discovery of concealed patterns and structures present in data. By categorizing similar data points into groups, it uncovers valuable insights that may not be evident when analyzing the data as a whole.
    • Segmentation and Targeting: Clustering helps divide data into groups or segments with similar behaviors or characteristics. This allows us to target specific groups or sets of data.

Week-5 (10/11) New Data Set, New Project.

In today’s class, the professor introduced us to a new data set about police shootings. I tried to get some initial insights from the data set and found that the whole data set is categorical data and needs to be cleaned, since there are many empty spaces or cells which can skew the results of the model.

In categorical data there are two types of data: nominal data and ordinal data. In the given data, name and gender belong to nominal data.

Ordinal data exhibits the properties of nominal data, but the order or rank of the values is also meaningful. In this data set, I couldn’t find any ordinal data.

In the upcoming weeks, I’ll load the data into Python and start to get some insights from the categorical data. I’ll also do some research on how to work with categorical data, and I’ll clean the data.

Week 4 (10/6) Working with new library and Project Update.

I was working with a library called Seaborn, which helps us plot or visualize things in an appealing way so that we can gain insight faster. While I was working, I found that visualizing things with graphs is more efficient than looking at raw numbers. With all the colors and plots, it’s easy to spot patterns, so I used this library for plotting the training and testing data.

With respect to the project, I split the data with a ratio of 0.7 for training and 0.3 for testing. X_train and y_train are used to train the model, and X_test and y_test are used to test it. Once the model is ready, we calculate the predicted y values from X_test and find the mean error by comparing y_test (the actual values) with the predicted values.
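
A minimal scikit-learn sketch of that split/train/evaluate workflow; the synthetic data and the LinearRegression model are placeholders for whatever the project actually uses:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# synthetic placeholder data; the project would use its own features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)   # train on the 70% split
y_pred = model.predict(X_test)                     # predict for X_test
print(mean_absolute_error(y_test, y_pred))         # mean error vs. y_test (actual values)
```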