Week-11 (11/29) Autoregression.

Autoregression is a statistical technique for modeling the connection between a variable and its previous values. In other words, it represents the concept that prior values might impact a time series data point.
An AR(p) autoregressive model predicts the current value of a time series based on the “p” most recent observations.


The AR(p) model can be written as

Yt = φ1·Yt−1 + φ2·Yt−2 + … + φp·Yt−p + εt

where
Yt is the current value of the time series,
φ1, …, φp are the autoregressive coefficients, which capture the strength and sign of the relationship between the current value and each past value,
Yt−1, …, Yt−p are the past observations, and
εt is the white noise or error term at time t, the random variation not explained by the model.

The order p defines how many previous observations the model takes into account. A larger p implies a longer memory, i.e., dependence on more of the past values.
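A minimal sketch of fitting an AR(p) model in Python, assuming statsmodels is available; the synthetic random-walk series and the choice of p = 3 are purely illustrative:

```python
# Minimal AR(p) sketch using statsmodels' AutoReg on a synthetic series;
# the data, lag order, and seed are arbitrary choices for illustration.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))      # a random-walk-like series

res = AutoReg(y, lags=3).fit()           # AR(3): predict y_t from its 3 most recent values
print(res.params)                        # intercept followed by phi_1..phi_3

# Forecast the next five observations from the fitted coefficients.
print(res.predict(start=len(y), end=len(y) + 4))
```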

Autoregressive models are commonly employed in time series analysis, especially when the data shows indications of temporal dependency. They belong to a larger class of models known as autoregressive integrated moving average (ARIMA) models, which combine autoregressive and moving average components to provide a more thorough analysis of time series data.

In practice, an AR model helps uncover patterns, trends, and relationships within a time series and thereby assists in predicting future values. The term “white noise” refers to random, unexplained variation or external effects. AR models are essential building blocks of more advanced models such as Autoregressive Integrated Moving Average (ARIMA), which are used extensively in sectors ranging from finance to climate research, where understanding and forecasting temporal trends is essential.

Week-11 (11/27) Planning and strategies for Boston data set.

Exploratory Data Analysis (EDA):

Compute Basic Statistics:

Mean, Median, Standard Deviation:

For each important economic indicator, compute the mean (average), median (middle value), and standard deviation
(a measure of spread). These statistics summarize the central tendency and variability of the data.
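A minimal sketch of this step, assuming the indicators live in a monthly CSV file; the file name boston_indicators.csv and the "month" column are placeholders, not the actual export:

```python
# Basic-statistics sketch; the file name and "month" index column are assumptions.
import pandas as pd

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")

# Mean, median, and standard deviation for every numeric indicator column.
print(df.select_dtypes("number").agg(["mean", "median", "std"]))
```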

Visualize Trends Over Time:

Line Charts or Time-Series Plots:

  • Plot each economic statistic on a monthly basis from January 2013 to December 2019.
  • Identify patterns, peaks, and troughs; a graphical view makes the general behavior of the economic data easier to grasp.
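A sketch of the monthly trend plot, reusing the same hypothetical boston_indicators.csv layout as above:

```python
# Time-series line plot sketch for the January 2013 - December 2019 window.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")
ax = df.loc["2013-01":"2019-12"].select_dtypes("number").plot(figsize=(10, 5))
ax.set_ylabel("Indicator value")
plt.show()
```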

Identify Seasonality or Patterns:

Seasonal Decomposition:

  • To find seasonal components in data, use approaches such as seasonal decomposition of time series (e.g., with Python tools such as ‘statsmodels’).
  • Look for repeating patterns or cycles at regular intervals.
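A sketch of the decomposition step with statsmodels; the permits_issued column is a hypothetical monthly indicator:

```python
# Seasonal-decomposition sketch: split one monthly indicator into trend,
# seasonal, and residual components (period=12 for monthly data).
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")
result = seasonal_decompose(df["permits_issued"], model="additive", period=12)
result.plot()
plt.show()
```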

Feature Engineering:

Create New Features:

Calculate Growth Rate:

  • Derive a new feature that captures the growth rate of each economic indicator, expressed as the
    percentage change from one period to the next.
  • Growth rates reflect the pace at which economic indicators rise or fall.
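A sketch of the growth-rate feature; the employment column name is an assumption:

```python
# Growth-rate feature sketch: month-over-month percentage change.
import pandas as pd

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")
df["employment_growth_pct"] = df["employment"].pct_change() * 100
print(df["employment_growth_pct"].head())
```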

Aggregating Data:

  • Aggregate the data by quarter or year to obtain a higher-level picture.
  • This helps smooth out noise in the data and reveal long-term patterns.
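A sketch of the aggregation step on the same hypothetical frame (resample frequency aliases vary slightly between pandas versions):

```python
# Aggregation sketch: roll monthly values up to quarterly and yearly means.
import pandas as pd

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")
num = df.select_dtypes("number")
quarterly = num.resample("Q").mean()
yearly = num.resample("Y").mean()
print(yearly.head())
```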

Correlation Analysis:

Explore Relationships:

Correlation Matrices:

  • Compute correlation coefficients (e.g., Pearson and Spearman) between pairs of economic indicators.
  • A correlation matrix helps in understanding the strength and direction of these relationships. A value near 1 indicates a strong positive correlation, a value near −1 a strong negative correlation, and a value near 0 little or no correlation.
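A sketch of the correlation matrices on the hypothetical indicator frame:

```python
# Correlation-matrix sketch: Pearson and Spearman coefficients between
# all pairs of numeric indicator columns.
import pandas as pd

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")
num = df.select_dtypes("number")
print(num.corr(method="pearson").round(2))
print(num.corr(method="spearman").round(2))
```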

Scatter Plots:

  • Create scatter plots for pairs of economic indicators to inspect their relationships visually.
  • Look for linear or non-linear patterns in the scatter plots.
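A sketch of a scatter plot for one pair of indicators; both column names are placeholders:

```python
# Scatter-plot sketch for visually inspecting the relationship between two indicators.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("boston_indicators.csv", parse_dates=["month"], index_col="month")
df.plot.scatter(x="employment", y="permits_issued")
plt.show()
```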

Week-10 (11/20) Analysing new Boston data set.

The dataset, available at https://data.boston.gov/dataset/food-establishment-inspections/resource/4582bec6-2b4f-4f9e-bc55-cbaa73117f4c, is an invaluable resource for a Boston-based project focusing on food-industry inspections. It provides detailed information about inspections undertaken by the city’s health department, including the name and location of each establishment, inspection results, violation descriptions, and corrective measures taken. With this information, the project can investigate patterns and trends in food safety across the city, suggest areas for improvement, and perhaps contribute to the community’s general well-being.

You may examine the data to learn about the most prevalent types of violations, the distribution of inspection scores across various neighborhoods, or trends over time. Exploring this information may reveal locations with more frequent violations, allowing local authorities to concentrate their resources better and improve food-safety policies. Furthermore, this research might contribute to the broader conversation about public health and safety by providing concrete recommendations for improving the overall quality of Boston’s food outlets.

A clear strategy must be developed before proceeding with the project using the food-establishment inspection data. First and foremost, an exploratory data analysis (EDA) is required to get a thorough grasp of the dataset’s structure and content. This involves examining key variables, checking for missing or inconsistent data, and spotting potential trends.

Consider specifying the project’s particular objectives after the EDA. This might include developing research questions such as identifying the most prevalent infractions, examining trends over time, or finding the relationship between inspection outcomes and geographical areas. Setting specific goals will guide the project’s future stages.

Once objectives have been established, more sophisticated analytics or machine learning approaches can be applied where appropriate: predictive modeling to identify locations at increased risk of violations, clustering analysis to discover similar patterns among establishments, or time-series analysis to detect trends over certain periods.

Maintain an emphasis on data visualization throughout the project to effectively communicate results. Use graphs, charts, and maps to show information clearly and understandably. Finally, establish project team collaboration and communication to mix varied experiences and viewpoints. Regular checkpoints and updates will ensure a smooth advancement toward the project’s objectives.

Week-10 (11/17) Analysing Time Series Model.

ARIMA Model

ARIMA, which stands for Autoregressive Integrated Moving Average, is a statistical approach to time series forecasting. It is a well-known and effective method for modeling and forecasting time-dependent data. ARIMA is made up of three major components: Autoregression (AR), Integration (I), and Moving Average (MA).

Autoregression (AR)

The autoregressive component models the relationship between an observation and a number of lagged observations (prior time steps).
The number of lagged observations included in the model is given by the “p” parameter, written AR(p). It is the number of time steps in the past that the model uses to forecast the current time step.

Integration (I)

The integration component applies differencing to make the time series stationary. Stationarity means that a time series’ statistical properties, such as mean and variance, stay constant over time.
The differencing parameter “d”, written I(d), gives the number of times differencing is applied to achieve stationarity.

Moving Average (MA)

The moving average component models the relationship between an observation and the residual errors from a moving average model applied to lagged observations.
The moving average order is given by the “q” parameter, written MA(q). It determines how many lagged forecast errors are included in the model.

When all of this is considered, the ARIMA model is frequently expressed as ARIMA(p, d, q), where:

  • p: The order of the autoregressive component.
  • d: The degree of differencing.
  • q: The order of the moving average component.

Randomly generated data was used to understand how the model works in practice.

The graph depicts a time series dataset and the forecast produced by an ARIMA model. The observed series, a cumulative sum of randomly generated normal values, is shown by the blue line; the figure makes the underlying trend and volatility in the synthetic data visible. The red line shows the ARIMA model’s forecast for the following ten steps. The ARIMA(1, 1, 1) model captures patterns and dependencies in the data, producing a forecast of future values based on the patterns it has learned. The graph thus illustrates the use of ARIMA for time series forecasting, showing both the actual data and the model’s predicted values over the chosen forecast horizon.
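A minimal sketch of the experiment described above (seed and series length are arbitrary):

```python
# ARIMA(1, 1, 1) sketch: fit a random-walk series (cumulative sum of normal
# draws) and forecast the next ten steps, plotting observed vs. forecast.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
series = np.cumsum(rng.normal(size=200))

res = ARIMA(series, order=(1, 1, 1)).fit()     # p=1, d=1, q=1
forecast = res.forecast(steps=10)

plt.plot(range(len(series)), series, color="blue", label="Observed")
plt.plot(range(len(series), len(series) + 10), forecast, color="red", label="ARIMA forecast")
plt.legend()
plt.show()
```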

Week-10 (11/15) Time Series Model for Boston data.

Time Series Data Analysis for Boston data set.

Time series data refers to data whose values are observed or recorded at different points in time. The data is ordered chronologically and is often used to analyze trends, patterns, and behaviors over time. In the context of the Certified Business Directory from the city of Boston, time series analysis might involve examining how certain attributes or characteristics of businesses change over time.

Steps for Time Series Analysis:

  1. Date or Time Attribute: Checking if the dataset includes a column that represents the time or date when each entry was recorded. In the case of the Certified Business Directory, this could be the date when a business was certified or some other relevant date.
  2. Data Exploration: Exploring the dataset to understand the structure and content. Look for columns that might be relevant to your analysis, such as business types, locations, and any numeric values that might vary over time.
  3. Trends and Patterns: Use time series plots, such as line charts, to visualize how specific attributes change over time. For example, you might want to see how the number of certified businesses has changed over the years.
  4. Seasonality and Cyclic Patterns: Checking for seasonality or cyclic patterns. Certain businesses may experience variations based on the time of the year. This could be relevant for businesses like tourism-related services that might see increased activity during certain months.
  5. Data Cleaning: Ensure that the time-related data is in a consistent format and handle any missing or anomalous values.
  6. Statistical Analysis: Use statistical methods to identify patterns, correlations, or anomalies in the time series data. This could involve calculating averages, identifying outliers, or using more advanced statistical techniques.
  7. Forecasting: Time series analysis can also be used for forecasting future values based on historical data. This might involve using techniques like ARIMA (AutoRegressive Integrated Moving Average) or machine learning models for time series forecasting.
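A small sketch of steps 1-3 for the Certified Business Directory; the file name and the certification_date column are assumptions about how the export is laid out:

```python
# Count how many businesses were certified each year and plot the trend.
import pandas as pd
import matplotlib.pyplot as plt

biz = pd.read_csv("certified_business_directory.csv", parse_dates=["certification_date"])
per_year = biz.set_index("certification_date").resample("Y").size()
per_year.plot(kind="bar", title="Certified businesses per year")
plt.show()
```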

Week-10 (11/13) Time Series Model.

Time series data analysis involves examining data points gathered over an extended period in order to find patterns, trends, and underlying structures. Fields including signal processing, finance, economics, and many more rely on time series data.
The following is a summary of the main procedures and methods used in time series data analysis:

  1. Data Collection:
    • Collect pertinent time series data, ensuring the intervals between observations are regular and uniform.
    • Handle missing values, outliers, and inconsistencies to guarantee the quality of your data.
  2. Exploratory Data Analysis (EDA):
    • To summarize the mean, variance, and distribution of the time series, perform descriptive statistics.
    • Use graphical tools such as line plots and histograms to visually represent the time series and detect anomalies, patterns, or seasonality.
  3. Time Series Decomposition:
    • Break the time series down into its parts, which are usually trend, seasonality, and residual/error.
      Decomposition aids in identifying anomalies and comprehending underlying patterns.
  4. Statistical Models:
    • Utilize statistical models to identify and measure the various time series components.
      Seasonal-trend decomposition (STL), autoregressive integrated moving average (ARIMA), and exponential smoothing techniques are common examples.
  5. Machine Learning Models:
    • For more intricate time series forecasting, use machine learning methods like neural networks, random forests, and support vector machines.
    • In this case, feature engineering and model validation are essential.
  6. Time Series Clustering:
    • Clustering algorithms group similar time series together, making it easier to identify shared patterns and spot anomalies.
  7. Cross-Validation:
    • Make sure your models generalize to unseen data by validating them with methods such as k-fold cross-validation; for ordered data, a time-aware split is preferable (see the sketch after this list).
  8. Real-Time Monitoring:
    • To adjust to evolving trends, use real-time monitoring for continuing analysis and changes.
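A sketch of the cross-validation idea adapted to ordered data, using scikit-learn's TimeSeriesSplit; the placeholder features and target stand in for real time series data:

```python
# Time-aware cross-validation sketch: TimeSeriesSplit keeps every training
# fold strictly earlier in time than its corresponding test fold.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # placeholder features
y = np.arange(100)                  # placeholder target

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train up to index {train_idx[-1]}, "
          f"test indices {test_idx[0]}-{test_idx[-1]}")
```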

Time Series Model Fit

There isn’t a dedicated method named TimeSeriesModelFit in the commonly used time series libraries for R (such as forecast or stats) or Python (such as statsmodels or scikit-learn); the name is best known as a built-in function of the Wolfram Language (Mathematica). It is also possible that newer packages or releases have added something similar, or that the name is specific to another piece of software.

Week-9 (11/10) Working on Decision Tree.

A decision tree is a popular machine-learning algorithm for classification and regression tasks. In the context of classification, a decision tree is a tree-like model where each internal node represents a decision based on the value of a particular feature, each branch represents the outcome of the decision, and each leaf node represents the final class label.

Decision Tree

Scikit-learn provides a built-in class for this, DecisionTreeClassifier.
While experimenting, I worked with one of the toy datasets bundled with scikit-learn, load_iris (a short sketch of this setup follows the parameter list below).
Important Parameters in DecisionTreeClassifier:

  1. criterion: The function to measure the quality of a split (e.g., “gini” for Gini impurity or “entropy” for information gain).
  2. max_depth: The maximum depth of the tree.
  3. min_samples_split: The minimum number of samples required to split an internal node.
  4. min_samples_leaf: The minimum number of samples required at a leaf node.
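A minimal sketch of the setup described above: a DecisionTreeClassifier trained on scikit-learn's bundled iris data. The parameter values shown are illustrative, not tuned:

```python
# Decision-tree sketch on the built-in iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_split=2, min_samples_leaf=1,
                             random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```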

How Decision Trees Work

  1. Decision Nodes:
    • Each internal node tests a specific attribute or feature.
    • The decision is based on the value of the feature.
  2. Branches:
    • Each branch represents the outcome of the decision.
    • Branches are labeled with the possible values of the decision attribute.
  3. Leaf Nodes:
    • Leaf nodes represent the final decision or class label.
    • Each leaf node is associated with a class label.
  4. Splitting:
    • The tree is built by recursively splitting the dataset based on the selected features.
    • The goal is to create pure leaf nodes with samples belonging to a single class.
  5. Stopping Criteria:
    • The tree-building process stops when a certain criterion is met (e.g., maximum depth reached or minimum samples in a leaf).

Decision trees are simple to use and can handle both numerical and categorical data. They are, however, prone to overfitting, especially when the tree depth is not well managed. Pruning and establishing a limited depth might assist in reducing overfitting.

I used the iris dataset mentioned earlier, and the output is visualized as a tree diagram, roughly as in the sketch below.
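A sketch of how that visualization can be produced with sklearn.tree.plot_tree, reusing the iris classifier from the earlier sketch:

```python
# Plot the fitted decision tree with feature and class names.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()
```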


The image generated by visualizing the decision tree provides a graphical representation of the structure and decision-making process of the trained model. Let’s break down what the image is telling us:

  1. Root Node:
    • The root node is the top node of the tree. It shows the feature and threshold used for the first split; the tree branches out from this point according to the different values of that feature.
  2. Internal Nodes:
    • Internal nodes are the nodes in the middle of the tree. Each internal node represents a decision made on the basis of a given feature and threshold, typically of the form “Is feature X greater than this threshold?”
  3. Leaf Nodes:
    • The leaf nodes are the terminal nodes at the bottom of the tree. Each leaf node represents a predicted class: the class assigned to instances that reach a leaf is the majority class within that node.
  4. Edges (Branches):
    • The results of the decisions are represented by the edges linking the nodes. For example, if a node’s decision is true, you follow the left branch; if it is false, you follow the right branch.
  5. Feature Names and Thresholds:
    • At each decision node, the names of the features and the threshold values used for splitting are shown. This information helps us understand the conditions under which the decision tree makes its decisions.
  6. Class Names:
    • The leaf nodes are labeled with the names of the classes if you specify class_names while drawing the tree. This makes interpreting the final anticipated classes simple.

We can understand how the model partitions the feature space and produces predictions based on the input features by looking at the decision tree visualization. It’s a useful tool for deciphering the model’s inner workings and determining the most significant characteristics in the decision-making process.

Week-9 (11/08) Decision Tree.

Decision Tree

Decision tree classification is a common machine learning approach that may be used to categorize data. It is based on the idea of a decision tree, which is a hierarchical framework for making decisions. Decision trees may be conceived of in the context of statistics as a means to represent and analyze the connections between distinct variables in a dataset.


Steps that we are using in our project

  1. Data Preparation: Begin with a dataset with a collection of characteristics (independent variables) and a target variable (the variable to predict or classify). In statistics, features are equivalent to predictor variables, while the response is the target variable.
  2. Splitting the Data: Separate the dataset into training and testing sets. The decision tree is built using the training set, and its performance is evaluated on the testing set.
  3. Building the Decision Tree: In Python, we can use libraries like scikit-learn to create a decision tree classifier. The algorithm recursively splits the data into subsets based on the feature that provides the best split according to a chosen criterion.
  4. Node Splitting: At each node of the tree, the method finds the feature that best divides the data into distinct groups. The objective is to reduce impurity and maximize information gain. In statistical terms, this is analogous to picking the variable that carries the most information for classification.
  5. Leaf Nodes: The decision tree keeps splitting the data until a stopping criterion, such as a maximum depth or a minimum number of samples per leaf node, is reached. The terminal nodes, also known as leaf nodes, represent the final classification decision: each leaf node is labeled with the majority class of the data points that reach it.
  6. Model Evaluation: After creating the decision tree, you may analyze its performance on the testing dataset. Common assessment criteria include accuracy, precision, recall, the F1-score, and the confusion matrix; these metrics show how successfully the model classifies the data (a short evaluation sketch follows this list).
  7. Visualization: To offer an interpretable picture of the classification rules, decision trees can be plotted. Packages such as scikit-learn can be used to plot and display the decision tree in Python.
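A sketch of the evaluation step (step 6), again on the iris toy data rather than the project dataset; the metrics mirror the list above:

```python
# Evaluation sketch: accuracy, precision/recall/F1, and the confusion matrix
# on a held-out test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))   # precision, recall, F1 per class
```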

Pros and Cons of Decision Tree

Pros

    1. Interpretability: Decision trees describe the decision-making process in a highly interpretable and visual manner. They are simple to grasp, making them an excellent choice for explaining the concept to non-technical stakeholders.
    2. No Assumptions about Data: Decision trees make no strong assumptions about the distribution of the underlying data. They can handle both numerical and categorical data and, with suitable algorithms, missing values.
    3. Feature Selection: Decision trees do feature selection on their own by selecting the most informative characteristics for splitting nodes. This might assist you in determining the most significant variables in your dataset.

Cons

  1. Overfitting: Overfitting is common in decision trees, especially when the tree is deep and complicated. This means the model may perform well on training data but poorly on new, previously unseen data.
  2. Instability: Small changes in data can result in drastically different tree architectures. Because of this instability, decision trees may be less dependable than alternative models.
  3. Bias Toward Dominant Classes: In classification tasks with imbalanced classes, decision trees may be biased towards the majority class, leading to poor classification of the minority class.