Module 05_Assignment.docx
School: Kennesaw State University
Course: IT 6933
Subject: Computer Science
Date: May 2, 2024
Type: docx (1 page)
Uploaded by BaronApe4233 on coursehero.com
Assignment (100 Points)
IT6933 Machine Learning Technology in FinTech
Module 05: K-Nearest Neighbors & Support Vector Machine

In this homework, we explore Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine models.
1) (50 points) Use the "credit_Dataset.arff" dataset and apply the Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine techniques using the WEKA tool in two different settings:
a. 10-fold cross-validation.
b. 80% training split.
Write a short paragraph about your findings and compare the results (accuracy). Use a table or a bar chart (MS Excel) to visualize the results.
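As an illustrative aside: WEKA is GUI-driven, so no code is required for this task, but the same two evaluation settings can be mirrored in scikit-learn. The sketch below uses a synthetic stand-in dataset (the ARFF file is not reproduced here), so the numbers are not the assignment's expected results — only the workflow is.

```python
# Sketch of task 1 in scikit-learn, on a synthetic stand-in for credit_Dataset.arff.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data; with the real file, scipy.io.arff.loadarff could be used instead.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

# Setting (b): a single 80% training / 20% test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, random_state=42)

results = {}
for name, model in models.items():
    cv_acc = cross_val_score(model, X, y, cv=10).mean()   # setting (a): 10-fold CV
    split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # setting (b): 80/20 split
    results[name] = (cv_acc, split_acc)
    print(f"{name}: 10-fold CV = {cv_acc:.3f}, 80/20 split = {split_acc:.3f}")
```

The `results` dictionary gives the accuracy pairs that the report's comparison table or Excel bar chart would summarize.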
2) (25 points) Use the "credit_Dataset.arff" dataset and apply K-Nearest Neighbors with different K values (1, 3, 5, 15). Visualize the results for the different K values using a bar chart.
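For reference, the K sweep can also be sketched outside WEKA. The snippet below evaluates the four required K values on a synthetic stand-in dataset (an assumption — the real ARFF data is not included here); the resulting per-K accuracies are exactly what the bar chart would plot.

```python
# Sketch of task 2: KNN accuracy for K = 1, 3, 5, 15 (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=42)

k_values = [1, 3, 5, 15]
accuracies = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    for k in k_values
}
for k, acc in accuracies.items():
    print(f"k = {k:2d}: accuracy = {acc:.3f}")

# The bar chart itself could be drawn with matplotlib, e.g.:
#   import matplotlib.pyplot as plt
#   plt.bar([str(k) for k in k_values], list(accuracies.values()))
#   plt.xlabel("K"); plt.ylabel("Accuracy"); plt.show()
```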
3) (25 points) Use the "credit_Dataset.arff" dataset and apply the Support Vector Machine with three different kernels. Write a short paragraph about your findings and compare the results.
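The kernel comparison follows the same pattern. Below is a minimal sketch assuming the three kernels chosen are linear, polynomial, and RBF (the assignment leaves the choice open), again on a synthetic stand-in dataset rather than the real ARFF file.

```python
# Sketch of task 3: SVM with three different kernels (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=42)

kernel_acc = {}
for kernel in ["linear", "poly", "rbf"]:
    kernel_acc[kernel] = cross_val_score(SVC(kernel=kernel), X, y, cv=10).mean()
    print(f"{kernel}: accuracy = {kernel_acc[kernel]:.3f}")
```

Comparing the entries of `kernel_acc` gives the material for the short findings paragraph.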
Deliverable:
• Your report, including screenshots of your implementation and the results.