Module 05_Assignment.docx
School: Kennesaw State University
Course: IT 6933
Subject: Computer Science
Date: May 2, 2024
Type: docx (1 page)
Uploaded by BaronApe4233 on coursehero.com
Assignment (100 Points)
IT6933 Machine Learning Technology in FinTech
Module 05: K-Nearest Neighbors & Support Vector Machine

In this homework, we explore Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine models.
1) (50 points) Use the "credit_Dataset.arff" dataset and apply the Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine techniques using the WEKA tool in two different settings:
a. 10-fold cross-validation.
b. 80% training split.
Write a short paragraph about your findings and compare the results (accuracy). Use a table or a bar chart (MS Excel) to visualize the results.
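As an illustrative aside: WEKA is GUI-driven, so no code is required for this task, but the same two evaluation settings can be mirrored in scikit-learn. The sketch below uses a synthetic stand-in dataset (the ARFF file is not reproduced here), so the numbers are not the assignment's expected results — only the workflow is.

```python
# Sketch of task 1 in scikit-learn, on a synthetic stand-in for credit_Dataset.arff.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data; with the real file, scipy.io.arff.loadarff could be used instead.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

# Setting (b): a single 80% training / 20% test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, random_state=42)

results = {}
for name, model in models.items():
    cv_acc = cross_val_score(model, X, y, cv=10).mean()   # setting (a): 10-fold CV
    split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # setting (b): 80/20 split
    results[name] = (cv_acc, split_acc)
    print(f"{name}: 10-fold CV = {cv_acc:.3f}, 80/20 split = {split_acc:.3f}")
```

The `results` dictionary gives the accuracy pairs that the report's comparison table or Excel bar chart would summarize.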
2) (25 points) Use the "credit_Dataset.arff" dataset and apply K-Nearest Neighbors with different K values (1, 3, 5, 15). Visualize the results for the different K values using a bar chart.
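For reference, the K sweep can also be sketched outside WEKA. The snippet below evaluates the four required K values on a synthetic stand-in dataset (an assumption — the real ARFF data is not included here); the resulting per-K accuracies are exactly what the bar chart would plot.

```python
# Sketch of task 2: KNN accuracy for K = 1, 3, 5, 15 (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=42)

k_values = [1, 3, 5, 15]
accuracies = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    for k in k_values
}
for k, acc in accuracies.items():
    print(f"k = {k:2d}: accuracy = {acc:.3f}")

# The bar chart itself could be drawn with matplotlib, e.g.:
#   import matplotlib.pyplot as plt
#   plt.bar([str(k) for k in k_values], list(accuracies.values()))
#   plt.xlabel("K"); plt.ylabel("Accuracy"); plt.show()
```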
3) (25 points) Use the "credit_Dataset.arff" dataset and apply the Support Vector Machine with three different kernels. Write a short paragraph about your findings and compare the results.
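The kernel comparison follows the same pattern. Below is a minimal sketch assuming the three kernels chosen are linear, polynomial, and RBF (the assignment leaves the choice open), again on a synthetic stand-in dataset rather than the real ARFF file.

```python
# Sketch of task 3: SVM with three different kernels (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=42)

kernel_acc = {}
for kernel in ["linear", "poly", "rbf"]:
    kernel_acc[kernel] = cross_val_score(SVC(kernel=kernel), X, y, cv=10).mean()
    print(f"{kernel}: accuracy = {kernel_acc[kernel]:.3f}")
```

Comparing the entries of `kernel_acc` gives the material for the short findings paragraph.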
Deliverable:
• Your report, including screenshots of your implementation and the results.