22576 FOUNDATIONS OF BUSINESS DATA ANALYTICS


Assessment 3 Specification
Loan Approvals Analysis Project
Spring 2024
(Version 25/09/2024)
Task overview
In this assessment, you are provided with a business scenario and a (modified) real-world dataset. You are
required to cleanse the dataset and apply predictive business analytics techniques to it so as to generate a
comprehensive understanding of the relationships between variables in response to identified business
problems. You are required to present these insights in a clear and effective manner to a virtual client (a
large bank). In the absence of a real client, this presentation is not interactive, i.e. it is a recorded video
presentation.
This individual assessment is worth 50% of the subject mark.
Context:
You work for Insight Finance Analytics Pty Ltd, a consulting company specialising in decision support for
banks and other financial service providers. XYZ Bank has approached your company to provide insights into
causal relationships in its dataset and, ultimately, to predict and support loan approvals based on past
demographic data.
This case is about real-world personal loan requests based on a broad range of demographic information,
and request (application) outcomes (approved or not approved). The data file Loan Approvals SPR24.xlsx
contains 614 raw data transactions and an additional worksheet with data variable explanations (a rough
data dictionary) and data cleaning instructions and recommendations (which are not repeated here).
Task 1: Data preparation, analysis, and results (60%)
Task 1.1: ‘Loan Approvals’ – Data preparation and descriptive analysis
Requirements:
1.1.A. Preliminary data exploration, understanding and minor corrections in Excel file:
The starting point for this subtask is the dataset itself, including the data dictionary and preparation
tasks/notes in the second worksheet (Raw_data_dict).
The next suggested step is to filter each column in the raw data to see whether there are blanks
(missing values) or obvious data entry errors; the latter should be corrected in the Excel spreadsheet
before loading the data. The anomaly in the column ‘Dependents’ should also be corrected in the Excel
spreadsheet first, making reasonable assumptions. Finally, add and generate the variables
Income_total and Loan_Income_ratio as explained in the data dictionary tab.
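The derived-variable step can be sketched in pandas as follows; the source column names (Applicant_Income, Coapplicant_Income, Loan_Amount) and the exact ratio formula are assumptions for illustration only, so follow the definitions in the Raw_data_dict tab:

```python
import pandas as pd

# Tiny stand-in for the raw data; in practice load Loan Approvals SPR24.xlsx.
# The column names below are assumptions -- use those in the Raw_data_dict tab.
df = pd.DataFrame({
    "Applicant_Income":   [4000, 3000, 6000],
    "Coapplicant_Income": [1000,    0, 2000],
    "Loan_Amount":        [ 150,  100,  200],
})

# Income_total: combined household income
df["Income_total"] = df["Applicant_Income"] + df["Coapplicant_Income"]

# Loan_Income_ratio: assumed here as loan amount over total income;
# apply the exact formula given in the data dictionary tab instead.
df["Loan_Income_ratio"] = df["Loan_Amount"] / df["Income_total"]
```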
1.1.B. Data preparation for multiple analysis tasks:
After the initial error removal in the source file (1.1.A), you now have to perform the following data
preparation steps in this order:
Missing value handling/transformation;
Data transformation for statistical analysis (e.g. string/cat to numeric, one to many);
Outlier identification and treatment.
For this sub‐task, you have essentially three options:
1) Use KNIME workflows for each analysis task or all tasks in one single workflow;
2) Use only EXCEL or any other tool and load the data for each analysis task;
3) Use a combination of the two options above (e.g. create a KNIME workflow to prepare the data
and write or copy it to Excel files after outlier correction, and then use these for analysis tasks in
Excel and KNIME; or treat missing values and convert categorical variables in Excel and do the
outlier analysis only in KNIME).
Irrespective of the option chosen, you have to report/present on:
the rationale behind the data treatment option chosen for missing values, outliers, variable
conversion, and variable selection for each analysis (if not prescribed), for each variable and each
analysis method type (e.g. predicting continuous vs. categorical variables);
the process/tool(s) used for data preparation;
meaningful descriptive statistics of the treated data.
You can proceed with your own preferences, as long as they produce reasonable data input for
subsequent analysis and you can provide reasonable justifications (in the presentation/video).
Particular care should be taken with outlier correction: run the outlier node and inspect the outlier
identification summary to arrive at a value for ‘k’ that you are comfortable with. Also note that
outlier identification is meaningless for categorical dummy variables (0, 1).
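The three preparation steps can be sketched in pandas on a toy sample; the column names, the mode imputation, and k = 1.5 are illustrative assumptions, not prescribed values (in KNIME, the Missing Value, One to Many and outlier nodes do the equivalent):

```python
import pandas as pd

# Toy stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "Married":   ["Yes", "No", None, "Yes"],
    "Income_pm": [4000.0, 3000.0, 50000.0, 3500.0],
})

# 1) Missing value handling: impute the mode for a categorical variable
df["Married"] = df["Married"].fillna(df["Married"].mode()[0])

# 2) Transformation: string/categorical to a 0/1 dummy variable
df["Married_Yes"] = (df["Married"] == "Yes").astype(int)

# 3) Outliers: IQR rule with factor k; values outside
#    [Q1 - k*IQR, Q3 + k*IQR] are capped to the nearest bound.
k = 1.5
q1, q3 = df["Income_pm"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Income_pm"] = df["Income_pm"].clip(q1 - k * iqr, q3 + k * iqr)
```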
1.1.C. Preliminary Descriptive Analysis for Better Data Understanding:
In Excel, you may use the Descriptive Statistics Add‐on (in Data Analysis), various charts and the
Correlation Add‐on for data understanding.
In KNIME, use the Scatter Plot, Linear Correlation and Statistics nodes to gain further insights into
base patterns in the data and base relationships.
In your presentation, include any potentially interesting findings.
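For orientation, the Excel add-ons and KNIME nodes above correspond to something like the following pandas sketch (synthetic stand-in numbers and hypothetical column names):

```python
import pandas as pd

# Synthetic stand-in data; the column names are assumptions.
df = pd.DataFrame({
    "Income_pm":   [4000, 3000, 5200, 3500, 6100],
    "Loan_Amount": [ 150,  100,  210,  120,  240],
})

summary = df.describe()  # count, mean, std, min, quartiles, max
corr = df.corr()         # Pearson correlation matrix
```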
Task 1.2: ‘Loan Approvals’ – Predictive Analysis
General Instructions (important!):
‐ All analysis is to be based on the data prepared in Task 1.1, i.e. the data output by the final node of your
workflow up to and including outlier correction.
‐ Unless otherwise indicated or not possible, build your workflows and configure your nodes in line with the
ongoing in-class demonstration exercises.
‐ As for the KNIME workflow tasks, you can model all tasks in one single workflow, or create separate
workflows for each sub-task or combinations thereof.
‐ All categorical predictor models (1.2.B and 1.2.C) have THREE important commonalities:
Y = Loan_Approval (Yes); X1-X5 = GRADUATE, SELF-EMPLOYED, MARRIED, PRIOR_CREDIT, and
LOAN_INCOME_RATIO.
Data partitioning is set at 50% each, selected randomly based on random seed 12345.
Both methods have to be tested for performance using an appropriate ROC curve chart and
confusion matrix, and the key performance indicators: area under the (ROC) curve, accuracy, Cohen’s
Kappa (k), and class 0 and class 1 errors. In addition, each method requires specific analysis as specified
below.
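To make the indicators concrete, here is how accuracy, Cohen's Kappa and the per-class errors fall out of a confusion matrix; the counts are invented purely for illustration (KNIME's Scorer node reports these figures directly):

```python
# Confusion matrix for class 1 = approved (invented counts):
#                 predicted 0   predicted 1
# actual 0 (No)       tn = 40       fp = 20
# actual 1 (Yes)      fn = 10       tp = 237
tn, fp, fn, tp = 40, 20, 10, 237
n = tn + fp + fn + tp

accuracy = (tp + tn) / n

# Cohen's Kappa: agreement beyond what chance alone would produce
p_o = accuracy
p_e = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

# Per-class error rates
class0_error = fp / (tn + fp)   # actual 0 misclassified as 1
class1_error = fn / (fn + tp)   # actual 1 misclassified as 0
```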
Requirements:
1.2.A. Correlations and Linear Regression with Excel and KNIME:
i. Linear Regression Analysis – Building Models
Perform linear regression analysis on the appropriately prepared (see 1.1 above) full sample using the
Excel Data Analysis tool, with dependent variable Y = INCOME_PM and independent variables X1
= GRADUATE, X2 = SELF-EMPLOYED and X3 = GENDER (only ‘male’, ‘other’), with otherwise standard
settings, but using ‘Labels’.
ii. Verify the regression results in KNIME using the ‘Linear Regression Learner’ node and the same treated
data as per i) above, and write the output of port 2 to an Excel file using the ‘Excel Writer’ node.
iii. Provide additional linear regression analysis in KNIME by partitioning the sample randomly (seed
12345) into a 40% training set and 60% validation set, and by adding a further ‘Regression Predictor’
and ‘Numeric Scorer’ node to the KNIME model/workflow. Write the output of the Numeric Scorer node
to an Excel file.
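The regression with dummy predictors can be mimicked with ordinary least squares on synthetic data; the simulated coefficients (3000, 800, 200, 100) are arbitrary assumptions, chosen only to show how a dummy's coefficient reads as the incremental change in the dependent variable:

```python
import numpy as np

# Synthetic stand-in data: three 0/1 dummies, as in 1.2.A i.
rng = np.random.default_rng(12345)
n = 200
graduate      = rng.integers(0, 2, n)
self_employed = rng.integers(0, 2, n)
male          = rng.integers(0, 2, n)

# Simulated 'true' relationship (an assumption for this demo only)
income_pm = (3000 + 800 * graduate + 200 * self_employed
             + 100 * male + rng.normal(0, 50, n))

# Ordinary least squares: column of ones gives the intercept
X = np.column_stack([np.ones(n), graduate, self_employed, male])
beta, *_ = np.linalg.lstsq(X, income_pm, rcond=None)
# beta[1] estimates the incremental income for graduates vs. non-graduates
```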
iv. Linear Regression Analysis – Analysis: Answer the following questions:
1. Do ‘graduate’, ‘self-employed’ or ‘male’ loan applicants have a higher INCOME_PM than the rest?
How much, in terms of incremental change, for each of the independent variables?
2. Which independent variables in the two regression models are statistically significant in terms of
mainstream conventions?
3. Why are the linear regression results in the KNIME workflow(s) so different from each other?
4. How would you rate the two KNIME regression results in terms of explaining variation in
INCOME_PM?
5. Does the model developed in the regression learner in iii) do well in validation? Why (not)?
1.2.B. Logistic Regression with KNIME:
i. Logistic Regression Model:
Build and execute a Logistic Regression (LR) model based on the General Instructions above and with
reference to the LR models built in class.
ii. LR Analysis:
1. What is the regression equation generated by the LR Learner?
2. Assess the ‘Coefficients and Statistics’ of the LR Learner: Would you consider dropping a variable (or
variables) from the model? Why or why not?
3. Generate all performance indicators as required in the General Instructions above.
4. Interpret the performance indicators and make conclusions about the performance of the model.
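A minimal scikit-learn sketch of the 1.2.B setup, with synthetic stand-in data in place of the treated loan data (the 50/50 partition and seed 12345 follow the General Instructions; the data generator is an assumption for the demo):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 614 rows, 5 predictors (as X1-X5 in the spec)
X, y = make_classification(n_samples=614, n_features=5, random_state=12345)

# 50/50 random partition with seed 12345, as specified
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=12345)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Performance indicators on the held-out half
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
cm = confusion_matrix(y_te, model.predict(X_te))
```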
1.2.C. Decision Tree (CART) with KNIME:
i. Decision Tree Model:
Build a Decision Tree (DT) model based on the General Instructions above and using the DT Learner
settings used in class (Gain Ratio, no pruning, no reduced error pruning, etc.), with one exception: the
‘minimum number of records per node’ should be 50. Ensure you include a ‘Column Filter’ before
partitioning, and that you only include the predictors as specified in the General Instructions above.
ii. DT Analysis:
1. Generate all performance indicators (PIs) as required in the General Instructions above.
2. Interpret the PIs and draw conclusions about the performance of the model.
3. Compare the PIs of the LR and DT models. Which one do you think is superior? Why?
4. Analyse the Decision Tree:
Which variables are used in the tree to generate branches? Which branch in the tree delivers the purest
sub-sample?
5. Apply the DT:
If a married loan applicant has a Loan_Income_Ratio of 4% and had a prior credit, what is the
likelihood that his/her loan is approved (Loan_Approval = Yes)?
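A scikit-learn sketch of the 1.2.C tree on synthetic stand-in data; note that sklearn has no Gain Ratio criterion ('entropy' is the closest) and that min_samples_leaf only approximates KNIME's 'minimum number of records per node' = 50, so treat this as an analogy rather than a replica:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in: 614 rows, 5 predictors (as X1-X5 in the spec)
X, y = make_classification(n_samples=614, n_features=5, random_state=12345)

# 'entropy' stands in for Gain Ratio; every leaf keeps >= 50 records
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=50,
                              random_state=12345).fit(X, y)

# The printed rules show which variables branch and which leaf is purest
rules = export_text(tree, feature_names=[
    "GRADUATE", "SELF_EMPLOYED", "MARRIED", "PRIOR_CREDIT",
    "LOAN_INCOME_RATIO"])
```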
1.2.D. Sensitivity Analysis:
i. For Linear Regression:
Experimenting with different sets of independent variables from the whole dataset, can you find a better
regression model to predict Income_pm? If so, argue why you think it is better.
ii. For LR:
Experimenting with different sets of independent variables from the whole dataset, can you find a better LR
model to predict Loan_Approval? If so, argue why you think it is better.
iii. For DT:
Experimenting with different configuration settings for the Decision Tree Learner, can you find a better
model to predict Loan_Approval? If so, argue why you think it is better.
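The sensitivity analysis amounts to rerunning the workflow per setting and comparing validation scores, sketched here by varying one tree parameter on synthetic stand-in data (the parameter values and data generator are assumptions; with the real data, compare the KNIME Scorer outputs per configuration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, partitioned as in the General Instructions
X, y = make_classification(n_samples=614, n_features=5, random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=12345)

# Vary the minimum records per leaf and record validation accuracy
scores = {}
for leaf in (10, 25, 50, 100):
    model = DecisionTreeClassifier(min_samples_leaf=leaf,
                                   random_state=12345).fit(X_tr, y_tr)
    scores[leaf] = model.score(X_te, y_te)

best = max(scores, key=scores.get)  # setting with the highest accuracy
```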
Task 2: Presentation (Video) (40%)
As per the ‘Task Overview’, you are required to present the insights from your descriptive and predictive
analysis in a clear and effective manner to a virtual client (a large bank), not your lecturer. You have to tell
the story of the insights you can provide. The client will be primarily interested in the results and findings,
and not so much in how you arrived at them, although still in assumptions and potential limitations. The
client won’t be interested in underlying processes and software details, let alone things like KNIME node
configurations!
Requirement:
Using the data provided and all the steps required in Task 1, provide a 4 to 5-minute video presentation
focussing on findings and results rather than how you arrived at them (except for assumptions made,
including the treatment of missing values and numeric outliers, and limitations of the analytics processes
and methods). Obviously, the questions asked under Task 1.2 are a strong indicator of what you have to
mention/address in your presentation!
If your sensitivity analysis in Task 1.2.D has better results than the default instructions, present the better
results and the corresponding underlying parameters.
Recommended Procedure:
‐ Once you have completed Task 1, sketch your ‘storyline’ in bullet points, e.g. in Word or PowerPoint.
‐ Gather all the Output from your analysis in Excel and KNIME to support this storyline.
‐ Make a ‘conventional’ PowerPoint presentation by translating your storyline and filling it with
visualisations (tables, charts, etc.). Going through that presentation should not take more than 4‐5
minutes!
‐ Record the PowerPoint presentation with camera showing you as presenter and convert it to
a video. If you stop here, ensure that you meet the base requirement that you are visible as presenter at
least 30 percent of the video time.
‐ Optional: Enhance your base video using alternative screen recording tools or editing tools, e.g.
here (external site).
‐ You may use AI‐generative tools to support the creation of your video, as long as you acknowledge them
at the beginning or the end of the video.
Deliverables:
1. The video presentation (see above) in a mainstream video format (e.g. MP4).
2. An Excel spreadsheet containing separate worksheets for the following treated datasets or data
outputs of KNIME analysis using the Excel‐Writer node to extract the output for the first three:
‐ Output of sub‐task 1.1.A;
‐ Output of sub‐task 1.1.B;
‐ Output of sub‐task 1.2.A ii) and iii)
‐ All ‘Scorer’ results and ROC curves.
Submission of assignments
The two assignment files are to be submitted in the assignments section of CANVAS.
