Midterm Exam Questions and Requirements for International Students
1. Discuss the difference between a hunch-driven and a data-driven decision. (Article 1) (5 points)
2. What is the challenge of implementing the long tail strategy, and what is the right way to do it? (Article 4) (5 points)
3. Why does Target want to know if a woman is pregnant? And how did Target do it? (Article 2) (6 points)
4. Why do many companies’ NBO strategies fail? (Article 6) (5 points)
5. What are the advantages and disadvantages of using social media for product/service promotion? (Article 7) (7 points)
6. Why are “large, diverse crowds of independent thinking people better at predicting the future or solving a problem than the brightest experts among them?” (Article 8) (7 points)
7. Given the following table, what is the probability that a large company previously charged with illegal accounting activity is fraudulent? (5 points) (Lecture 6)
8. What is the difference between classification and prediction in the context of data mining? Why do we need to partition data for supervised learning? (Lecture 3) (6 points)
9. Discuss the difference between supervised and unsupervised learning. Give one example of a business application of supervised learning and one of unsupervised learning. (6 points)
10. Consider the following series of business transactions: (Lecture 8)
Transaction 1: involves items A and D
Transaction 2: involves item A
Transaction 3: involves items A, C and D
Transaction 4: involves items B and D
Q1. List all item combinations and their support (in percent). (6 points)
Q2. List all possible rules (in the form {X} -> {Y} meaning if set {X} is purchased then set {Y} is also purchased) and their confidence. Note that {X} -> {Y} and {Y} -> {X} are two different rules. (6 points)
Q3. What is the lift ratio for the rule {B} -> {D}? Briefly interpret it. (6 points)
Q4. What is the lift ratio for the rule {A} -> {D}? Briefly interpret it. (6 points)
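The measures named in Q1 to Q4 can be sketched in a few lines of Python. This is a minimal illustration under the usual textbook definitions (support as the share of transactions containing an item set, confidence as support of X and Y together divided by support of X, lift as confidence divided by support of Y). The function names `support`, `confidence`, and `lift` are chosen here for illustration and are not taken from the course material.

```python
from itertools import combinations

# The four transactions listed in question 10.
transactions = [
    {"A", "D"},
    {"A"},
    {"A", "C", "D"},
    {"B", "D"},
]

def support(itemset):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Support of X and Y together, divided by the support of X."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of X -> Y divided by the benchmark support of Y."""
    return confidence(x, y) / support(y)

# Enumerate every item combination that occurs in at least one transaction.
items = sorted(set().union(*transactions))
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(combo)
        if s > 0:
            print(", ".join(combo), f"support = {s:.0%}")
```

A lift above 1 suggests that X and Y occur together more often than they would if they were independent, and a lift below 1 suggests the opposite. With only four transactions, any such reading is tentative.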
11. The German Credit data set (available on Blackboard) contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). New applicants for credit can also be evaluated on these 30 "predictor" variables. We want to develop a credit scoring model that can be used to determine whether a new applicant is a good credit risk or a bad credit risk, based on values for one or more of the predictor variables. All the variables are explained in Table 1.1. (The data has been organized in the spreadsheet GermanCredit.xls.)
Table 1.1: Variables for the German Credit data
Var. # Variable Name Description Variable Type Code Description
1 OBS# Observation No. Categorical Sequence Number in data set
2 CHK_ACCT Checking account status Categorical 0: < 0 DM
1: 0 <= ... < 200 DM
2: >= 200 DM
3: no checking account
3 DURATION Duration of credit in months Numerical
4 HISTORY Credit history Categorical 0: no credits taken
1: all credits at this bank paid back duly
2: existing credits paid back duly till now
3: delay in paying off in the past
4: critical account
5 NEW_CAR Purpose of credit Binary car (new) 0: No, 1: Yes
6 USED_CAR Purpose of credit Binary car (used) 0: No, 1: Yes
7 FURNITURE Purpose of credit Binary furniture/equipment 0: No, 1: Yes
8 RADIO/TV Purpose of credit Binary radio/television 0: No, 1: Yes
9 EDUCATION Purpose of credit Binary education 0: No, 1: Yes
10 RETRAINING Purpose of credit Binary retraining 0: No, 1: Yes
11 AMOUNT Credit amount Numerical
12 SAV_ACCT Average balance in savings account Categorical 0: < 100 DM
1: 100 <= ... < 500 DM
2: 500 <= ... < 1000 DM
3: >= 1000 DM
4: unknown / no savings account
13 EMPLOYMENT Present employment since Categorical 0 : unemployed
1: < 1 year
14 INSTALL_RATE Installment rate as % of disposable income Numerical
15 MALE_DIV Applicant is male and divorced Binary 0: No, 1:Yes
16 MALE_SINGLE Applicant is male and single Binary 0: No, 1:Yes
17 MALE_MAR_WID Applicant is male and married or a widower Binary 0: No, 1:Yes
18 CO-APPLICANT Application has a co-applicant Binary 0: No, 1:Yes
19 GUARANTOR Applicant has a guarantor Binary 0: No, 1:Yes
20 PRESENT_RESIDENT Present resident since - years Categorical 0: <= 1 year
1: 1 < ... <= 2 years
2: 2 < ... <= 3 years
3: > 4 years
21 REAL_ESTATE Applicant owns real estate Binary 0: No, 1:Yes
22 PROP_UNKN_NONE Applicant owns no property (or unknown) Binary 0: No, 1:Yes
23 AGE Age in years Numerical
24 OTHER_INSTALL Applicant has other installment plan credit Binary 0: No, 1:Yes
25 RENT Applicant rents Binary 0: No, 1:Yes
26 OWN_RES Applicant owns residence Binary 0: No, 1:Yes
27 NUM_CREDITS Number of existing credits at this bank Numerical
28 JOB Nature of job Categorical 0 : unemployed/ unskilled - non-resident
1 : unskilled - resident
2 : skilled employee / official
3 : management/ self-employed/highly qualified employee/ officer
29 NUM_DEPENDENTS Number of people for whom liable to provide maintenance Numerical
30 TELEPHONE Applicant has phone in his or her name Binary 0: No, 1:Yes
31 FOREIGN Foreign worker Binary 0: No, 1:Yes
32 RESPONSE Credit rating is good Binary 0: No, 1:Yes
Table 1.2, below, shows the values of these variables for the first several records in the case.
Table 1.2 The data (first several rows)
The consequences of misclassification have been assessed as follows: the cost of a false positive (incorrectly saying an applicant is a good credit risk) is 500 DM, while the cost of a false negative (incorrectly saying an applicant is a bad credit risk) is 100 DM. This can be summarized in the following table.
Table 1.3 Opportunity Cost Table (in Deutsche Marks)
            Predicted (Decision)
Actual      Good (Accept)    Bad (Reject)
Good        0                100 DM
Bad         500 DM           0
The opportunity cost table was derived from the average net profit per loan as shown below:
Table 1.4 Average Net Profit
            Predicted (Decision)
Actual      Good (Accept)    Bad (Reject)
Good        100 DM           0
Bad         -500 DM          0
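The link between the two tables can be checked with a short Python sketch: the opportunity cost of a set of decisions is the gap between the net profit of perfect decisions and the net profit actually earned. The counts in `example` are hypothetical and for illustration only; they are not results from GermanCredit.xls. The names `COST`, `PROFIT`, and `total` are likewise illustrative.

```python
# Values in DM, keyed by (actual, predicted), copied from Tables 1.3 and 1.4.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 100,
    ("bad", "good"): 500,
    ("bad", "bad"): 0,
}
PROFIT = {
    ("good", "good"): 100,
    ("good", "bad"): 0,
    ("bad", "good"): -500,
    ("bad", "bad"): 0,
}

def total(matrix, values):
    """Weight each cell count of a classification matrix by its DM value."""
    return sum(values[cell] * count for cell, count in matrix.items())

# Hypothetical classification matrix (counts of applicants), illustration only.
example = {
    ("good", "good"): 250,
    ("good", "bad"): 30,
    ("bad", "good"): 60,
    ("bad", "bad"): 60,
}

cost = total(example, COST)        # 30 * 100 + 60 * 500
profit = total(example, PROFIT)    # 250 * 100 - 60 * 500

# Profit if every good applicant were accepted and every bad one rejected.
n_good = example[("good", "good")] + example[("good", "bad")]
ideal_profit = n_good * PROFIT[("good", "good")]

print(cost, profit, ideal_profit - cost)
```

Because accepting a bad applicant costs five times as much as rejecting a good one, two models with the same overall error rate can have very different total costs.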
Tasks
1. Use the ‘GermanCredit.xls’ file and all variables to develop a Logistic Regression classification model. Create a classification matrix for this model. (6 points)
2. Use the ‘GermanCredit.xls’ file and select ten variables to develop a Logistic Regression classification model. Create a classification matrix for this model. (6 points)
3. The classification matrix reports three types of accuracy that measure the performance of the model. Based on the opportunity costs and net profits given in Tables 1.3 and 1.4, indicate which accuracy measure is the most important one in this context. Comment on these models, indicating the outputs and measurements you would use to judge their performance. (6 points)
4. If you want to select 275 customers from the validation data set, which model would you adopt for credit rating? Why? (6 points)
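One possible way to compare models for a fixed number of selected customers is to rank the validation records by each model's predicted probability of good credit and sum the Table 1.4 net profit over the top n. The sketch below assumes each model's output is available as (predicted probability, actually good) pairs; the function name `profit_of_top_n` and the two small score lists are hypothetical and for illustration only.

```python
def profit_of_top_n(scored, n, gain_good=100, loss_bad=-500):
    """Accept the n applicants with the highest predicted probability of
    good credit and return the net profit in DM (values from Table 1.4)."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return sum(gain_good if is_good else loss_bad for _, is_good in ranked[:n])

# Hypothetical (predicted probability, actually good?) pairs, illustration only.
model_a = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, False)]
model_b = [(0.92, True), (0.85, True), (0.75, True), (0.55, False), (0.20, False)]

print(profit_of_top_n(model_a, 3))  # 100 + 100 - 500
print(profit_of_top_n(model_b, 3))  # 100 + 100 + 100
```

With real validation data, n would be set to 275, and the model with the higher net profit among its top-ranked applicants would be the natural candidate.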