title: Create R models with RevoScaleR
description: Learn how to create a linear regression model to analyze the data that you enriched in a previous tutorial.
author: VanMSFT
ms.author: vanto
ms.date: 11/27/2018
ms.service: sql
ms.subservice: machine-learning-services
ms.topic: tutorial
monikerRange: >=sql-server-2016||>=sql-server-linux-ver15

Create R models (SQL Server and RevoScaleR tutorial)

[!INCLUDE SQL Server 2016 and later]

This is tutorial 7 of the RevoScaleR tutorial series on how to use RevoScaleR functions with SQL Server.

You have enriched the training data. In this tutorial, you'll analyze the data using regression modeling. Linear models are an important tool in the world of predictive analytics. The RevoScaleR package includes regression algorithms that can subdivide the workload and run it in parallel.

[!div class="checklist"]

  • Create a linear regression model
  • Create a logistic regression model

Create a linear regression model

In this step, create a simple linear model that estimates customers' credit card balance, using the values in the gender and creditLine columns as independent variables.

To do this, use the rxLinMod function, which supports remote compute contexts.

  1. Create an R variable to store the completed model, and call rxLinMod, passing an appropriate formula.

    linModObj <- rxLinMod(balance ~ gender + creditLine, data = sqlFraudDS)
  2. To view a summary of the results, call the standard R summary function on the model object.

    summary(linModObj)

You might think it peculiar that a plain R function like summary would work here, since in the previous step, you set the compute context to the server. However, even when the rxLinMod function uses the remote compute context to create the model, it also returns an object that contains the model to your local workstation, and stores it in the shared directory.

Therefore, you can run standard R commands against the model just as if it had been created using the "local" context.
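For readers without a SQL Server compute context at hand, the same fit-then-inspect pattern can be sketched in base R. This is an illustration only: `lm` stands in for `rxLinMod`, and the `fraud` data frame is synthetic, with column names borrowed from the tutorial and made-up values.

```r
# Synthetic stand-in for the tutorial's data (assumption: not the real sqlFraudDS).
set.seed(42)
n <- 1000
fraud <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  creditLine = sample(1:30, n, replace = TRUE)
)
fraud$balance <- 3000 + 95 * fraud$creditLine + rnorm(n, sd = 500)

# Base R lm() plays the role of rxLinMod() in this local sketch.
localMod <- lm(balance ~ gender + creditLine, data = fraud)

# Standard R functions operate directly on the returned model object.
print(summary(localMod))
slope <- coef(localMod)[["creditLine"]]
```

The point of the sketch is only that `summary` and `coef` act on a model object held in the local R session, which is also what `rxLinMod` hands back after a remote fit.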

Results

    Linear Regression Results for: balance ~ gender + creditLine
    Data: sqlFraudDS (RxSqlServerData Data Source)
    Dependent variable(s): balance
    Total independent variables: 4 (Including number dropped: 1)
    Number of valid observations: 10000
    Number of missing observations: 0

    Coefficients: (1 not defined because of singularities)
                   Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)    3253.575      71.194   45.700  2.22e-16
    gender=Male     -88.813      78.360   -1.133     0.257
    gender=Female   Dropped     Dropped  Dropped   Dropped
    creditLine       95.379       3.862   24.694  2.22e-16

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 3812 on 9997 degrees of freedom
    Multiple R-squared: 0.05765
    Adjusted R-squared: 0.05746
    F-statistic: 305.8 on 2 and 9997 DF, p-value: < 2.2e-16
    Condition number: 1.0184
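To read the coefficients in the results above, it can help to plug them into the model equation by hand. The sketch below copies the three estimates from the summary and evaluates them for a hypothetical customer with a creditLine value of 10; the helper `estimateBalance` and that input value are illustrative assumptions, not part of the tutorial.

```r
# Coefficients copied from the rxLinMod summary shown in the results.
intercept <- 3253.575
genderMale <- -88.813      # gender=Female is the dropped baseline, so its effect is 0
creditLineSlope <- 95.379

# Hypothetical helper: estimated balance for a given gender and creditLine.
estimateBalance <- function(isMale, creditLine) {
  intercept + genderMale * isMale + creditLineSlope * creditLine
}

maleEstimate <- estimateBalance(isMale = 1, creditLine = 10)
femaleEstimate <- estimateBalance(isMale = 0, creditLine = 10)
print(c(male = maleEstimate, female = femaleEstimate))
```

At the same creditLine, the two estimates differ by exactly the gender=Male coefficient. Note that this coefficient has a p-value of 0.257 in the summary, so the difference is not statistically significant at conventional levels.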

Create a logistic regression model

Next, create a logistic regression model that indicates whether a particular customer is a fraud risk. You'll use the RevoScaleR rxLogit function, which supports fitting logistic regression models in remote compute contexts.

Keep the compute context as is, and continue to use the same data source.

  1. Call the rxLogit function and pass the formula needed to define the model.

    logitObj <- rxLogit(fraudRisk ~ state + gender + cardholder + balance + numTrans + numIntlTrans + creditLine, data = sqlFraudDS, dropFirst = TRUE)

    Because this is a large model with 60 independent variables (including three dummy variables that are dropped), you might have to wait some time for the compute context to return the object.

    The reason the model is so large is that, in R and in the RevoScaleR package, every level of a categorical factor variable is automatically treated as a separate dummy variable.

  2. To view a summary of the returned model, call the R summary function.

    summary(logitObj)
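The dummy-variable expansion described in step 1 can be seen on a small scale in base R. The sketch below uses a toy three-level factor (hypothetical values, not the tutorial's data) and `model.matrix` to show one indicator column per level, and then the coding with the first level dropped as the baseline, which is comparable in spirit to what `dropFirst = TRUE` requests from rxLogit.

```r
# Toy factor with three levels (hypothetical values, not the tutorial's data).
state <- factor(c("AK", "AL", "AZ", "AL", "AK"))

# Full indicator coding: one dummy column per level, no intercept.
allDummies <- model.matrix(~ state - 1)
print(colnames(allDummies))

# With an intercept, base R treats the first level as the baseline and omits its column.
withBaseline <- model.matrix(~ state)
print(colnames(withBaseline))
```

A factor with many levels, such as a state column, therefore contributes many columns to the design matrix, which is why the variable count in the fitted model is much larger than the number of terms in the formula.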

Partial results

    Logistic Regression Results for: fraudRisk ~ state + gender + cardholder + balance + numTrans + numIntlTrans + creditLine
    Data: sqlFraudDS (RxSqlServerData Data Source)
    Dependent variable(s): fraudRisk
    Total independent variables: 60 (Including number dropped: 3)
    Number of valid observations: 10000
    -2*LogLikelihood: 2032.8699 (Residual deviance on 9943 degrees of freedom)

    Coefficients:
                            Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)           -8.627e+00   1.319e+00   -6.538  6.22e-11
    state=AK                 Dropped     Dropped  Dropped   Dropped
    state=AL              -1.043e+00   1.383e+00   -0.754    0.4511
    (other states omitted)
    gender=Male              Dropped     Dropped  Dropped   Dropped
    gender=Female          7.226e-01   1.217e-01    5.936  2.92e-09
    cardholder=Principal     Dropped     Dropped  Dropped   Dropped
    cardholder=Secondary   5.635e-01   3.403e-01    1.656    0.0977
    balance                3.962e-04   1.564e-05   25.335  2.22e-16
    numTrans               4.950e-02   2.202e-03   22.477  2.22e-16
    numIntlTrans           3.414e-02   5.318e-03    6.420  1.36e-10
    creditLine             1.042e-01   4.705e-03   22.153  2.22e-16

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Condition number of final variance-covariance matrix: 3997.308
    Number of iterations: 15
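Logistic regression coefficients are on the log-odds scale, so they are often easier to interpret after exponentiation. The sketch below copies a few estimates from the partial results and converts them to odds ratios; the vector name `logitCoefs` and the choice of coefficients are illustrative.

```r
# Coefficients copied from the partial rxLogit summary (log-odds scale).
logitCoefs <- c(
  genderFemale = 7.226e-01,
  cardholderSecondary = 5.635e-01,
  numTrans = 4.950e-02,
  numIntlTrans = 3.414e-02
)

# exp() converts a log-odds coefficient into an odds ratio.
oddsRatios <- exp(logitCoefs)
print(round(oddsRatios, 3))
```

An odds ratio above 1 means the estimated odds of fraudRisk rise with that variable, holding the others fixed. Each categorical ratio is relative to the dropped baseline level (gender=Male and cardholder=Principal in this output).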

Next steps

[!div class="nextstepaction"] Score new data
