title	description	author	ms.author	ms.date	ms.service	ms.subservice	ms.topic	monikerRange
Modify SQL data using RevoScaleR	Learn how to query and modify data using the R language on SQL Server, specifically the RevoScaleR function.	VanMSFT	vanto	11/27/2018	sql	machine-learning-services	tutorial	>=sql-server-2016\|\|>=sql-server-linux-ver15

Query and modify the SQL Server data (SQL Server and RevoScaleR tutorial)

[!INCLUDE SQL Server 2016 and later]

This is tutorial 3 of the RevoScaleR tutorial series on how to use RevoScaleR functions with SQL Server.

In the previous tutorial, you loaded the data into SQL Server. In this tutorial, you can explore and modify data using RevoScaleR:

[!div class="checklist"]
- Return basic information about the variables
- Create categorical data from raw data

Categorical data, or factor variables, are useful for exploratory data visualizations. You can use them as inputs to histograms to get an idea of what variable data looks like.
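As a rough illustration of why factors help with exploration, the base R sketch below converts a small made-up vector of integer codes to a factor and tabulates it. The vector `genderCodes` and its values are hypothetical and are not drawn from the tutorial's table; RevoScaleR is not required to run it.

```r
# Hypothetical sample of integer-coded values (not the tutorial data).
genderCodes <- c(1L, 2L, 2L, 1L, 1L, 2L, 1L)

# Convert the codes to a factor with readable labels.
genderFactor <- factor(genderCodes, levels = c(1L, 2L), labels = c("Male", "Female"))

# Per-category counts: these are the bar heights a histogram of the factor would draw.
genderCounts <- table(genderFactor)
print(genderCounts)
```

Because the categories carry labels, the counts can be read directly without looking up what each integer code means.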

Query for columns and types

Use an R IDE or RGui.exe to run the R script.

First, get a list of the columns and their data types. You can use the function rxGetVarInfo and specify the data source you want to analyze. Depending on your version of RevoScaleR, you could also use rxGetVarNames.

```
rxGetVarInfo(data=sqlFraudDS)
```

Results

```
Var 1: custID, Type: integer
Var 2: gender, Type: integer
Var 3: state, Type: integer
Var 4: cardholder, Type: integer
Var 5: balance, Type: integer
Var 6: numTrans, Type: integer
Var 7: numIntlTrans, Type: integer
Var 8: creditLine, Type: integer
Var 9: fraudRisk, Type: integer
```

Create categorical data

All the variables are stored as integers, but some variables represent categorical data, called factor variables in R. For example, the column state contains numbers used as identifiers for the 50 states plus the District of Columbia. To make it easier to understand the data, you replace the numbers with a list of state abbreviations.

In this step, you create a string vector containing the abbreviations, and then map these categorical values to the original integer identifiers. Then you use the new variable in the colInfo argument to specify that this column be handled as a factor. Whenever you analyze the data or move it, the abbreviations are used and the column is handled as a factor.

Mapping the column to abbreviations before using it as a factor also improves performance. For more information, see R and data optimization.

Begin by creating an R variable, stateAbb, and defining the vector of strings to add to it, as follows.

```
stateAbb <- c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL",
              "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA",
              "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NB", "NC", "ND",
              "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI",
              "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV",
              "WY")
```
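Because the labels are matched to the integer identifiers by position, it can be worth checking a label vector before using it. The following is a minimal sketch in base R; `checkLabels` is a hypothetical helper written for this illustration, not a RevoScaleR function, and the sample vectors are made up.

```r
# Hypothetical helper (not part of RevoScaleR): returns TRUE when a label
# vector has the expected length, no duplicates, and no empty strings.
checkLabels <- function(labels, expectedCount) {
  stopifnot(is.character(labels))
  length(labels) == expectedCount &&
    !anyDuplicated(labels) &&
    all(nzchar(labels))
}

# Small made-up examples.
okResult  <- checkLabels(c("AK", "AL", "AR"), 3)  # right length, all unique
dupResult <- checkLabels(c("AK", "AL", "AL"), 3)  # contains a duplicate label
print(c(ok = okResult, duplicate = dupResult))
```

For the vector defined above, the corresponding call would be `checkLabels(stateAbb, 51)`.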

Next, create a column information object, named ccColInfo, that specifies the mapping of the existing integer values to the categorical levels (the abbreviations for states).

This statement also creates factor variables for gender and cardholder.

```
ccColInfo <- list(
    gender = list(
        type = "factor",
        levels = c("1", "2"),
        newLevels = c("Male", "Female")
    ),
    cardholder = list(
        type = "factor",
        levels = c("1", "2"),
        newLevels = c("Principal", "Secondary")
    ),
    state = list(
        type = "factor",
        levels = as.character(1:51),
        newLevels = stateAbb
    ),
    balance = list(type = "numeric")
)
```
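The effect of a `levels` and `newLevels` pair can be approximated locally in base R. The sketch below is an analogy under stated assumptions, not a description of what RevoScaleR does internally: `cardholderCodes` is a hypothetical vector standing in for the cardholder column, and base R's `factor()` plays the role of the column mapping.

```r
# Hypothetical integer codes, standing in for a column such as cardholder.
cardholderCodes <- c(1L, 2L, 1L, 1L, 2L)

# Base R analog of levels = c("1", "2"), newLevels = c("Principal", "Secondary"):
# match the codes (as strings) against the stored levels, then relabel them.
cardholderFactor <- factor(
  as.character(cardholderCodes),
  levels = c("1", "2"),
  labels = c("Principal", "Secondary")
)

# The underlying integer codes are unchanged; only the labels differ.
print(as.integer(cardholderFactor))
print(levels(cardholderFactor))
```

The stored values stay as integers, and only the labels attached to them change.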

To create the SQL Server data source that uses the updated data, call the RxSqlServerData function as before, but add the colInfo argument.
```
sqlFraudDS<- RxSqlServerData(connectionString=sqlConnString, table=sqlFraudTable, colInfo=ccColInfo, rowsPerRead=sqlRowsPerRead)
```
- For the table parameter, pass in the variable sqlFraudTable, which contains the name of the table you created earlier.
- For the colInfo parameter, pass in the ccColInfo variable, which contains the column data types and factor levels.

You can now use the function rxGetVarInfo to view the variables in the new data source.

```
rxGetVarInfo(data=sqlFraudDS)
```

Results

```
Var 1: custID, Type: integer
Var 2: gender 2 factor levels: Male Female
Var 3: state 51 factor levels: AK AL AR AZ CA ... VT WA WI WV WY
Var 4: cardholder 2 factor levels: Principal Secondary
Var 5: balance, Type: integer
Var 6: numTrans, Type: integer
Var 7: numIntlTrans, Type: integer
Var 8: creditLine, Type: integer
Var 9: fraudRisk, Type: integer
```

Now the three variables you specified (gender, state, and cardholder) are treated as factors.

Next steps

[!div class="nextstepaction"] Define and use compute contexts
