Skip to content

Latest commit

 

History

History
249 lines (188 loc) · 13.8 KB

walkthrough-create-data-features.md

File metadata and controls

249 lines (188 loc) · 13.8 KB
titledescriptionauthorms.authorms.datems.servicems.subservicems.topicmonikerRange
R tutorial: Feature engineering
Tutorial showing how to create data features using SQL Server functions for in-database analytics.
VanMSFT
vanto
11/26/2018
sql
machine-learning-services
tutorial
>=sql-server-2016||>=sql-server-linux-ver15

Create data features using R and SQL Server (walkthrough)

[!INCLUDE SQL Server 2016]

Data engineering is an important part of machine learning. Data often requires transformation before you can use it for predictive modeling. If the data does not have the features you need, you can engineer them from existing values.

For this modeling task, rather than using the raw latitude and longitude values of the pickup and drop-off location, you'd like to have the distance in miles between the two locations. To create this feature, you compute the direct linear distance between two points, by using the haversine formula.

In this step, learn two different methods for creating a feature from data:

[!div class="checklist"]

  • Using a custom R function
  • Using a custom T-SQL function in [!INCLUDEtsql]

The goal is to create a new [!INCLUDEssNoVersion] set of data that includes the original columns plus the new numeric feature, direct_distance.

Prerequisites

This step assumes an ongoing R session based on previous steps in this walkthrough. It uses the connection strings and data source objects created in those steps. The following tools and packages are used to run the script.

  • Rgui.exe to run R commands
  • Management Studio to run T-SQL

Featurization using R

The R language is well-known for its rich and varied statistical libraries, but you still might need to create custom data transformations.

First, let's do it the way R users are accustomed to: get the data onto your laptop, and then run a custom R function, ComputeDist, which calculates the linear distance between two points specified by latitude and longitude values.

  1. Remember that the data source object you created earlier gets only the top 1000 rows. So let's define a query that gets all the data.

    bigQuery<-"SELECT tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude FROM nyctaxi_sample";
  2. Create a new data source object using the query.

    featureDataSource<- RxSqlServerData(sqlQuery=bigQuery,colClasses= c(pickup_longitude="numeric", pickup_latitude="numeric", dropoff_longitude="numeric", dropoff_latitude="numeric", passenger_count="numeric", trip_distance="numeric", trip_time_in_secs="numeric", direct_distance="numeric"), connectionString=connStr);
    • RxSqlServerData can take either a query consisting of a valid SELECT query, provided as the argument to the sqlQuery parameter, or the name of a table object, provided as the table parameter.

    • If you want to sample data from a table, you must use the sqlQuery parameter, define sampling parameters using the T-SQL TABLESAMPLE clause, and set the rowBuffering argument to FALSE.

  3. Run the following code to create the custom R function. ComputeDist takes in two pairs of latitude and longitude values, and calculates the linear distance between them, returning the distance in miles.

    env<- new.env(); env$ComputeDist<-function(pickup_long, pickup_lat, dropoff_long, dropoff_lat){ R<-6371/1.609344#radius in miledelta_lat<-dropoff_lat-pickup_latdelta_long<-dropoff_long-pickup_longdegrees_to_radians=pi/180.0a1<- sin(delta_lat/2*degrees_to_radians) a2<- as.numeric(a1)^2a3<- cos(pickup_lat*degrees_to_radians) a4<- cos(dropoff_lat*degrees_to_radians) a5<- sin(delta_long/2*degrees_to_radians) a6<- as.numeric(a5)^2a<-a2+a3*a4*a6c<-2*atan2(sqrt(a),sqrt(1-a)) d<-R*creturn (d) }
    • The first line defines a new environment. In R, an environment can be used to encapsulate name spaces in packages and such. You can use the search() function to view the environments in your workspace. To view the objects in a specific environment, type ls(<envname>).
    • The lines beginning with $env.ComputeDist contain the code that defines the haversine formula, which calculates the great-circle distance between two points on a sphere.
  4. Having defined the function, you apply it to the data to create a new feature column, direct_distance. but before you run the transformation, change the compute context to local.

    rxSetComputeContext("local");
  5. Call the rxDataStep function to get the feature engineering data, and apply the env$ComputeDist function to the data in memory.

    start.time<- proc.time(); changed_ds<- rxDataStep(inData=featureDataSource, transforms=list(direct_distance=ComputeDist(pickup_longitude,pickup_latitude, dropoff_longitude, dropoff_latitude), tipped="tipped", fare_amount="fare_amount", passenger_count="passenger_count", trip_time_in_secs="trip_time_in_secs", trip_distance="trip_distance", pickup_datetime="pickup_datetime", dropoff_datetime="dropoff_datetime"), transformEnvir=env, rowsPerRead=500, reportProgress=3); used.time<- proc.time() -start.time; print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds, Elapsed Time=", round(used.time[3],2), " seconds to generate features.", sep=""));

    However, a couple of points worth noting regarding rxDataStep:

    In other data sources, you can use the arguments varsToKeep and varsToDrop, but these are not supported for SQL Server data sources. Therefore, in this example, we've used the transforms argument to specify both the pass-through columns and the transformed columns. Also, when running in a SQL Server compute context, the inData argument can only take a SQL Server data source.

    The preceding code can also produce a warning message when run on larger data sets. When the number of rows times the number of columns being created exceeds a set value (the default is 3,000,000), rxDataStep returns a warning, and the number of rows in the returned data frame will be truncated. To remove the warning, you can modify the maxRowsByCols argument in the rxDataStep function. However, if maxRowsByCols is too large, you might experience problems when loading the data frame into memory.

  6. Optionally, you can call rxGetVarInfo to inspect the schema of the transformed data source.

    rxGetVarInfo(data=changed_ds);

Featurization using Transact-SQL

In this exercise, learn how to accomplish the same task using SQL functions instead of custom R functions.

Switch to SQL Server Management Studio or another query editor to run the T-SQL script.

  1. Use a SQL function, named fnCalculateDistance. The function should already exist in the NYCTaxi_Sample database. In Object Explorer, verify the function exists by navigating this path: Databases > NYCTaxi_Sample > Programmability > Functions > Scalar-valued Functions > dbo.fnCalculateDistance.

    If the function does not exist, use SQL Server Management Studio to generate the function in the NYCTaxi_Sample database.

    CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float) -- User-defined function calculates the direct distance between two geographical coordinates. RETURNS decimal(28, 10) ASBEGIN DECLARE @distance decimal(28, 10) -- Convert to radiansSET @Lat1 = @Lat1 /57.2958SET @Long1 = @Long1 /57.2958SET @Lat2 = @Lat2 /57.2958SET @Long2 = @Long2 /57.2958-- Calculate distanceSET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1)) --Convert to miles IF @distance <>0BEGINSET @distance =3958.75* ATAN(SQRT(1- POWER(@distance, 2)) / @distance); END RETURN @distance END
  2. In Management Studio, in a new query window, run the following [!INCLUDEtsql] statement from any application that supports [!INCLUDEtsql] to see how the function works.

    USE nyctaxi_sample GO SELECT tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude FROM nyctaxi_sample
  3. To insert values directly into a new table (you have to create it first), you can add an INTO clause specifying the table name.

    USE nyctaxi_sample GO SELECT tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude INTO NewFeatureTable FROM nyctaxi_sample
  4. You can also call the SQL function from R code. Switch back to Rgui and store the SQL featurization query in an R variable.

    featureEngineeringQuery="SELECT tipped, fare_amount, passenger_count, trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude FROM nyctaxi_sample tablesample (1 percent) repeatable (98052)"

    [!TIP] This query has been modified to get a smaller sample of data, to make this walkthrough faster. You can remove the TABLESAMPLE clause if you want to get all the data; however, depending on your environment, it might not be possible to load the full dataset into R, resulting in an error.

  5. Use the following lines of code to call the [!INCLUDEtsql] function from your R environment and apply it to the data defined in featureEngineeringQuery.

    featureDataSource= RxSqlServerData(sqlQuery=featureEngineeringQuery, colClasses= c(pickup_longitude="numeric", pickup_latitude="numeric", dropoff_longitude="numeric", dropoff_latitude="numeric", passenger_count="numeric", trip_distance="numeric", trip_time_in_secs="numeric", direct_distance="numeric"), connectionString=connStr)
  6. Now that the new feature is created, call rxGetVarsInfo to create a summary of the data in the feature table.

    rxGetVarInfo(data=featureDataSource)

    Results

    Var1:tipped, Type:integerVar2:fare_amount, Type:numericVar3:passenger_count, Type:numericVar4:trip_time_in_secs, Type:numericVar5:trip_distance, Type:numericVar6:pickup_datetime, Type:characterVar7:dropoff_datetime, Type:characterVar8:direct_distance, Type:numericVar9:pickup_latitude, Type:numericVar10:pickup_longitude, Type:numericVar11:dropoff_latitude, Type:numericVar12:dropoff_longitude, Type:numeric

    [!NOTE] In some cases, you might get an error like this one: The EXECUTE permission was denied on the object 'fnCalculateDistance' If so, make sure that the login you are using has permissions to run scripts and create objects on the database, not just on the instance. Check the schema for the object, fnCalculateDistance. If the object was created by the database owner, and your login belongs to the role db_datareader, you need to give the login explicit permissions to run the script.

Comparing R functions and SQL functions

Remember this piece of code used to time the R code?

start.time<- proc.time() <yourcodehere>used.time<- proc.time() -start.time print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds, Elapsed Time=", round(used.time[3],2), " seconds to generate features.", sep=""))

You can try using this with the SQL custom function example to see how long the data transformation takes when calling a SQL function. Also, try switching compute contexts with rxSetComputeContext and compare the timings.

Your times might vary significantly, depending on your network speed, and your hardware configuration. In the configurations we tested, the [!INCLUDEtsql] function approach was faster than using a custom R function. Therefore, we've use the [!INCLUDEtsql] function for these calculations in subsequent steps.

Tip

Very often, feature engineering using [!INCLUDEtsql] will be faster than R. For example, T-SQL includes fast windowing and ranking functions that can be applied to common data science calculations such as rolling moving averages and n-tiles. Choose the most efficient method based on your data and task.

Next steps

[!div class="nextstepaction"] Build an R model and save to SQL

close