C. Compare with SAS

This section discusses the similarities and differences between SAS and R. While almost everything in SAS can be replicated in R, there is a steep learning curve because R concepts and process flow are more object oriented. For example, the special characters [], {} and () each have a specific meaning in R: [] subsets an object, {} groups statements into a block, and () encloses function arguments. In addition, most R syntax consists of function calls, which are similar to SAS functions, so knowing how to call SAS functions helps in understanding, writing and executing R functions. A few SAS terms and their R counterparts are listed below.

  • SAS                                   R
  • data set                              data frame
  • variable                              vector
  • data steps, procedures & functions    R packages and functions (tidyverse)
  • proc sql                              dplyr
  • macro programs                        R user-defined functions
  • ODS                                   R Markdown

R has SQL-style query functions for filtering records, subsetting variables and summarizing data. Finally, R processes code as a sequence of independent steps. For example, variables can be created one by one with left assignments (<-), or the assignments can be combined by wrapping them in an R function.
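The ideas in the paragraph above can be sketched in a few lines. This is a minimal illustration in base R, assuming the built-in mtcars data frame as a stand-in for a SAS data set; the variable names wt_lbs and heavy are made up for the example. The dplyr verbs filter(), select() and summarise() cover the same three operations.

```r
# Built-in mtcars data frame stands in for a SAS data set (assumption for illustration)
cars <- mtcars

# Filter records and subset variables (like WHERE plus SELECT in proc sql)
four_cyl <- subset(cars, cyl == 4, select = c(mpg, wt))

# Summary processing by group (like GROUP BY in proc sql)
avg_mpg <- aggregate(mpg ~ cyl, data = cars, FUN = mean)

# Variables created one by one with left assignments ...
cars$wt_lbs <- cars$wt * 1000        # wt is recorded in 1000 lbs
cars$heavy  <- cars$wt_lbs > 3500    # hypothetical cut-off

# ... or combined in a single function call
cars2 <- within(mtcars, {
  wt_lbs <- wt * 1000
  heavy  <- wt_lbs > 3500
})

four_cyl
avg_mpg
```

Both approaches give the same values; the step-by-step form is easier to debug, while the wrapped form keeps related assignments together.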

With so many SAS programmers learning R, it makes sense that an R package aimed at SAS users was created. The sassy package gives SAS programmers a smoother transition between SAS and R: it comes close to replicating the SAS workflow of reviewing logs, reviewing datasets, and programming data steps, formats and reports. There are also R packages and functions that replicate Proc Freq, Proc Means and Proc Report. It is important to realize that although SAS has tools to integrate with R and R has packages that replicate SAS programming, the objective of learning R is to treat it as a complete, standalone toolbox that can be used entirely independently of SAS.

               

          R functions resemble many features of the SAS Data Step.

          Compare and Contrast SAS Procedures and R Functions

          proc print data=d1; var _all_; run;

          d1 displays all records, head(d1, 5) displays the first 5 records (head(d1) alone displays 6), and tail(d1) displays the last 6 records

          To subset, first create an R object from the selected records and variables, then display all of its records
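As a minimal sketch of the display and subset steps above, assuming a small made-up data frame d1 with id, sex and age variables:

```r
# Hypothetical data frame standing in for the SAS data set d1
d1 <- data.frame(id  = 1:10,
                 sex = rep(c("M", "F"), 5),
                 age = 21:30,
                 stringsAsFactors = FALSE)

d1            # all records, like proc print
head(d1)      # first 6 records by default
head(d1, 5)   # first 5 records
tail(d1, 3)   # last 3 records

# Subset: keep female records and two variables, then display the new object
d1_f <- d1[d1$sex == "F", c("id", "age")]
d1_f
```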

          proc freq data=d1; tables sex*race; run;

          table(d1$sex, d1$race) # counts by sex and race

          prop.table(table(d1$sex, d1$race)) # cell proportions of the same table
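A short sketch of the frequency functions above, assuming a made-up d1 with sex and race variables. prop.table() takes the table as its argument, and its second argument selects row or column percentages.

```r
# Hypothetical data; the values are invented for illustration
d1 <- data.frame(sex  = c("M", "F", "F", "M", "F", "M"),
                 race = c("A", "B", "A", "A", "B", "B"),
                 stringsAsFactors = FALSE)

freq <- table(d1$sex, d1$race)   # cross-tabulation of counts

prop.table(freq)      # cell proportions (sum to 1 over the whole table)
prop.table(freq, 1)   # row proportions (each row sums to 1)
prop.table(freq, 2)   # column proportions (each column sums to 1)
addmargins(freq)      # counts with row and column totals
```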

          proc univariate;

          summary(d1)

          proc sort data=dm2; by sex race; run;

          dm2 <- dm2[order(dm2$sex, dm2$race), ] # sort rows by sex, then race
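The sort above can be tried on a small example. This sketch assumes a made-up dm2 data frame; the second call shows one way to mimic a descending BY variable, which needs method = "radix" when the sort direction differs between keys.

```r
# Hypothetical data frame standing in for dm2
dm2 <- data.frame(sex  = c("M", "F", "M", "F"),
                  race = c("B", "B", "A", "A"),
                  id   = 1:4,
                  stringsAsFactors = FALSE)

# Ascending sort by sex, then race (like: by sex race;)
dm2_sorted <- dm2[order(dm2$sex, dm2$race), ]

# Descending sex, ascending race (like: by descending sex race;)
dm2_desc <- dm2[order(dm2$sex, dm2$race,
                      decreasing = c(TRUE, FALSE), method = "radix"), ]

dm2_sorted
dm2_desc
```

Note the comma after order(...): it selects rows and keeps all columns.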

          proc format;

          In Vectors: sex_code <- c('M', 'M', 'M', 'F', 'M') # 1. simple vector storing the data values

          sex_decode <- c('M'='Male', 'F'='Female') # 2. named vector of value = 'label' pairs, similar to proc format

          sex <- sex_decode[sex_code] # 3. converts values to labels by indexing sex_decode with the sex_code values

          As Functions:

          age_cat <- Vectorize(function(x) { # x is a single input value
            if (is.na(x)) { # missing age
              ret <- "Unknown"
            } else if (x < 18) { # condition
              ret <- "< 18" # return label
            } else if (x >= 18 & x < 24) {
              ret <- "18 to 24"
            } else if (x >= 24 & x < 45) {
              ret <- "24 to 45"
            } else if (x >= 45 & x < 60) {
              ret <- "45 to 60"
            } else { # x >= 60
              ret <- "> 60"
            }
            return(ret)
          })

          df$age_cat <- age_cat(df$age) # apply the function to the age variable to create the age_cat variable
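As an alternative to the function above (not the approach used in this section), base R's cut() can assign the same categories in one call. This is a sketch using made-up ages and the same labels; right = FALSE makes each interval include its lower bound, matching the conditions above.

```r
# Hypothetical ages, including a missing value
age <- c(10, 18, 23.9, 24, 45, 60, NA)

# Intervals are [-Inf, 18), [18, 24), [24, 45), [45, 60), [60, Inf)
age_cat <- cut(age,
               breaks = c(-Inf, 18, 24, 45, 60, Inf),
               labels = c("< 18", "18 to 24", "24 to 45", "45 to 60", "> 60"),
               right  = FALSE)

# cut() returns a factor; convert to character and label missing ages
age_cat <- as.character(age_cat)
age_cat[is.na(age_cat)] <- "Unknown"
age_cat
```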

          proc means;

          library(dplyr)
          books %>% # books is a placeholder name for the input data frame
            summarise(AllPages = sum(Pages),
                      AvgLength = mean(Pages),
                      AvgRating = mean(MyRating),
                      AvgReadTime = mean(read_time),
                      ShortRT = min(read_time),
                      LongRT = max(read_time),
                      TotalAuthors = n_distinct(Author))

          proc contents;

          library(Hmisc)

          contents(dm)

          • 7 types of dataframe compare results

          • 1 - Ideal, Exact match
          • - For all variables and for all records, no difference in two datasets

          • 2 - Inconsistent Sorting of Records
          • - Need to sort both dataframes by key variables, (Most important, First to correct)

          • 3 - Misalignment of Key Variables using ID statement
          • - Need to correct variable name / type / length / label

          • 4 - Record Count mismatch
          • - Need to view list of records in one dataframe but not in the other dataframe and update record selection condition / add / delete records as needed, test with subsets

          • 5 - Variable mismatch
          • - Need to view the list of variables in one dataframe but not in the other dataframe and add / delete variables

          • 6 - Data Values mismatch
          • - Need to view source dataframe with filter / sort to identify correct values if needed, update variable assignments, consider: case-sensitive, spaces, and rounding.

          • 7 - Variable Attributes mismatch
          • - Need to update variable attributes (name / type / length / label). (Generally a length or label mismatch may be acceptable, but not a name or type mismatch.)
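Several of the result types above can be checked by hand before running a comparison package. The following is a rough base R sketch on two small made-up data frames, with id assumed to be the key variable; it is not how diffdf or comparedf work internally.

```r
# Hypothetical base and compare data frames keyed by id
base <- data.frame(id = c(1, 2, 3), sex = c("M", "F", "M"),
                   age = c(30, 41, 25), stringsAsFactors = FALSE)
comp <- data.frame(id = c(3, 1, 2), sex = c("M", "M", "F"),
                   age = c(25, 30, 41.5), stringsAsFactors = FALSE)

# Type 2: sort both data frames by the key variable first
base <- base[order(base$id), ]; rownames(base) <- NULL
comp <- comp[order(comp$id), ]; rownames(comp) <- NULL

# Type 4: records in one data frame but not in the other
only_base_ids <- setdiff(base$id, comp$id)
only_comp_ids <- setdiff(comp$id, base$id)

# Type 5: variables in one data frame but not in the other
only_base_vars <- setdiff(names(base), names(comp))

# Type 7: variables whose type differs
type_mismatch <- names(base)[sapply(base, class) != sapply(comp[names(base)], class)]

# Type 6: rows where a data value differs (age shown here)
value_diff <- which(base$age != comp$age)

# Type 1: exact match for all variables and all records
exact_match <- identical(base, comp)

value_diff
exact_match
```

Sorting comes first because most of the later checks compare rows by position.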

          # compare dataframes using diffdf

          # diffdf(base = test_data, compare = test_data2, keys = c("GROUP1", "GROUP2"))
          # compares data frames by the GROUP1 and GROUP2 key variables; keys are useful to include in the reported differences

          install.packages("diffdf") # one-time installation
          library(diffdf)

          LENGTH <- 30 # number of records in the sample data

          suppressWarnings(RNGversion("3.5.0"))
          set.seed(12334)

          test_data <- tibble::tibble(
            ID = 1:LENGTH,
            GROUP1 = rep(c(1, 2), each = LENGTH / 2),
            GROUP2 = rep(c(1:(LENGTH / 2)), 2),
            INTEGER = rpois(LENGTH, 40),
            BINARY = sample(c("M", "F"), LENGTH, replace = TRUE),
            DATE = lubridate::ymd("2000-01-01") + rnorm(LENGTH, 0, 7000),
            DATETIME = lubridate::ymd_hms("2000-01-01 00:00:00") + rnorm(LENGTH, 0, 200000000),
            CONTINUOUS = rnorm(LENGTH, 30, 12),
            CATEGORICAL = factor(sample(c("A", "B", "C"), LENGTH, replace = TRUE)), # factor variable type
            LOGICAL = sample(c(TRUE, FALSE), LENGTH, replace = TRUE),
            CHARACTER = stringi::stri_rand_strings(LENGTH, rpois(LENGTH, 13), pattern = "[ A-Za-z0-9]")
          )

          test_data # sample data frame

          diffdf(test_data, test_data) # expects same variable names, type, label, order and number of rows
          # the first data frame is the base and the second data frame is the compare

          test_data2 <- test_data
          test_data2 <- test_data2[, -6] # remove the DATE variable (column 6)
          diffdf(test_data, test_data2, suppress_warnings = TRUE) # suppress warnings

          test_data2 <- test_data
          test_data2 <- test_data2[1:(nrow(test_data2) - 2), ] # remove the last two records
          diffdf(test_data, test_data2, suppress_warnings = TRUE)

          test_data2 <- test_data
          test_data2[5, 2] <- 6 # assign the value 6 to GROUP1 in row 5
          diffdf(test_data, test_data2, suppress_warnings = TRUE)

          test_data2 <- test_data
          test_data2$GROUP1 <- as.character(test_data2$GROUP1) # change GROUP1 from numeric to character variable type
          diffdf(test_data, test_data2, suppress_warnings = TRUE)

          test_data2 <- test_data
          attr(test_data$ID, "label") <- "This is a interesting label"
          attr(test_data2$ID, "label") <- "what do I type here?" # assign a different label to ID
          diffdf(test_data, test_data2, suppress_warnings = TRUE)

          test_data2 <- test_data
          levels(test_data2$CATEGORICAL) <- c(1, 2, 3) # assign level values 1, 2 and 3 instead of A, B and C
          diffdf(test_data, test_data2, suppress_warnings = TRUE)

          test_data2 <- test_data
          test_data2$INTEGER[c(5, 2, 15)] <- 99L # assign a different value to three records
          diffdf(test_data, test_data2, keys = c("GROUP1", "GROUP2"), suppress_warnings = TRUE) # by GROUP1 and GROUP2, useful to include in differences

          iris2 <- iris
          for (i in 1:3) iris2[i, i] <- 99 # assign 99 to three cells of the iris2 data frame

          diff <- diffdf(iris, iris2, suppress_warnings = TRUE) # save the comparison results to an object

          diffdf_issuerows(iris, diff) # display the iris rows with issues

          diffdf_issuerows(iris2, diff) # display the iris2 rows with issues

          diffdf_issuerows(iris2, diff, vars = "Sepal.Length") # limit to one variable

          diffdf_issuerows(iris2, diff, vars = c("Sepal.Length", "Sepal.Width")) # limit to two variables


          iris2 <- iris
          for (i in 1:3) iris2[i, i] <- 99 # assign 99 to three cells of the iris2 data frame
          diff <- diffdf(iris, iris2, suppress_warnings = TRUE)

          diffdf_has_issues(diff) # TRUE or FALSE for data frame differences

          if (diffdf_has_issues(diff)) {
            # handle the differences here
          }

          dsin1 <- data.frame(x = 1.1e-06)
          dsin2 <- data.frame(x = 1.1e-07) # assign a different value
          diffdf(dsin1, dsin2, suppress_warnings = TRUE)

          dsin1 <- data.frame(x = as.integer(c(1, 2, 3))) # integer variable
          dsin2 <- data.frame(x = as.numeric(c(1, 2, 3))) # numeric variable
          diffdf(dsin1, dsin2, suppress_warnings = TRUE)

          diffdf(dsin1, dsin2, suppress_warnings = TRUE, strict_numeric = FALSE) # option to accept integer as numeric

          dsin1 <- data.frame(x = as.character(c(1, 2, 3)), stringsAsFactors = FALSE)
          dsin2 <- data.frame(x = as.factor(c(1, 2, 3))) # factors store integer codes with character levels
          diffdf(dsin1, dsin2, suppress_warnings = TRUE)

          diffdf(dsin1, dsin2, suppress_warnings = TRUE, strict_factor = FALSE) # option to accept character and factors

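          proc compare using comparedf;

          library(arsenal) # provides comparedf() and the mockstudy data
          # mockstudy2 is assumed to be a modified copy of mockstudy created beforehand
          cmp <- comparedf(mockstudy, mockstudy2, by = "case", tol.vars = c("._ ", "case"), int.as.num = TRUE)

          n.diffs(cmp) # number of differences found

          proc report;

          library(flextable)
          flextable(d1) # formatted report table built from the data frame d1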

          Section Objectives

          • Understand and be able to apply the sassy package for tables, lists and graphs
          • Know how to apply SQL methods in R programming
          • Understand the difference between SAS macros and R programming
          • Know how to apply the tidyverse packages
          Bayer's R and SAS

                               

