The <- assignment operator creates a binding or a reference from the name on the left to the object on the right. The lobstr package can be helpful for investigating R data structures. For example, the obj_addr() or obj_addrs() functions from the lobstr package can be used to see the memory address of objects:
x <- "hello"
y <- x
lobstr::obj_addr(x)
#> [1] "0x5581cdefd650"
lobstr::obj_addr(y)
#> [1] "0x5581cdefd650"
lobstr::obj_addrs(list(x, y))
#> [1] "0x5581cdefd650" "0x5581cdefd650"
A syntactically valid name must consist of only letters, digits, dots (.) and underscores (_).
Syntactically valid names must start with a letter, or with a dot that is not followed by a digit.
Names must not be any of the words reserved by R’s parser (see ?Reserved for the full list).
_aVar <- "hello"
#> Error: unexpected input in "_"
.1var <- "hello"
#> Error: unexpected symbol in ".1var"
TRUE <- "hello"
#> Error in TRUE <- "hello" : invalid (do_set) left-hand side to assignment
See ?make.names for more detail.
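As a quick check of these rules, base R's make.names() can be compared against a candidate name: if it returns the input unchanged, the name was already syntactically valid. This is only a sketch; the helper name is_syntactic is hypothetical and not part of base R.

```r
# Hypothetical helper: a name is treated as syntactic if make.names()
# leaves it unchanged
is_syntactic <- function(x) {
  x == make.names(x)
}

candidates <- c("_aVar", ".1var", "TRUE", "valid_name", ".hidden")
fixed <- make.names(candidates)   # what R would rename each candidate to
valid <- is_syntactic(candidates)

print(data.frame(candidates, fixed, valid))
```

The three names that failed above are rewritten by make.names() (a prefix is added or a dot is appended), while valid_name and .hidden pass through untouched.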
You can use backticks if you need to use names that are not syntactically valid.
`_aVar` <- "hello"
`_aVar`
#> [1] "hello"
`.1var` <- "hello"
`.1var`
#> [1] "hello"
`TRUE` <- "hello"
`TRUE`
#> [1] "hello"
This can sometimes be helpful if you are loading external data, for example a CSV whose header names start with underscores.
Notice the check.names = FALSE argument to the read.table function. This prevents automatic conversion to syntactically valid names.
df1 <- read.table(
text = "_var1,_var_2\n0.01,1\n0.05,0",
sep = ",",
check.names = FALSE,
header = TRUE
)
df1
#> _var1 _var_2
#> 1 0.01 1
#> 2 0.05 0
df1$`_var1`
#> [1] 0.01 0.05
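For contrast, here is a minimal sketch of the same call with check.names left at its default of TRUE. The variable name df2 is introduced here only for illustration. In this case read.table passes the header through make.names(), so the columns are renamed.

```r
# Same data as above, but check.names is left at its default (TRUE)
df2 <- read.table(
  text = "_var1,_var_2\n0.01,1\n0.05,0",
  sep = ",",
  header = TRUE
)

# The header names have been converted to syntactically valid names
names(df2)
```

The columns come back as X_var1 and X_var_2, so they can be accessed without backticks, at the cost of no longer matching the original header.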
Generally objects are only copied when they are modified:
x <- c("hello", "world")
lobstr::obj_addr(x)
#> [1] "0x5581cf7376b8"
y <- x
# `x` and `y` reference the same memory address
lobstr::obj_addrs(list(x, y))
#> [1] "0x5581cf7376b8" "0x5581cf7376b8"
# modify `y`
y[[1]] <- "hi"
lobstr::obj_addrs(list(x, y))
#> [1] "0x5581cf7376b8" "0x5581cf737538"
# `y` has been copied to a new memory address
# modify `y` again
y[[2]] <- "everyone"
lobstr::obj_addr(y)
#> [1] "0x5581cf736338"
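# `y` now has a single name bound to it, so this change can usually happen in place;
# the address printed here differs from the previous one, so in this session a copy
# was still made (an environment such as RStudio can hold an extra reference)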
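If lobstr is not available, base R's tracemem() gives a rough view of the same copy-on-modify behaviour by printing a message whenever the traced object is duplicated. This is a sketch under the assumption that R was built with memory profiling; the capabilities("profmem") guard skips the tracing otherwise.

```r
x <- c("hello", "world")
y <- x   # `x` and `y` are bound to the same object

if (capabilities("profmem")) {
  tracemem(x)       # mark the shared object for tracing
  y[[1]] <- "hi"    # a "tracemem[... -> ...]" line is printed: a copy was made
  untracemem(x)
} else {
  y[[1]] <- "hi"    # same modification, just without the trace message
}

x   # unchanged
y   # the modified copy
```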
This is also true for objects that are used as arguments to functions:
aFunc <- function(arg1) {
# a simple function that just returns the input
return(arg1)
}
input <- c("hello", "world")
lobstr::obj_addr(input)
#> [1] "0x5581cfe5d988"
output <- aFunc(input)
lobstr::obj_addrs(list(input, output))
#> [1] "0x5581cfe5d988" "0x5581cfe5d988"
# `input` and `output` reference the same memory address
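Conversely, if a function modifies its argument, the copy is made inside the function and the caller's object is left untouched. A minimal sketch using base R only; the function name shout_first is hypothetical.

```r
shout_first <- function(arg1) {
  # modifying the argument triggers a copy that is local to the function
  arg1[[1]] <- toupper(arg1[[1]])
  arg1
}

input <- c("hello", "world")
output <- shout_first(input)

input                      # still the original values
output                     # the modified copy
identical(input, output)   # the two objects now differ
```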
An object with a single name bound to it can usually be modified in place:
# if run in RStudio all lines need to be run together
x <- c(1L, 2L, 3L)
lobstr::obj_addr(x)
#> [1] "0x55ffb8d5b618"
x[[2]] <- 9L
lobstr::obj_addr(x)
#> [1] "0x55ffb8d5b618"
An object with more than one name bound to it will not be modified in place:
x <- c(1L, 2L, 3L)
y <- x
lobstr::obj_addrs(list(x, y))
#> [1] "0x55ffb9328018" "0x55ffb9328018"
x[[2]] <- 9L
lobstr::obj_addrs(list(x, y))
# `x` has been copied to a new memory address; `y` keeps the original one
Lists do not store values. They store references to values. The lobstr::ref() function can demonstrate that:
aVector <- c("hello", "world")
lobstr::obj_addr(aVector)
#> [1] "0x5581d159bae8"
lobstr::ref(aVector)
#> [1:0x5581d159bae8] <character>
# a single reference (the start of the vector)
aList <- list("hello", "world")
lobstr::obj_addr(aList)
#> [1] "0x5581d1704528"
lobstr::ref(aList)
#> █ [1:0x5581d1704528] <list>
#> ├─[2:0x5581d0a7bbf0] <character>
#> └─[3:0x5581d0a7bbb8] <character>
Lists are shallow copied. The list object and its references are copied. The values referenced are not copied.
When the copied list is modified, the list object itself is copied to a new address. Only the element references that were modified point to new values; the others still point to the shared values:
aList <- list("hello", "world")
lobstr::ref(aList)
#> █ [1:0x5581d2bb1738] <list>
#> ├─[2:0x5581d13ed0b8] <character>
#> └─[3:0x5581d13ed080] <character>
# copy `aList`
anotherList <- aList
lobstr::obj_addr(anotherList)
#> [1] "0x5581d2bb1738"
# modify `anotherList`
anotherList[[1]] <- "hi"
lobstr::ref(aList, anotherList)
#> █ [1:0x5581d2bb1738] <list>
#> ├─[2:0x5581d13ed0b8] <character>
#> └─[3:0x5581d13ed080] <character>
#>
#> █ [4:0x5581d2bc7238] <list>
#> ├─[5:0x5581d156eae8] <character>
#> └─[3:0x5581d13ed080]
# `anotherList` gets a new memory address
# the second object of `anotherList` still references the same
# memory address as the second object of `aList`
Memory is used efficiently when lists are just references to values:
x <- c(1L, 2L, 3L)
lobstr::obj_size(x)
#> 64 B
y <- list(x, x, x, x)
lobstr::obj_size(y)
#> 144 B
lobstr::obj_size(list(NULL, NULL, NULL, NULL))
#> 80 B
# the size of `y` is:
# the size of `x` (64 B) +
# the size of a 4 element list (80 B)
# = 144 B
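For comparison, base R's utils::object.size() does not account for list elements that share a reference, so it counts `x` once for every element of `y`. This is a rough sketch. Exact byte counts depend on the platform, so only the relationship between the two numbers is checked.

```r
x <- c(1L, 2L, 3L)
y <- list(x, x, x, x)

size_x <- as.numeric(utils::object.size(x))
size_y <- as.numeric(utils::object.size(y))

size_x
size_y

# object.size() adds the size of `x` for each of the four elements,
# unlike lobstr::obj_size(), which counts the shared vector once
size_y >= 4 * size_x
```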
The total size of multiple lists will not be the sum of their individual sizes if they share references:
x <- list(1L, 2L, 3L)
lobstr::obj_size(x)
y <- list(x, 4L)
lobstr::obj_size(y)
lobstr::obj_size(x, y)
# the total size of `x` and `y` is just the size of `y`
# `y` contains all the references of `x`
Data frames are just lists of vectors. Their class attribute is data.frame and they have a row.names attribute.
You can construct them yourself instead of using data.frame():
diyDataFrame <- list(
var1 = c(1, 2),
var2 = c("hello", "world")
)
attr(diyDataFrame, "class") <- "data.frame"
attr(diyDataFrame, "row.names") <- c(1L, 2L)
aDataFrame <- data.frame(
var1 = c(1, 2),
var2 = c("hello", "world")
)
identical(diyDataFrame, aDataFrame)
#> [1] TRUE
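The same structure can be inspected from the other direction by taking a data frame apart with base R. A short sketch:

```r
aDataFrame <- data.frame(
  var1 = c(1, 2),
  var2 = c("hello", "world")
)

typeof(aDataFrame)              # the underlying type is a list
is.list(aDataFrame)             # so list tools work on it
names(attributes(aDataFrame))   # names, class and row.names
unclass(aDataFrame)             # dropping the class shows the plain list
```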
Because data frames are lists, the copy-on-modify behavior of lists applies to data frames.
If you modify a column, only the reference to that column needs to change:
dataFrame1 <- data.frame(
var1 = c("hello", "world"),
var2 = c(0.01, 0.03)
)
lobstr::ref(dataFrame1)
#> █ [1:0x5581d2b94da8] <list>
#> ├─var1 = [2:0x5581d2b95de8] <character>
#> └─var2 = [3:0x5581d2b95d68] <double>
# copy the data.frame
dataFrame2 <- dataFrame1
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581d2b94da8] <list>
#> ├─var1 = [2:0x5581d2b95de8] <character>
#> └─var2 = [3:0x5581d2b95d68] <double>
#>
#> [1:0x5581d2b94da8]
# `dataFrame2` has the same memory address as `dataFrame1`
# modify a column
dataFrame2$var2 <- c(0.05, 0.01)
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581d2b94da8] <list>
#> ├─var1 = [2:0x5581d2b95de8] <character>
#> └─var2 = [3:0x5581d2b95d68] <double>
#>
#> █ [4:0x5581d2cc22e8] <list>
#> ├─var1 = [2:0x5581d2b95de8]
#> └─var2 = [5:0x5581d2cc2428] <double>
# `dataFrame2` gets a new memory address
# the first object of `dataFrame2` still references the same
# memory address as the first object of `dataFrame1`
If you modify a row then every reference needs to change. Every column will be copied to a new location in memory.
dataFrame1 <- data.frame(
var1 = c("hello", "world"),
var2 = c(0.01, 0.03)
)
# copy the data.frame
dataFrame2 <- dataFrame1
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581cf243c38] <list>
#> ├─var1 = [2:0x5581cf7578d8] <character>
#> └─var2 = [3:0x5581cf7579d8] <double>
#>
#> [1:0x5581cf243c38]
# modify a row
dataFrame2[1, ] <- list("hi", 0.9)
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581cf243c38] <list>
#> ├─var1 = [2:0x5581cf7578d8] <character>
#> └─var2 = [3:0x5581cf7579d8] <double>
#>
#> █ [4:0x5581cf5f9458] <list>
#> ├─var1 = [5:0x5581cdb11e40] <character>
#> └─var2 = [6:0x5581cdb11e78] <double>
# every reference in `dataFrame2` has changed
I think the “global string pool” concept is referring to the CHARSXP cache.
All elements of character vectors point to unique values in the global string pool:
x <- c("hello", "world", "hello")
lobstr::ref(
x = x,
character = TRUE
)
#> █ [1:0x5581ce078278] <character>
#> ├─[2:0x5581c9b6e798] <string: "hello">
#> ├─[3:0x5581d0d616b8] <string: "world">
#> └─[2:0x5581c9b6e798]
# the third element has the same memory
# reference as the first
This means repetition in character vectors uses less memory:
x <- c("hello", "world")
lobstr::obj_size(x)
#> 176 B
lobstr::obj_size(rep(x, 10))
#> 320 B
# the character vector repeated 10 times does not use
# 10x the memory
There are multiple OOP systems in R:
OOP System | Description |
---|---|
S3 | Provided by base R. Allows functions to return rich results with a nicely formatted display. Used throughout base R. Needed if extending base R functions to work with different inputs. |
S4 | Provided by base R. |
R6 | Provided by the R6 package. Similar to reference classes in base R (setRefClass(), getRefClass()). Allows you to avoid R’s copy-on-modify behaviour. |
People have different preferences for the three systems.
polymorphism - consider a function’s interface separately from its implementation. Different types of input can use the same function form. For example, summary() gives different outputs depending on the type of variable provided (numeric or factor).
encapsulation - provide users with an interface that is independent of how an object is internally implemented.
A class describes what an object is.
A method describes what an object does.
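These ideas can be sketched with a small S3 example in base R. The class name greeting and the constructor new_greeting are hypothetical, made up for this illustration. The same summary() call is dispatched to a different implementation depending on the class of its input.

```r
# Constructor for a hypothetical S3 class called "greeting"
new_greeting <- function(text) {
  stopifnot(is.character(text))
  structure(list(text = text), class = "greeting")
}

# Method: what a "greeting" does when summary() is called on it
summary.greeting <- function(object, ...) {
  paste0("greeting with ", nchar(object$text), " characters")
}

g <- new_greeting("hello")

class(g)      # the class describes what `g` is
summary(g)    # dispatches to summary.greeting()

# polymorphism: the same generic gives different output per input type
summary(c(1.5, 2.5, 4))
summary(factor(c("a", "b", "a")))
```

Callers only use new_greeting() and summary(), so the internal list layout could change without affecting them, which is the encapsulation point made above.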