
R: watch out when comparing vectors using “==”

This question was on the R-help mailing list today:

I have a data frame “test”:

test<-data.frame(x=c(1,2,3,4,5,6,7,8),y=c(2,3,4,5,6,7,8,9),total=c(7,7,8,8,9,9,10,10))
test

I have a vector "needed":

needed<-c(7,9)
needed

I need the result to look like this:

1 2 7
2 3 7
5 6 9
6 7 9

When I do the following:

result<-test[test["total"]==needed,]
result

I only get unique rows that have 7 or 9 in "total":

1 2 7
6 7 9

[...]

The solution is to use %in% instead:

test[test$total %in% needed,]
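As a minimal sketch of what this returns, the following rebuilds the data from the question and selects the rows with %in%. The intermediate name keep is only an illustration and does not appear in the original question.

```r
# Data from the question
test <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                   y = c(2, 3, 4, 5, 6, 7, 8, 9),
                   total = c(7, 7, 8, 8, 9, 9, 10, 10))
needed <- c(7, 9)

# %in% asks, for each element of test$total, whether it occurs anywhere in needed
keep <- test$total %in% needed   # 'keep' is a hypothetical helper name
keep
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

# Use the logical vector as a row index
result <- test[keep, ]
result
```

Rows 1, 2, 5 and 6 come back, which is the result the question asked for.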

A quick aside: test["total"] does not give you the column "total" as a vector; it returns a one-column data frame. To get the column itself, use test[,"total"] or, more succinctly for data frames, test$total. Compare identical(test["total"],test$total), which is FALSE, with identical(test[,"total"],test$total), which is TRUE.
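The difference between these indexing forms can be checked directly. This is a small sketch using the data from the question; the test[["total"]] form is an extra I added for comparison and is not mentioned above.

```r
test <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                   y = c(2, 3, 4, 5, 6, 7, 8, 9),
                   total = c(7, 7, 8, 8, 9, 9, 10, 10))

class(test["total"])     # "data.frame": still a data frame, with one column
class(test[, "total"])   # "numeric": the column as a plain vector
class(test$total)        # "numeric"

identical(test["total"], test$total)     # FALSE
identical(test[, "total"], test$total)   # TRUE
identical(test[["total"]], test$total)   # TRUE (not in the original post)
```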

Why doesn't == work here? It works just fine if you want to find all rows where the column total equals 7:

test[test$total == 7,]

The trick is that to R, test$total, needed and 7 are all vectors, and == compares vectors element by element. Since 7 and needed are shorter than test$total, R recycles them as often as necessary to match the length of test$total. (You can inspect the logical index directly by typing test$total==needed.) So what R is doing is comparing this:

> test$total
[1]  7  7  8  8  9  9 10 10

to this:

> rep(needed,length.out=length(test$total))
[1] 7 9 7 9 7 9 7 9

The two vectors just happen to coincide at positions 1 and 6, which is why only rows 1 and 6 are returned. Recycling is no problem when you look for matches with a single number, because:

> rep(7,length.out=length(test$total))
[1] 7 7 7 7 7 7 7 7

will do just fine.
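The recycling can be made explicit with a short sketch. The names total, recycled and msg are my own; the last step also shows what happens, as far as I know, when the longer length is not a multiple of the shorter one: R still recycles, but issues a warning.

```r
total  <- c(7, 7, 8, 8, 9, 9, 10, 10)   # same values as test$total
needed <- c(7, 9)

# == against the short vector behaves like == against the recycled vector
recycled <- rep(needed, length.out = length(total))
identical(total == needed, total == recycled)   # TRUE

which(total == needed)     # positions 1 and 6: positional coincidences only
which(total %in% needed)   # positions 1, 2, 5 and 6: membership test

# 8 is not a multiple of 3, so R warns while recycling; capture the message
msg <- tryCatch(total == c(7, 9, 10),
                warning = function(w) conditionMessage(w))
msg
```

The warning is a useful hint that == is being used where %in% was meant, but it only appears when the lengths do not divide evenly. In the question they do (8 and 2), so R stays silent.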

By the way, to get the help text on %in%, you need to quote it:

?"%in%"
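That help page describes %in% as a thin wrapper around match(). As I understand it, x %in% table behaves like match(x, table, nomatch = 0) > 0; the sketch below only checks that equivalence on the data from this post, and the name in_via_match is hypothetical.

```r
total  <- c(7, 7, 8, 8, 9, 9, 10, 10)
needed <- c(7, 9)

# match() gives the position of each element of total in needed (0 if absent)
match(total, needed, nomatch = 0)
# [1] 1 1 0 0 2 2 0 0

# Hypothetical re-implementation, assuming the definition given in the help text
in_via_match <- function(x, table) match(x, table, nomatch = 0) > 0

identical(in_via_match(total, needed), total %in% needed)   # TRUE
```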
