R: watch out when comparing vectors using “==”
This question was on the R-help mailing list today:
I have a data frame “test”:
test<-data.frame(x=c(1,2,3,4,5,6,7,8),y=c(2,3,4,5,6,7,8,9),total=c(7,7,8,8,9,9,10,10))
test
I have a vector "needed":
needed<-c(7,9)
needed
I need the result to look like this:
1 2 7
2 3 7
5 6 9
6 7 9
When I do the following:
result<-test[test["total"]==needed,]
result
I only get unique rows that have 7 or 9 in "total":
1 2 7
6 7 9
[...]
The solution is to use %in% instead:
test[test$total %in% needed,]
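To see why this works, it can help to look at the logical index that %in% produces before using it to pick rows. The following is a small sketch that rebuilds the data from the question so it runs on its own; the names idx and result are only illustrative.

```r
# Rebuild the data from the question so the snippet is self-contained.
test <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                   y = c(2, 3, 4, 5, 6, 7, 8, 9),
                   total = c(7, 7, 8, 8, 9, 9, 10, 10))
needed <- c(7, 9)

# For each element of test$total, %in% asks whether that value occurs
# anywhere in needed. The result has the length of the left-hand side.
idx <- test$total %in% needed
print(idx)

# Use the logical vector as a row index.
result <- test[idx, ]
print(result)
```

Because every element of test$total is checked against all of needed, the order and length of needed do not matter here.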
A quick aside: test["total"] does not select the column "total" as a vector; it returns a data frame with a single column. To get the column itself, index with test[,"total"] or, more succinctly for data frames, use test$total. Compare identical(test["total"],test$total), which is FALSE, with identical(test[,"total"],test$total), which is TRUE.
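The difference between these ways of indexing is easiest to see by checking what each one returns. A brief sketch, assuming test is a plain base-R data.frame as in the question:

```r
test <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                   y = c(2, 3, 4, 5, 6, 7, 8, 9),
                   total = c(7, 7, 8, 8, 9, 9, 10, 10))

# Single brackets with one name keep the data frame structure.
print(class(test["total"]))

# Matrix-style indexing and $ both drop down to the underlying vector.
print(class(test[, "total"]))
print(class(test$total))

# A one-column data frame is not the same object as the vector inside it.
print(identical(test["total"], test$total))
print(identical(test[, "total"], test$total))
```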
Why doesn't == work here? It works just fine if you want to find all rows where the column total equals 7:
test[test$total == 7,]
The trick is that to R, test$total, needed and 7 are all vectors, and == compares vectors element by element. Since 7 and needed are shorter than test$total, they are recycled as often as necessary to match the length of test$total. (Note, by the way, that you can look at the logical index directly by typing test$total==needed.) So what R is doing is comparing this:
> test$total
[1] 7 7 8 8 9 9 10 10
to this:
> rep(needed,length.out=length(test$total))
[1] 7 9 7 9 7 9 7 9
The two vectors just happen to coincide at test$total[1] and test$total[6]. The recycling is of course no problem when you look for matches with a single number, because:
> rep(7,length.out=length(test$total))
[1] 7 7 7 7 7 7 7 7
will do just fine.
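The recycling can also be reproduced by hand. The sketch below builds the recycled vector explicitly and checks that comparing against it gives the same logical index as == does on its own; the names recycled, by_hand and auto are only illustrative.

```r
test <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                   y = c(2, 3, 4, 5, 6, 7, 8, 9),
                   total = c(7, 7, 8, 8, 9, 9, 10, 10))
needed <- c(7, 9)

# What == does implicitly: stretch needed to the length of test$total.
recycled <- rep(needed, length.out = length(test$total))

# Comparing against the hand-recycled vector gives the same logical index
# as letting == recycle needed itself.
by_hand <- test$total == recycled
auto <- test$total == needed
print(identical(by_hand, auto))

# Positions where the two vectors happen to coincide.
print(which(auto))
```

Since 8 is a multiple of 2, R recycles silently here. It only warns when the longer length is not a multiple of the shorter one, which is part of why this mistake is easy to miss.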
To get the help text on %in%, by the way, you need to quote it:
?"%in%"