R - Pitfalls in converting a factor to a numeric value

Sometimes, after some data manipulation or after reading in a new file, a numerical attribute such as an id ends up stored as a factor. Let’s have a look at the following example, in which I simply define a vector of numbers (e.g. ids) and convert it to a factor:

> id <- factor(seq(10000, 20000, 1))
> str(id)
 Factor w/ 10001 levels "10000","10001",..: 1 2 3 4 5 6 7 8 9 10 ...

In one of my scripts I wanted to convert the factor back to a numerical value. For this purpose, I called as.numeric() directly on the factor, without actually checking the result:

> str(as.numeric(id))
 num [1:10001] 1 2 3 4 5 6 7 8 9 10 ...

As you can see, a numerical vector is returned, but it runs from 1 to 10001 instead of from 10000 to 20000, as I had expected. In retrospect this is logical: a factor stores its values as integer codes that point to a set of levels, and as.numeric() returns those codes. Factors don’t care whether their levels look like numbers or characters. In practice, though, this can cause a lot of confusion, especially if you try to join different datasets by an id that was converted in the wrong way. So, if you want to convert a factor back to a numerical value, convert it to character first:

> str(as.numeric(as.character(id)))
 num [1:10001] 10000 10001 10002 10003 10004 ...
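
To make the difference between codes and labels easier to see, here is a minimal sketch on a small, made-up factor (the vector f and the other variable names are just for illustration, they are not from my script). It also shows as.numeric(levels(f))[f], an alternative that the R help page for factor describes as slightly more efficient, because each level is converted only once.

```r
# Hypothetical example factor whose labels look like numbers
f <- factor(c("30", "10", "20", "10"))

levels(f)       # the labels, sorted as strings: "10" "20" "30"
as.integer(f)   # the internal integer codes: 3 1 2 1

# Pitfall: as.numeric() on a factor returns the internal codes
codes <- as.numeric(f)                        # 3 1 2 1

# Safe: convert to character first, then to numeric
via_character <- as.numeric(as.character(f))  # 30 10 20 10

# Alternative: convert the levels once, then index them by the factor
via_levels <- as.numeric(levels(f))[f]        # 30 10 20 10
```

Both safe variants should give the same result. Note that either one returns NA (with a warning) for any level that cannot be parsed as a number, so it is worth checking the result for unexpected NAs.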
