= c("dog", "dog", "cat", "cat", "cat", "dog", "bird", "dog", "bird")
x = factor(x)
fx fx
[1] dog dog cat cat cat dog bird dog bird
Levels: bird cat dog
Ben Tupper
September 30, 2024
Have you ever been frustrated by factors
in R? factors
are vectors where elements have been grouped into categories which are called “levels”. Recently we had a discussion about what makes factors
sometimes seem opaque. One thing we agreed upon is that the nomenclature (“factors” and “levels”) aren’t as intuitive as other names might be such as “categoricals” and “groups” (or “categories”). Fortunately, a rose by any other name smells as sweet.
Many operations in data science manipulations depend upon factored (categorical! grouped!) data. In R this is very obvious when splitting data sets, plotting when coloring by group and when performing by-group statistics.
The forcats R package from the tidyverse does a masterful job of helping users navigate code factors. But there’s no harm in looking to the base R utilities to gain a better handle of factors.
Here we have a vector of strings (characters!) This the most obvious case - it just makes sense right out of the box. We can ask R to group these (factor them!) which it does readily in alphabetical order.
[1] dog dog cat cat cat dog bird dog bird
Levels: bird cat dog
You can get a vector of the levels.
You can count the number of levels in the factor.
Now this gets a little trickier. Suppose you wanted to know what level (group? category?) each element belongs to. R can tell you the indecies into the levels vector.
Whoa! Say what?
Well, R is telling us that the first two elements in fx belong to the level 3 group - which is “dog”. The next three elements belong to the “cat” level which is the 2nd level. Did you catch that?
What if you want the order to be dogs, cats and then birds? Just specify those as the levels
argument.
Equally intuitive is the idea behind factoring integer vectors. Note that we indicate to R that we are specifying integers with the trailing “L” after each number. The “L” comes from “long integer” which has it’s own [history](https://www.techopedia.com/definition/24004/long-integer.
Here you can see that the levels (groups) are 0, 3 and 9. But if we ask for the levels you’ll see that internally R is helding them as characters (strings)!
That’s just the way R handles it - it maintains the groupings (levels) as characters which are the most intuitive categorical data types.
So what happens when you ask for the fatcors as.numeric()
?
Oh, it’s the indices again, just like with the animal example above.
So, you should be pausing here and thinking about how R will make character grouping levels if we feed is real-numbers (not whole integers). We’ll provide 6 real numbers and then see what it does…
[1] 3.14 2.19 3.2 2.0001 1e-04 0
Levels: 0 1e-04 2.0001 2.19 3.14 3.2
Oh, it makes one grouping level for each input value. Well, that sort of makes sense, but also brings one the realization that factoring real numbers doesn’t have much value. What’s the point of grouping if R makes a group for every element in the vector?
What you can do to group real numbers is use cut()
.
cut()
on real numbersCut divides a set of real numbers into groups based upon boundaries (aka “breaks”). We’ll take the same collection of real numbers and cut them into groups: 0-1, 1-2, 2-3, 3-4 where the left hand boundary is inclusive.
[1] (3,4] (2,3] (3,4] (2,3] [0,1] [0,1]
Levels: [0,1] (1,2] (2,3] (3,4]
Well, 4 groups just like we spcified! This makes a bit of sense since we are cutting into groups 0-1, 1-2, 2-3, and 3-4.
The square bracket mean “inclusive” [
while the (
means “exclusive” boundaries.
So, let’s see the what we can know about the levels.
Once again, the levels (groupings) are returned to us as characters We could specify our own special group names using the labels
argument.
fx = cut(x, c(0,1,2,3,4), include.lowest = TRUE, labels = c("almost none", "low", "medium", "high"))
fx
[1] high medium high medium almost none almost none
Levels: almost none low medium high
This is different than what we have seen before - in this case the actual values have been changed to the grouping label we provided. This provides a mechanism for you to transform real numeric data to labels quickly.
And can we get back to the numeric index mapping?
Yup!
factor()
provides a means for grouping elements in a vector - they work most intuitively with character and integer vectors. Use cut()
to do similar groupings using real numbers.
Thanks for reading our blog! If you would like to learn more about the Tandy Center for Ocean Forecasting then please feel free to contact us.