Factor foo

Introduction to factors
R-code
analysis
Author

Ben Tupper

Published

September 30, 2024

From https://clipart-library.com/clipart/6cr5qM4oi.htm

Fooey!

Have you ever been frustrated by factors in R? Factors are vectors whose elements have been grouped into categories called “levels”. Recently we had a discussion about what makes factors sometimes seem opaque. One thing we agreed upon is that the nomenclature (“factors” and “levels”) isn’t as intuitive as other names might be, such as “categoricals” and “groups” (or “categories”). Fortunately, a rose by any other name smells as sweet.

Many data science operations depend upon factored (categorical! grouped!) data. In R this is most obvious when splitting data sets, coloring plots by group, and computing by-group statistics.
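As a small sketch of the splitting and by-group statistics mentioned above, here is a made-up example. The `weight` values are hypothetical and exist only for illustration; `split()` and `tapply()` are base R functions.

```r
# Hypothetical toy data: one animal label and one body weight (kg) per element
animal = factor(c("dog", "dog", "cat", "cat", "cat", "dog", "bird", "dog", "bird"))
weight = c(20, 25, 4, 5, 3, 30, 0.1, 22, 0.2)

# split() returns a list with one element per level of the factor
by_animal = split(weight, animal)
names(by_animal)

# tapply() applies a function (here mean) to each group
mean_weight = tapply(weight, animal, mean)
mean_weight
```

Both results are organized by the levels of the factor, which is why the grouping variable matters so much.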

The forcats R package from the tidyverse does a masterful job of helping users navigate factors. But there’s no harm in looking to the base R utilities to get a better handle on factors.

Factoring character vectors

Here we have a vector of strings (characters!). This is the most obvious case - it just makes sense right out of the box. We can ask R to group these (factor them!), which it readily does, ordering the levels alphabetically.

x = c("dog", "dog", "cat", "cat", "cat", "dog", "bird", "dog", "bird")
fx = factor(x)
fx
[1] dog  dog  cat  cat  cat  dog  bird dog  bird
Levels: bird cat dog

You can get a vector of the levels.

levels(fx)
[1] "bird" "cat"  "dog" 

You can count the number of levels in the factor.

nlevels(fx)
[1] 3

Get the level per element

Now this gets a little trickier. Suppose you wanted to know what level (group? category?) each element belongs to. R can tell you the indices into the levels vector.

as.numeric(fx)
[1] 3 3 2 2 2 3 1 3 1

Whoa! Say what?

Well, R is telling us that the first two elements in fx belong to the level 3 group - which is “dog”. The next three elements belong to the “cat” level which is the 2nd level. Did you catch that?
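To make the index idea concrete, the sketch below uses the indices to look the level names up again. The variable names `codes` and `looked_up` are just illustrative choices.

```r
x = c("dog", "dog", "cat", "cat", "cat", "dog", "bird", "dog", "bird")
fx = factor(x)

# Position of each element's level within levels(fx)
codes = as.integer(fx)

# Use those positions to index the levels vector
looked_up = levels(fx)[codes]

# The lookup recovers the original strings
identical(looked_up, x)
```

In other words, a factor can be thought of as a vector of integer positions plus a lookup table of level names.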

Specify your own order

What if you want the order to be dogs, cats and then birds? Just specify those as the levels argument.

fx = factor(x, levels = c("dog", "cat", "bird"))
fx
[1] dog  dog  cat  cat  cat  dog  bird dog  bird
Levels: dog cat bird
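One thing worth checking with a custom order is how it changes the indices, and what happens to values that are not listed in levels. The sketch below leaves "bird" out on purpose, purely for illustration.

```r
x = c("dog", "dog", "cat", "cat", "cat", "dog", "bird", "dog", "bird")

# With a custom order, dog is level 1, cat is level 2 and bird is level 3
fx = factor(x, levels = c("dog", "cat", "bird"))
as.integer(fx)

# Illustration only: "bird" is deliberately omitted from the levels
fy = factor(x, levels = c("dog", "cat"))
fy

# Elements that match no level become NA
sum(is.na(fy))
```

So the levels argument controls both the order of the groups and which values count as belonging to a group at all.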

Factoring integer vectors

Equally intuitive is the idea behind factoring integer vectors. Note that we indicate to R that we are specifying integers with the trailing “L” after each number. The “L” comes from “long integer”, which has its own [history](https://www.techopedia.com/definition/24004/long-integer).

x = c(3L, 0L, 0L, 3L, 9L, 9L, 0L)
fx = factor(x)
fx
[1] 3 0 0 3 9 9 0
Levels: 0 3 9

Here you can see that the levels (groups) are 0, 3 and 9. But if we ask for the levels you’ll see that internally R is holding them as characters (strings)!

levels(fx)
[1] "0" "3" "9"

That’s just the way R handles it - it maintains the groupings (levels) as characters which are the most intuitive categorical data types.

So what happens when you ask for the factor with as.numeric()?

as.numeric(fx)
[1] 2 1 1 2 3 3 1

Oh, it’s the indices again, just like with the animal example above.
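Because the conversion returns indices rather than the original numbers, it may help to see one way of getting the numbers back. The sketch below goes through the character levels first, which is the approach suggested in the base R help page for factor.

```r
x = c(3L, 0L, 0L, 3L, 9L, 9L, 0L)
fx = factor(x)

# These are level indices, not the original values
as.integer(fx)

# Convert to character first, then to integer, to recover the values
recovered = as.integer(as.character(fx))
recovered

# Equivalent: convert the levels once, then index them with the factor
recovered2 = as.integer(levels(fx))[fx]
recovered2
```

Both versions return the original 3, 0, 0, 3, 9, 9, 0.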

Factoring real-number vectors

So, you should be pausing here and thinking about how R will make character grouping levels if we feed it real numbers (not whole integers). We’ll provide 6 real numbers and then see what it does…

x = c(3.14, 2.19, 3.2, 2.0001, 0.0001, 0)
fx = factor(x)
fx
[1] 3.14   2.19   3.2    2.0001 1e-04  0     
Levels: 0 1e-04 2.0001 2.19 3.14 3.2

Oh, it makes one grouping level for each unique input value. Well, that sort of makes sense, but it also brings one to the realization that factoring real numbers doesn’t have much value. What’s the point of grouping if R makes a group for every element in the vector?

What you can do to group real numbers is use cut().

Use cut() on real numbers

cut() divides a set of real numbers into groups based upon boundaries (aka “breaks”). We’ll take the same collection of real numbers and cut them into groups: 0-1, 1-2, 2-3, 3-4, where the right-hand boundary of each group is inclusive. Setting include.lowest = TRUE also makes the lowest boundary, 0, inclusive.

fx = cut(x, c(0,1,2,3,4), include.lowest = TRUE)
fx
[1] (3,4] (2,3] (3,4] (2,3] [0,1] [0,1]
Levels: [0,1] (1,2] (2,3] (3,4]

Well, 4 groups just like we specified! This makes a bit of sense since we are cutting into groups 0-1, 1-2, 2-3, and 3-4.

A square bracket, [ or ], means the boundary is “inclusive” while a parenthesis, ( or ), means the boundary is “exclusive”.
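Which side of each interval is inclusive is controlled by the right argument of cut(), which defaults to TRUE. The sketch below compares the two settings; the single test value 2 is chosen only because it sits exactly on a break.

```r
x = c(3.14, 2.19, 3.2, 2.0001, 0.0001, 0)
breaks = c(0, 1, 2, 3, 4)

# Default (right = TRUE): intervals are closed on the right, such as (1,2]
right_closed = levels(cut(x, breaks, include.lowest = TRUE))
right_closed

# right = FALSE: intervals are closed on the left, such as [1,2)
left_closed = levels(cut(x, breaks, right = FALSE, include.lowest = TRUE))
left_closed

# A value sitting exactly on a break lands in a different group
on_break_right = as.character(cut(2, breaks))
on_break_left = as.character(cut(2, breaks, right = FALSE))
c(on_break_right, on_break_left)
```

With right = FALSE, include.lowest = TRUE closes the highest interval instead of the lowest, so the top boundary 4 is still included.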

So, let’s see what we can learn about the levels.

nlevels(fx)
[1] 4
levels(fx)
[1] "[0,1]" "(1,2]" "(2,3]" "(3,4]"

Once again, the levels (groupings) are returned to us as characters. We could specify our own special group names using the labels argument.

fx = cut(x, c(0,1,2,3,4), include.lowest = TRUE, labels = c("almost none", "low", "medium", "high"))
fx
[1] high        medium      high        medium      almost none almost none
Levels: almost none low medium high

This is different from what we have seen before - in this case the printed values have been changed to the grouping labels we provided. This provides a mechanism for you to transform real numeric data to labels quickly.

And can we get back to the numeric index mapping?

as.numeric(fx)
[1] 4 3 4 3 1 1

Yup!
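Notice that no element landed in the “low” group, yet it is still one of the levels. A quick sketch with table() and droplevels(), both base R, shows how an empty level is counted and how it can be removed if you don't want it.

```r
x = c(3.14, 2.19, 3.2, 2.0001, 0.0001, 0)
fx = cut(x, c(0, 1, 2, 3, 4), include.lowest = TRUE,
         labels = c("almost none", "low", "medium", "high"))

# Count the elements in each level; empty levels are reported as 0
counts = table(fx)
counts

# droplevels() removes levels that have no elements
fx_used = droplevels(fx)
levels(fx_used)
```

Whether to keep empty levels depends on the task. Keeping them can be useful when you want every group to appear in a summary or a plot legend, even when its count is zero.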

Summary

factor() provides a means for grouping elements in a vector - it works most intuitively with character and integer vectors. Use cut() to do similar groupings with real numbers.


Contact us

Thanks for reading our blog! If you would like to learn more about the Tandy Center for Ocean Forecasting then please feel free to contact us.