Dplyr summarize count if

11/10/2023

To check for missing values in R you might be tempted to use the equality operator == with your vector on one side and NA on the other. If you insist, you'll get a useless result, because any comparison with NA is itself NA. Use is.na() instead. For a five-element vector x whose third and fifth elements are missing:

    is.na(x)
    # FALSE FALSE TRUE FALSE TRUE

Armed with that knowledge, let's explore how to calculate some basic summary statistics about missing values in your data. First of all, to count the total number of NAs in a vector you can simply sum() up the result of is.na().

    sum(is.na(x))
    # 2

Confused why you can sum TRUE and FALSE values? R automatically converts logical vectors to integer vectors when using arithmetic functions. In the process TRUE is turned into 1 and FALSE into 0. Thus, sum(is.na(x)) gives you the total number of missing values in x.

To get the proportion of missing values, divide the result of the previous operation by the length of the input vector. Does that "formula" look somehow familiar? Summing up all elements of a vector and dividing by the total number of elements is exactly how you calculate the arithmetic mean! So, instead of using sum() and length(), we can simply use mean() to get the proportion of NAs in a vector.

    mean(is.na(x))
    # 0.4

Enough of vectors, though. Let's look at counting missing values in a data frame. To illustrate the concepts, let me first add some missing values to the mtcars dataset. The is.na() function is generic and has a method for data frames, so you can pass it a data frame directly. An excerpt of the result:

    # Mazda RX4          FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
    # Mazda RX4 Wag      FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
    # Datsun 710         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Hornet 4 Drive     FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Hornet Sportabout  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Valiant            FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Duster 360         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Merc 240D          FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Merc 230           FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Merc 280           FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Merc 280C          FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
    # Merc 450SE         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Merc 450SL         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Merc 450SLC        FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # Cadillac Fleetwood FALSE FALSE FALSE FALSE FALSE FALSE FALSE

As you can see, the result is a matrix of logical values. Getting the total number of NAs is then simple, because sum() works with matrices as well as vectors.

    sum(is.na(mtcars))
    # 15

Arguably, though, the total number of missing values in a dataset is a rather crude measure. It gets much more interesting if we look at missing values across variables and records in the dataset. That enables detecting patterns that might inform future modeling decisions.

Counting NAs across either rows or columns can be achieved with the apply() function. It takes three main arguments: X is the input matrix, MARGIN is an integer, and FUN is the function to apply to each row or column. MARGIN = 1 applies the function to each row and MARGIN = 2 to each column. Applied to the rows, the result is a named vector with one count per car:

    apply(X = is.na(mtcars), MARGIN = 1, FUN = sum)
    # Mazda RX4  Mazda RX4 Wag  Datsun 710  Hornet 4 Drive
    # Hornet Sportabout  Valiant  Duster 360  Merc 240D
    # Merc 450SL  Merc 450SLC  Cadillac Fleetwood  Lincoln Continental
    # Chrysler Imperial  Fiat 128  Honda Civic  Toyota Corolla
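As a minimal, self-contained sketch of the steps described in this post, the following puts the pieces together in base R. The vector x, the copy named cars, and the positions of the injected NAs are assumptions chosen for illustration, so the data frame counts differ from the ones quoted in the text.

```r
# Hypothetical example vector: five elements, two of them missing
x <- c(1, 2, NA, 4, NA)

# Comparing with == never detects NA: every comparison yields NA
x == NA

# is.na() returns a logical vector marking the missing elements
is.na(x)

# TRUE counts as 1 and FALSE as 0, so sum() counts the NAs
# and mean() gives their proportion
n_missing <- sum(is.na(x))      # 2
prop_missing <- mean(is.na(x))  # 0.4

# Work on a copy of mtcars and inject NAs at assumed positions
cars <- mtcars
cars[c(1, 2, 11), c("hp", "qsec")] <- NA

# Total number of NAs in the data frame
total_na <- sum(is.na(cars))    # 6

# NAs per row (MARGIN = 1) and per column (MARGIN = 2)
na_per_row <- apply(X = is.na(cars), MARGIN = 1, FUN = sum)
na_per_col <- apply(X = is.na(cars), MARGIN = 2, FUN = sum)

# Show only the rows and columns that actually contain NAs
na_per_row[na_per_row > 0]
na_per_col[na_per_col > 0]
```

If you only need the counts, rowSums(is.na(cars)) and colSums(is.na(cars)) should give the same numbers and are a common shortcut for the two apply() calls.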
