How to manage your files with fs

Last updated on May 2, 2021 8 min read workflow

Organizing your data, code, documents, and figures in a consistent way that’s easy to navigate is one of the most thoughtful and generous things you’ll ever do. It’s also a basic requirement for a reproducible data science project. In RStudio, this process is made easier with .Rproj files (read more) and the here package.

What `fs` does?

The fs package adds another layer of project organization by giving us access to a “uniform interface to file system operations.” Many of the fs functions aren’t new, but they now have consistent names (path_, file_, dir_, and link_) and return tidy paths (using / and not // or trailing /). Windows users have likely experienced the bane of collaborating with Unix users (and vice versa) because of the difference in file paths.

How `fs` works

We will be using Hadley Wickham’s nycflights13 package as an example project to demonstrate some of the features of fs. First we’ll get a ‘big picture’ view of the project folder and it’s contents using fs::dir_tree():

fs::dir_tree("nycflights13")
## nycflights13
## ├── DESCRIPTION
## ├── NAMESPACE
## ├── NEWS.md
## ├── R
## │   ├── airlines.R
## │   ├── airport.R
## │   ├── flights.R
## │   ├── planes.R
## │   └── weather.R
## ├── README.md
## ├── cran-comments.md
## ├── data
## │   ├── airlines.rda
## │   ├── airports.rda
## │   ├── flights.rda
## │   ├── planes.rda
## │   └── weather.rda
## ├── data-raw
## │   ├── airlines.R
## │   ├── airlines.csv
## │   ├── airports.R
## │   ├── airports.csv
## │   ├── airports.dat
## │   ├── airports.svg
## │   ├── flights.R
## │   ├── planes.R
## │   ├── planes.csv
## │   ├── weather.R
## │   └── weather.csv
## ├── man
## │   ├── airlines.Rd
## │   ├── airports.Rd
## │   ├── flights.Rd
## │   ├── planes.Rd
## │   └── weather.Rd
## ├── nycflights.Rproj
## └── revdep
##     ├── README.md
##     ├── email.yml
##     └── problems.md

dir_tree() is similar to the tree command in Linux. If we want to narrow the scope of the tree, we can specify a subfolder and use the regexp or glob arguments.

fs::dir_tree(path = "nycflights13/data-raw", regexp = "[.]csv$")
## nycflights13/data-raw
## ├── airlines.csv
## ├── airports.csv
## ├── planes.csv
## └── weather.csv

fs does more than just print folder contents to the screen. We can also construct clean file paths with fs::path()

# create path to raw data 
raw_data_path <- fs::path("nycflights13", "data-raw")
# view these files in the OS
fs::file_show(raw_data_path)

Creating folders and files (with less worry)

I have a tendency to create data and code folders frequently. The fs::dir_create() and fs::file_create() functions are great because they don’t overwrite my existing folders or files (no need for if(!file.exists("data)) statements anymore!).

# assume I already have a data-csv folder and data-csv/README.md file
fs::dir_tree("data-csv")
# what is in this file?
base::system(command = "cat 'data-csv/README.md'")

# data-csv
# └── README.md
# CSV Data Files
# ==============
# 
# PLEASE!!! ‹(•_•)› NEVER CHANGE ME! ––•––√\/––√\/––•––•––√\/––√\/––•––•

Obviously some files should not be overwritten or altered after they’ve been created.

# will we overwrite this? try to create new data folder
fs::dir_create("data-csv")
# and try to create new README.md file
fs::file_create("data-csv/README.md")
# check their contents 
base::system(command = "cat 'data-csv/README.md'")

# CSV Data Files
# ==============
# 
# PLEASE!!! ‹(•_•)› NEVER CHANGE ME! ––•––√\/––√\/––•––•––√\/––√\/––•––•

Well that’s good news.

Directory/file info in tibbles!

The best part of fs is that we can store all this meta data from a project repo and manipulate it just like any tibble in tidyverse!

# store directory inforamtion in NycF13DirInfo
NycF13DirInfo <- fs::dir_info("nycflights13", recurse = TRUE)
NycF13DirInfo %>% 
  # check the files only
  dplyr::filter(type == "file") %>% 
  # that are biggest 
  dplyr::arrange(desc(size)) %>% 
  # only view size, birth time, and change time
  dplyr::select(path, size, birth_time, change_time) %>% 
  # return top 5
  utils::head(5)
##                                 path    size          birth_time
## 1      nycflights13/data/flights.rda    4.1M 2019-09-16 10:29:09
## 2  nycflights13/data-raw/weather.csv   2.19M 2019-09-16 10:29:09
## 3 nycflights13/data-raw/airports.dat 830.38K 2019-09-16 10:29:09
## 4   nycflights13/data-raw/planes.csv  241.4K 2019-09-16 10:29:09
## 5 nycflights13/data-raw/airports.svg 226.37K 2019-09-16 10:29:09
##           change_time
## 1 2020-05-29 11:37:44
## 2 2020-05-29 11:37:44
## 3 2020-05-29 11:37:44
## 4 2020-05-29 11:37:44
## 5 2020-05-29 11:37:44

This can come in handy if you’re looking for the last time you modified a file, or if you’re looking for a particularly large/small file.

Moving and copying files

We can also easily copy files from one location to another. Below we move all the .csv files from data-raw into the parent data folder.

NycF13DirInfo %>% 
  # find all .csv files 
  dplyr::filter(stringr::str_detect(string = path, pattern = ".csv")) %>% 
  # move them into the parent data/ folder
  fs::file_copy(path = .$path, new_path =  fs::path("data-csv"))
# sanity check
fs::dir_tree("data-csv")
## data-csv
## ├── README.md
## ├── airlines.csv
## ├── airports.csv
## ├── planes.csv
## └── weather.csv

What if we’d like to read all of these files into one big data frame? We can create an fs_path which is a named character vector (with some additional coloring on capable terminals).

# csv files
csv_files <- fs::dir_ls(path = "data-csv", glob = "*.csv")
utils::str(csv_files)
##  'fs_path' Named chr [1:4] "data-csv/airlines.csv" "data-csv/airports.csv" ...
##  - attr(*, "names")= chr [1:4] "data-csv/airlines.csv" "data-csv/airports.csv" "data-csv/planes.csv" "data-csv/weather.csv"
base::class(csv_files)
## [1] "fs_path"   "character"
base::is.character(csv_files)
## [1] TRUE

We can now pass this to purrr::map_df() and get all the .csv files in a huge dataset.

AllNycFlights <- csv_files %>%
  purrr::map_df(.f = read_csv, .id = "file", col_types = cols()) 
AllNycFlights %>% glimpse()
## Rows: 30,911
## Columns: 33
## $ file         <chr> "data-csv/airlines.csv", "data-csv/airlines.csv", "data-…
## $ carrier      <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "M…
## $ name         <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska A…
## $ faa          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lat          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ alt          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ tz           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ dst          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ tzone        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ tailnum      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ year         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ type         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ manufacturer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ model        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ engines      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ seats        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ speed        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ engine       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ origin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ month        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ day          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ hour         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ temp         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ dewp         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ humid        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ wind_dir     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ wind_speed   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ wind_gust    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ precip       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pressure     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ visib        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ time_hour    <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Yikes! Crazy huh? Well, this isn’t very practical to work with, but we could use dplyr::filter() to get the datasets we needed.

The different values for file should correspond to the number of observations in each dataset in the nycflights13 package.

# are these the number of observations per dataset?
AllNycFlights %>% dplyr::count(file, sort = TRUE)
## # A tibble: 4 x 2
##   file                      n
##   <chr>                 <int>
## 1 data-csv/weather.csv  26115
## 2 data-csv/planes.csv    3322
## 3 data-csv/airports.csv  1458
## 4 data-csv/airlines.csv    16
# sanity check 
nycflights13::weather %>% base::nrow()
## [1] 26115
# yes

Works well with `purrr` package!

Ok, let’s time-stamp these files and export them into a today-csv folder fun!

# create today-csv folder
fs::dir_create("today-csv")
# create path
today_csv_path <- fs::path("today-csv/", base::noquote(lubridate::today()), "/")
# remove previous path from file
AllNycFlights %>%
  dplyr::mutate(file = stringr::str_remove_all(string = file, 
                                               pattern = "data-csv/")) %>% 
  # group by file
  dplyr::group_by(file) %>% 
  # keep only distinct rows (probably redundant)
  dplyr::group_map(~ dplyr::distinct(.x, .keep_all=TRUE), keep = TRUE) %>%
  purrr::walk(~.x %>% 
                 readr::write_csv(path = paste0(today_csv_path, 
                                         "_",
                                         unique(.x$file))))
# sanity check!
fs::dir_tree("today-csv")
## today-csv
## ├── 2020-05-29_airlines.csv
## ├── 2020-05-29_airports.csv
## ├── 2020-05-29_planes.csv
## └── 2020-05-29_weather.csv

Check out the vignette and package site for more information!

How to manage your files with fs

What `fs` does?

How `fs` works

Creating folders and files (with less worry)

Directory/file info in tibbles!

Moving and copying files

Works well with `purrr` package!

Martin Frigaard

R Programmer | Shiny Developer | University Instructor

How to manage your files with fs

What fs does?

How fs works

Creating folders and files (with less worry)

Directory/file info in tibbles!

Moving and copying files

Works well with purrr package!

Martin Frigaard

R Programmer | Shiny Developer | University Instructor

What `fs` does?

How `fs` works

Works well with `purrr` package!