
split_files_by_size() splits a vector of file paths into chunks based on their size. It is useful for managing large files or datasets that need to be processed in smaller parts.

The function groups files into chunks so that the total size of the files in each chunk does not exceed the specified limit. A file that is larger than the limit on its own is placed in a chunk by itself. By default, the files are sorted by size in increasing order before chunking; set decreasing_size = TRUE to sort from largest to smallest, or order_by_size = FALSE to skip sorting.
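
The grouping described above can be sketched in base R as a sequential greedy pass. This is only an illustration, assuming that files are taken in order and that a new chunk starts whenever the next file would push the running total over the limit. chunk_by_size() is a hypothetical helper, not a function of the package, and the package's actual implementation may differ. The sizes are made-up values in KB.

```r
# Hypothetical helper (not part of the package): sequential greedy chunking.
chunk_by_size <- function(paths, sizes, max_size) {
  chunks <- list()
  current <- character(0)
  current_size <- 0
  for (i in seq_along(paths)) {
    # Close the current chunk if adding this file would exceed the limit.
    if (length(current) > 0 && current_size + sizes[i] > max_size) {
      chunks[[length(chunks) + 1]] <- current
      current <- character(0)
      current_size <- 0
    }
    # An oversized file still lands here, alone in a fresh chunk.
    current <- c(current, paths[i])
    current_size <- current_size + sizes[i]
  }
  # Flush the last, partially filled chunk.
  if (length(current) > 0) {
    chunks[[length(chunks) + 1]] <- current
  }
  chunks
}

# Illustrative paths and sizes (KB), already sorted in increasing order.
paths <- c("file1.txt", "file3.txt", "file5.txt", "file4.txt", "file2.txt")
sizes <- c(126, 165, 263, 285, 465)
chunks <- chunk_by_size(paths, sizes, max_size = 652)
str(chunks)
```

With a limit of 652, the first three files (554 in total) share a chunk, and each of the two larger files ends up in a chunk of its own because adding it to the previous chunk would exceed the limit.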

Usage

split_files_by_size(
  files,
  max_size = fs::fs_bytes("1GB"),
  order_by_size = TRUE,
  decreasing_size = FALSE,
  root = NULL
)

Arguments

files

A character vector of file paths.

max_size

(optional) An integer or fs_bytes value specifying the maximum total size (in bytes) allowed for each chunk (default: fs_bytes("1GB")).

order_by_size

(optional) A logical flag indicating whether to sort the files by size before chunking (default: TRUE).

decreasing_size

(optional) A logical flag indicating whether to sort the files in decreasing order of size. This is only relevant if order_by_size is TRUE (default: FALSE).

root

(optional) A string specifying the root directory of the files. If NULL, the function will treat the paths as absolute (default: NULL).

Value

A list of character vectors, one per chunk. The combined size of the files in each vector does not exceed max_size, except when a chunk holds a single file that is itself larger than max_size.
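
As a rough sketch of how such a list might be consumed, the snippet below walks over the chunks one at a time. The chunks object is a hand-written stand-in for a value returned by split_files_by_size(), and the per-chunk step is a placeholder for real processing.

```r
# Stand-in for a value returned by split_files_by_size() (assumed shape).
chunks <- list(
  c("file1.txt", "file3.txt", "file5.txt"),
  "file4.txt",
  "file2.txt"
)

# Number of files in each chunk.
n_files <- lengths(chunks)

# Handle one chunk at a time; here each chunk is only reported.
for (i in seq_along(chunks)) {
  message("chunk ", i, ": ", paste(chunks[[i]], collapse = ", "))
}
```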

Examples

library(fs)
library(readr)

files <- c("file1.txt", "file2.txt", "file3.txt", "file4.txt", "file5.txt")

dir <- tempfile("dir")
dir.create(dir)

for (i in files) {
  write_lines(rep(letters, sample(1000:10000, 1)), file.path(dir, i))
}

files <- sort_files_by_size(files, root = dir)
sizes <- file_size(file.path(dir, files)) |> as.character() |> trimws()
names(sizes) <- files
sizes
#> file1.txt file3.txt file5.txt file4.txt file2.txt 
#>    "126K"    "165K"    "263K"    "285K"    "465K" 

total_size <- file_size(file.path(dir, files)) |> sum()
max_size <- fs::fs_bytes(total_size / 2)

max_size
#> 652K

split_files_by_size(
  files,
  max_size = max_size,
  root = dir
)
#> [[1]]
#> file1.txt file3.txt file5.txt 
#> 
#> [[2]]
#> file4.txt
#> 
#> [[3]]
#> file2.txt
#>