Argus Media: Home assignment — Data scientist, Curves

Author

Daniel Vartanian

Published

July 16, 2025

Overview

This report presents a fully reproducible data pipeline developed as part of a home assignment for the Data Scientist position (Curves) at the Argus Media Group, a leading independent provider of global energy and commodity market intelligence.

The second part of this assignment—a Shiny application—can be accessed here, with its code repository available here (see the note below).

This website is only visible for those with the link. It is not indexed by search engines and is not intended for public access.

Please note that the code repositories are private and require access permissions. If you would like access, please contact the author.

Assignment

This section presents the assignment instructions as provided by the Argus Media Group. For clarity, some links, emphasis, and minor corrections have been added. The original document is available here.

About this Test

This test is composed of two parts.

In the first one, you need to show your abilities with data.table or base R.
In the second problem, you need to develop a Shiny dashboard for visualization.

Deliverables

Problem 1: A single R script with the code used to populate the missing values.

Problem 2: Create an interactive Shiny application that showcases the visualization from the previous exercise.

Files should be sent to: sueidi.souza@argusmedia.com with subject line: Test Results - Data Science, Latin America.

Problem 1

The exercise requires the candidate to reproduce a simple example in R.

The relevant data is in the Excel file data_P2_ds_test.xlsx.

The task involves data wrangling skills for populating one table with missing values. Please proceed with the instructions below:

The objective is to populate the market’s missing values.
- In the tab MARKET, we have incomplete data on six [correction: seven] markets in wide format.
  - As you may observe, the market data starts and ends on different terms.
- In the tab REFERENCE, we have data on three reference markets in long format.
  - To populate the columns on the MARKET tab we apply the factors obtained from the reference curve on the latest available market data.
- In column I on the tab MARKET, we provide an example (MARKET_8) of how these markets should be computed.
Table below discloses which reference to apply respectively to each market.

MARKET	REFERENCE
MARKET_1	REF_MARKET_2
MARKET_2	MARKET_1
MARKET_3	REF_MARKET_2
MARKET_4	REF_MARKET_1
MARKET_5	REF_MARKET_3
MARKET_6	MARKET_5
MARKET_7	REF_MARKET_2

You should provide an elegant solution that calculates all market values in one go using data.table.
The final output table must have:
- Complete market data with 48 data points each, starting from 2024-04-01 [correction: 2025-04-01].
- No missing values.

Problem 2

Create a Shiny app that takes the data from Problem 1 and displays the dates and the data for the Market in an editable table. The user needs to be able to edit the data for the reference market on any date (for instance, changing March 2026 from 73.51460126 to 77.00), and the app will show a plot with the original Market values (Original) and the new Market values (New) calculated with the new data points.

Methods

Source of Data

The data used in this report were provided by the Argus Media Group as part of the home assignment for the Data Scientist position. The original dataset is available here.

Data Munging

The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processes were made using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.) and several R packages.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproduction

The pipeline is fully reproducible and can be run again at any time. The renv package is used to manage the project dependencies, ensuring that the code runs with the same package versions as when it was developed.

Problem 1: Missing Data

Below is a unified pipeline that produces the required results using data.table and base R. The following sections break down each step of the data wrangling process.

#' Argus Media: Home assignment — Data scientist, Curves
#'
#' @description
#'
#' This file presents a fully reproducible data pipeline developed as part of
#' a home assignment for the Data Scientist position at the
#' [Argus Media Group](https://www.argusmedia.com),
#' a leading independent provider of global energy and commodity market
#' intelligence.
#'
#' For a detailed report—including code, explanations, and outputs—visit:
#' https://danielvartan.github.io/argus/
#'
#' @author Daniel Vartanian <danvartan@gmail.com> <linktr.ee/danielvartan>
#' @date 2025-07-11
#'
#' @license CC0 1.0 Universal (Public Domain Dedication)
#'
#' @references
#'
#' Vartanian, D. (2025). *Argus Media: Home assignment — Data scientist,
#' Curves* [Report]. https://danielvartan.github.io/argus

## Load Required Packages and Functions -----

library(data.table)
library(openxlsx) # -> Neither `data.table` nor base R provide native
                  #    support for reading `.xlsx` files.

accumulate_2 <- function(x, y, f, ...) {
  inputs <- Map(\(xi, yi) list(xi, yi), x, y)

  Reduce(
    \(acc, pair) f(acc, pair[[1]], pair[[2]], ...),
    inputs[-1],
    init = f(inputs[[1]][[1]], inputs[[1]][[1]], inputs[[1]][[2]], ...),
    accumulate = TRUE
  )
}

factor_mult <- function(previous, current, factor) {
  fifelse(!is.na(current), current, round(previous * factor, 2))
}

interpolate <- function(data, group) {
  market <- group |> unlist() |> unname()

  out <-
    data |>
    _[, reference_data[.SD, on = .(TERM, REF_MARKET)]] |>
    _[, VALUE := accumulate_2(VALUE, FACTOR, factor_mult)]

  if ((market %in% lookup$REF_MARKET) &&
      (!market %in% reference_data$REF_MARKET)) {
    reference_data <<-
      list(
        reference_data,
        out |>
          _[, !c("REF_MARKET", "FACTOR")] |>
          _[, c("REF_MARKET", "VALUE") := .(market, VALUE / shift(VALUE))] |>
          _[!is.na(VALUE)] |>
          setnames("VALUE", "FACTOR") |>
          _[, .(TERM, REF_MARKET, FACTOR)]
      ) |>
        rbindlist()
  }

  out[, VALUE]
}

## Import, Tidy, and Transform the Data -----

raw_data_file <- file.path("data-raw", "data_P2_ds_test.xlsx")

lookup <- rowwiseDT(
  MARKET=,     REF_MARKET=,
  "MARKET_1", "REF_MARKET_2",
  "MARKET_2", "MARKET_1",
  "MARKET_3", "REF_MARKET_2",
  "MARKET_4", "REF_MARKET_1",
  "MARKET_5", "REF_MARKET_3",
  "MARKET_6", "MARKET_5",
  "MARKET_7", "REF_MARKET_2"
)

market_data <-
  raw_data_file |>
  read.xlsx("MARKET") |>
  as.data.table() |>
  _[, TERM := as.IDate(TERM, origin = "1899-12-30")] |>
  melt(
    id.vars = "TERM",
    variable.name = "MARKET",
    value.name = "VALUE"
  ) |>
  _[order(MARKET, TERM)]

reference_data <-
  raw_data_file|>
  read.xlsx("REFERENCE_MARKET") |>
  as.data.table() |>
  _[, TERM := as.IDate(TERM, origin = "1899-12-30")] |>
  _[, VALUE := VALUE / shift(VALUE), by = REF_MARKET] |>
  _[!is.na(VALUE)] |>
  setnames("VALUE", "FACTOR") |>
  _[order(REF_MARKET, TERM)]

## Interpolate the Data -----

market_data <-
  market_data |>
  _[, lookup[.SD, on = .(MARKET)]] |>
  _[order(MARKET, TERM)] |>
  _[, VALUE := interpolate(.SD, .BY), by = MARKET] # "in one go"

## Arrange, Clean and Pivot the Data -----

market_data <-
  market_data |>
  _[, .(TERM, MARKET, VALUE)] |>
  dcast(TERM ~ MARKET, value.var = "VALUE") |>
  _[!is.na(TERM)] |>
  _[order(TERM)] |>
  na.omit()

## Final Output -----

market_data

Set the Environment

The openxlsx package is used to read Excel data, since neither data.table nor base R provide native support for reading .xlsx files.

While data.table offers the fread() function for fast data import, it only supports text-based formats such as CSV. To work with Excel files, you can either use a package like openxlsx or convert the file to CSV using a shell command such as in2csv before importing.

Code

library(data.table)
library(openxlsx) # See note above.

Add the Lookup Table

Code

lookup <- rowwiseDT(
  MARKET=,     REF_MARKET=,
  "MARKET_1", "REF_MARKET_2",
  "MARKET_2", "MARKET_1",
  "MARKET_3", "REF_MARKET_2",
  "MARKET_4", "REF_MARKET_1",
  "MARKET_5", "REF_MARKET_3",
  "MARKET_6", "MARKET_5",
  "MARKET_7", "REF_MARKET_2"
)

Import the Data

Code

raw_data_file <- file.path("data-raw", "data_P2_ds_test.xlsx")

Code

market_data <- raw_data_file |> read.xlsx("MARKET")

Code

reference_data <- raw_data_file|> read.xlsx("REFERENCE_MARKET")

Tidy the Data

Code

market_data <-
  market_data |>
  as.data.table() |>
  _[, TERM := as.IDate(TERM, origin = "1899-12-30")]

Code

reference_data <-
  reference_data |>
  as.data.table() |>
  _[, TERM := as.IDate(TERM, origin = "1899-12-30")]

Transform the Data

Code

reference_data <-
  reference_data |>
  _[, VALUE := VALUE / shift(VALUE), by = REF_MARKET] |>
  _[!is.na(VALUE)] |>
  setnames("VALUE", "FACTOR") |>
  _[order(REF_MARKET, TERM)]

Code

market_data <-
  market_data |>
  melt(
    id.vars = "TERM",
    variable.name = "MARKET",
    value.name = "VALUE"
  ) |>
  _[order(MARKET, TERM)]

Interpolate the Data

Code

accumulate_2 <- function(x, y, f, ...) {
  inputs <- Map(\(xi, yi) list(xi, yi), x, y)

  Reduce(
    \(acc, pair) f(acc, pair[[1]], pair[[2]], ...),
    inputs[-1],
    init = f(inputs[[1]][[1]], inputs[[1]][[1]], inputs[[1]][[2]], ...),
    accumulate = TRUE
  )
}

Code

factor_mult <- function(previous, current, factor) {
  fifelse(!is.na(current), current, round(previous * factor, 2))
}

Code

interpolate <- function(data, group) {
  market <- group |> unlist() |> unname()

  out <-
    data |>
    _[, reference_data[.SD, on = .(TERM, REF_MARKET)]] |>
    _[, VALUE := accumulate_2(VALUE, FACTOR, factor_mult)]

  if ((market %in% lookup$REF_MARKET) &&
      (!market %in% reference_data$REF_MARKET)) {
    reference_data <<-
      list(
        reference_data,
        out |>
          _[, !c("REF_MARKET", "FACTOR")] |>
          _[, c("REF_MARKET", "VALUE") := .(market, VALUE / shift(VALUE))] |>
          _[!is.na(VALUE)] |>
          setnames("VALUE", "FACTOR") |>
          _[, .(TERM, REF_MARKET, FACTOR)]
      ) |>
        rbindlist()
  }

  out[, VALUE]
}

Code

market_data <-
  market_data |>
  _[, lookup[.SD, on = .(MARKET)]] |>
  _[order(MARKET, TERM)] |>
  _[, VALUE := interpolate(.SD, .BY), by = MARKET]

Arrange and Pivot the Data

Code

market_data <-
  market_data |>
  _[, .(TERM, MARKET, VALUE)] |>
  dcast(TERM ~ MARKET, value.var = "VALUE") |>
  _[!is.na(TERM)] |>
  _[order(TERM)] |>
  na.omit()

Data Dictionary

Code

library(labelled)

Code

metadata <-
  market_data |>
  `var_label<-`(
    list(
      TERM = "Date of the market value",
      MARKET_1 = "Market 1 value",
      MARKET_2 = "Market 2 value",
      MARKET_3 = "Market 3 value",
      MARKET_4 = "Market 4 value",
      MARKET_5 = "Market 5 value",
      MARKET_6 = "Market 6 value",
      MARKET_7 = "Market 7 value",
      MARKET_8 = "Market 8 value"
    )
  ) |>
  generate_dictionary(details = "full") |>
  convert_list_columns_to_character()

Save the Valid Data

Code

library(here)
library(readr)

Data

Code

market_data |>
  readr::write_csv(
    here::here("data", "valid-data.csv")
  )

Code

market_data |>
  readr::write_rds(
    here::here("data", "valid-data.rds")
  )

Metadata

Code

metadata |>
  readr::write_csv(
    here::here("data", "metadata.csv")
  )

Code

metadata |>
  readr::write_rds(
    here::here("data", "metadata.rds")
  )

Visualize the Data

Code

library(brandr)
library(ggplot2)

Code

market_data <-
  market_data |>
  _[, TERM := as.IDate(TERM)] |>
  melt(
    id.vars = "TERM",
    variable.name = "MARKET",
    value.name = "VALUE"
  ) |>
  _[, MARKET := gsub("MARKET_", "", MARKET) |> as.integer()]

Code

market_data |>
  ggplot(
    aes(
      x = TERM,
      y = VALUE,
      color = as.factor(MARKET)
    )
  ) +
  geom_smooth(
    method = "gam",
    se = FALSE,
    formula = y ~ poly(x, 12)
  ) +
  scale_x_date(
    breaks = c(
      as.Date("2025-04-01"),
      as.Date("2026-01-01"),
      as.Date("2027-01-01"),
      as.Date("2028-01-01"),
      as.Date("2029-01-01")
    ),
    date_labels = "%b %Y"
  ) +
  scale_color_brand_d() +
  labs(
    x = "Term",
    y = "Market Value",
    linetype = "Market",
    color = "Market",
  )

Problem 2: Shiny Application

The second part of this assignment—a Shiny application—can be accessed here, with its code repository available here (see the note below).

Please note that the code repositories are private and require access permissions. If you would like access, please contact the author.

How to Cite

To cite this work, please use the following format:

Vartanian, D. (2025). Argus Media: Home assignment — Data scientist, Curves [Report]. https://danielvartan.github.io/argus

A BibTeX entry for LaTeX users is

@techreport{vartanian2025,
  title = {Argus Media: Home assignment — Data scientist, Curves},
  author = {{Daniel Vartanian}},
  year = {2025},
  address = {São Paulo},
  langid = {en},
  url = {https://danielvartan.github.io/argus}
}

License

This content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz