2023: Day 5 - If You Give A Seed A Fertilizer

Published

December 5, 2023

Setup

Part 1

Input: an almanac that lists all the seeds that need to be planted to solve Island Island food production problem.

We start with a list of seeds identified by numbers.

Mapping System: Each seed is associated, based on its number, with specific types of agricultural resources (soil, fertilizer, water, and so on), which are also identified by numbers, through a series of mappings that connect “source” numbers with “destination” numbers.

[The mappings indicate] what type of soil to use with each kind of seed, what type of fertilizer to use with each kind of soil, what type of water to use with each kind of fertilizer, and so on.

Every type of seed, soil, fertilizer and so on is identified with a number, but numbers are reused by each category - that is, soil 123 and fertilizer 123 aren’t necessarily related to each other.

For example:

seeds: 79 14 55 13

seed-to-soil map:
50 98 2
52 50 48

Rather than list every source number and its corresponding destination number one by one, the maps describe entire ranges of numbers that can be converted…

The first line has a destination range start of 50, a source range start of 98, and a range length of 2. This line means that the source range starts at 98 and contains two values: 98 and 99. The destination range is the same length, but it starts at 50, so its two values are 50 and 51. With this information, you know that seed number 98 corresponds to soil number 50 and that seed number 99 corresponds to soil number 51.

Any source numbers that aren’t mapped correspond to the same destination number. So, seed number 10 corresponds to soil number 10.

Available mappings go from seed to location (seed -> soil -> fertilizer -> water -> light -> temperature -> humidity -> location).

Find the lowest location number that corresponds to any of the initial seeds

Ideas of how to approach this problem:

Store seeds numbers as a vector.
Turn each mapping into a function.
Crazy idea: make a “function factory” (https://adv-r.hadley.nz/function-factories.html) to programatically create functions for each mapping using the numbers from each one.

Storing the seeds as a vector:

seeds <- read_lines(here("2023/day/5/input"), n_max = 1) %>%
  str_extract_all("\\d+") %>%
  as_vector() %>%
  as.numeric()

seeds

 [1] 2906422699    6916147 3075226163  146720986  689152391  244427042
 [7]  279234546  382175449 1105311711    2036236 3650753915  127044950
[13] 3994686181   93904335 1450749684  123906789 2044765513  620379445
[19] 1609835129   60050954

Now it’s time to load the mappings. I’ll start loading a single mapping in a vector as a “proof of concept”:

seed_to_soil <-
  read_lines(here("2023/day/5/input"), skip = 3, n_max = 22)

seed_to_soil

 [1] "2642418175 2192252668 3835256"   "2646253431 2276158914 101631202"
 [3] "2640809144 3719389865 1609031"   "2439110058 2377790116 121628096"
 [5] "439727986 2712085714 392957193"  "993018128 1316992003 327657967" 
 [7] "832685179 1875058987 50438969"   "2796039666 0 1107546829"        
 [9] "182253984 3569317158 150072707"  "2747884633 1826903954 48155033" 
[11] "2268424297 3406848659 162468499" "0 1644649970 182253984"         
[13] "1794130013 2499418212 105266207" "2560738154 2196087924 80070990" 
[15] "1512587867 1925497956 72096972"  "2094053960 3729216158 174370337"
[17] "1320676095 3105042907 191911772" "1899396220 1997594928 194657740"
[19] "2430892796 3720998896 8217262"   "1584684839 1107546829 209445174"
[21] "332326691 2604684419 107401295"  "883124148 3296954679 109893980"

The sample mapping was correctly loaded but I think it will be easier to work with the mappings if I transform them into a dataframe first.

mapping <- 
  seed_to_soil %>%
  map(~str_split_1(., " ")) %>%
  map(as.numeric) %>%
  map(~set_names(., c("start_dest", "start_source", "range"))) %>%
  enframe() %>%
  unnest_wider(value) %>%
  mutate(
    end_source = start_source + range - 1,
    offset = start_dest - start_source
    ) %>%
  select(start_source, end_source, offset) %>%
  arrange(start_source)

mapping

# A tibble: 22 × 3
   start_source end_source      offset
          <dbl>      <dbl>       <dbl>
 1            0 1107546828  2796039666
 2   1107546829 1316992002   477138010
 3   1316992003 1644649969  -323973875
 4   1644649970 1826903953 -1644649970
 5   1826903954 1875058986   920980679
 6   1875058987 1925497955 -1042373808
 7   1925497956 1997594927  -412910089
 8   1997594928 2192252667   -98198708
 9   2192252668 2196087923   450165507
10   2196087924 2276158913   364650230
# ℹ 12 more rows

Now I’ll create a function that receives a function as input (e.g. a seed) and:

Looks up which source range does it belong to (this is very similar to Excel’s VLOOKUP).
Returns the input number adjusted by the corresponding offset.

compute_mapping <- function(input, mapping) {
  tibble(input = input) %>%
  left_join(
    mapping, by = join_by(between(input, start_source, end_source))) %>%
  mutate(
    offset = replace_na(offset, 0),
    output = input + offset
  ) %>%
  pull(output)
}

compute_mapping(seeds, mapping)

 [1]  634064971 2802955813  802868435 2942760652 3485192057 3040466708
 [7] 3075274212 3178215115 3901351377 2798075902  263690741 2923084616
[13] 3994686181 2889944001 1126775809 2919946455 1946566805 3416419111
[19] 1285861254 2856090620

This works as expected for the sample mapping.

Now I need to generalise this procedure. For this, I have to read all the mappings from the input file as dataframes.

The following pipeline does just that, creating a nested dataframe where the first column is the name of the mapping (e.g. seed-to-soil, soil-to-fertilizer, and so on) and the second column is the mapping itself.

mappings_df <-
  read_lines(here("2023/day/5/input"), skip =2) %>%
  tibble(x = .) %>%
  filter(x != "") %>%
  mutate(name = str_match(x, '(.+) map')[,2]) %>%
  fill(name) %>%
  filter(!str_detect(x, "map")) %>%
  nest(data = x) %>%
  mutate(data = map(data, as_vector))

mappings_df

# A tibble: 7 × 2
  name                    data      
  <chr>                   <list>    
1 seed-to-soil            <chr [22]>
2 soil-to-fertilizer      <chr [28]>
3 fertilizer-to-water     <chr [48]>
4 water-to-light          <chr [38]>
5 light-to-temperature    <chr [47]>
6 temperature-to-humidity <chr [40]>
7 humidity-to-location    <chr [23]>

There’s just one problem: the mappings in the column data are not yet parsed as dataframes. They’re currently just vectors, with each item combining the source, destination, and offset into a single string.

mappings_df[['data']][[1]][1:5]

                               x1                                x2 
  "2642418175 2192252668 3835256" "2646253431 2276158914 101631202" 
                               x3                                x4 
  "2640809144 3719389865 1609031" "2439110058 2377790116 121628096" 
                               x5 
 "439727986 2712085714 392957193"

To solve this, I’m creating the function chr_to_mapping, which takes these character vectors as input and returns the mapping dataframes I want as output.

chr_to_mapping <- function(chr)  {
  chr %>%
  map(~str_split_1(., " ")) %>%
  map(as.numeric) %>%
  map(~set_names(., c("start_dest", "start_source", "range"))) %>%
  enframe() %>%
  unnest_wider(value) %>%
  mutate(
    end_source = start_source + range - 1,
    offset = start_dest - start_source
    ) %>%
  select(start_source, end_source, offset) %>%
  arrange(start_source)
}

Next, I use purrr::map to apply this function to the data column, creating the mappings list-column that contains the dataframes I need.

mappings_df <- mappings_df %>%
  mutate(
    mappings = map(data, chr_to_mapping)
  )

Then, I can simply pull the mappings, and starting with the seeds, use the compute_mapping function to apply all the mappings and determine the final location values.

mappings <- mappings_df %>% pull(mappings)

values <- seeds

for (i in seq_along(mappings)) {
  values <- compute_mapping(values, mappings[[i]])
}

The solution is just the minimum of these location values.

min(values)

[1] 177942185

Part 2

Now it turns out that the seeds are actually RANGES.

The values on the initial seeds: line come in pairs. Within each pair, the first value is the start of the range and the second value is the length of the range. So, in the first line of the example above:

seeds: 79 14 55 13

This line describes two ranges of seed numbers to be planted in the garden. The first range starts with seed number 79 and contains 14 values: 79, 80, …, 91, 92. The second range starts with seed number 55 and contains 13 values: 55, 56, …, 66, 67.

It seems the computational cost/complexity of my solution is going to scale abruptly given this new interpretation of the input.

The first thing I’ll try to do is to represent the ranges as vectors:

seeds_m <- matrix(seeds, ncol = 2, byrow = TRUE) 

colnames(seeds_m) <- c("start", "range")

df_all_seeds <- as_tibble(seeds_m) %>%
  mutate(all_seeds = map2(start, range, ~as.numeric(.x:(.x + .y))))

df_all_seeds

# A tibble: 10 × 3
        start     range all_seeds          
        <dbl>     <dbl> <list>             
 1 2906422699   6916147 <dbl [6,916,148]>  
 2 3075226163 146720986 <dbl [146,720,987]>
 3  689152391 244427042 <dbl [244,427,043]>
 4  279234546 382175449 <dbl [382,175,450]>
 5 1105311711   2036236 <dbl [2,036,237]>  
 6 3650753915 127044950 <dbl [127,044,951]>
 7 3994686181  93904335 <dbl [93,904,336]> 
 8 1450749684 123906789 <dbl [123,906,790]>
 9 2044765513 620379445 <dbl [620,379,446]>
10 1609835129  60050954 <dbl [60,050,955]>

I tried (in code that not shown here) to just run my previous code on all_seeds but it just doesn’t work: my computer runs out of memory.

Then I tried some alternative approaches that may have been more efficient, such as:

Collapsing all the mappings (seed-to-soil, soil-to-fertilizer, and so on) into just one look-up table that directly maps seeds to locations. This may save memory in the compute_mapping process. The problem is that this is easier said than done. Implementing this idea is not trivial, at least not for me. After a couple of attempts I ended up giving up with this path because I didn’t have enough time to figure out how to pull it off.
Using the intervals R package. This package seems to allow performing interval operations (such as intersections and complements) in a more efficient way, without having to explicitly load all the numbers of the interval in memory. Unfortunately, I also ran out of my “time budget” when trying to implement this idea, so I ended up giving up on it too. Still, I have a strong feeling that “this was the way” to efficiently tackle Part 2 of this problem.

Note: one of the issues I couldn’t solve was that the intervals are so huge that even trying to perform one transformation causes a cannot allocate vector of size XX.X Gb error. This problem is not addressed by the “collapsing all the mappings” idea. So, even if I had been able to implement it, I would still have ran into the same error. That’s why I suspect that the intervals approach has higher chances of success.