Batch deduplication for large searches

This function splits citations into batches, deduplicates each batch, and binds the results together into a single dataframe of unique citations.
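
The split, deduplicate, and bind pattern described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `dedup_exact()` and `batch_dedup_sketch()` are hypothetical helpers, the matching rule (exact match on year plus lower-cased title) is a simplified stand-in for the real matching logic, and the toy data is made up.

```r
# Toy citations; the columns and values here are made up for illustration.
citations <- data.frame(
  record_id = as.character(1:6),
  year  = c("1996", "1991", "1996", "1997", "1991", "1997"),
  title = c("Fast heart", "Enhanced flow", "fast heart",
            "Failure modes", "Enhanced flow", "Genistein effects"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact-match deduplication on year + lower-cased title.
# This is only a stand-in for the richer matching a real tool would use.
dedup_exact <- function(df) {
  key <- paste(df$year, tolower(df$title))
  df[!duplicated(key), , drop = FALSE]
}

# Hypothetical helper sketching the split / deduplicate / bind pattern.
batch_dedup_sketch <- function(df, batch_n = 2) {
  # Sort first so that likely duplicates sit next to each other.
  df <- df[order(df$year, tolower(df$title)), , drop = FALSE]
  # Assign each row to a batch of at most batch_n rows.
  batch_id <- ceiling(seq_len(nrow(df)) / batch_n)
  batches <- split(df, batch_id)
  # Deduplicate each batch, then bind the results back together.
  unique_rows <- do.call(rbind, lapply(batches, dedup_exact))
  rownames(unique_rows) <- NULL
  unique_rows
}

unique_citations <- batch_dedup_sketch(citations, batch_n = 2)
```

In this sketch, duplicates that straddle a batch boundary are not caught, which is one reason sorting before splitting matters. How the actual function treats such cases is not described here.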

Usage

batch_dedup(
  citations,
  batch_n = 50000,
  keep_source = NULL,
  keep_label = NULL,
  sort_by = c("year", "title", "author")
)

Arguments

citations: A dataframe containing citation information.
batch_n: Numeric value specifying the maximum number of citations per batch. Default is 50000.
keep_source: Character vector specifying the citation source(s) to preferentially retain in the dataset as the unique record.
keep_label: Character vector specifying the citation label(s) to preferentially retain in the dataset as the unique record.
sort_by: Character vector specifying the columns to sort by. Default is c("year", "title", "author"). Valid options are column names of the citations dataframe.
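
To illustrate what "preferentially retain" could mean for keep_source, the sketch below picks one record from a group of duplicates, favouring a named source. It is an assumption-laden illustration only: `pick_preferred()` is a hypothetical helper, the source names and IDs are made up, and the package's actual selection rule may differ.

```r
# Toy duplicate group: three records judged to be the same citation.
# The duplicate_id, record_id, and source values are made up for illustration.
dup_group <- data.frame(
  duplicate_id = c("1075", "1075", "1075"),
  record_id    = c("a1", "b7", "c3"),
  source       = c("embase", "pubmed", "scopus"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick one record per group, preferring keep_source.
pick_preferred <- function(group, keep_source = NULL) {
  if (!is.null(keep_source)) {
    preferred <- group[group$source %in% keep_source, , drop = FALSE]
    if (nrow(preferred) > 0) return(preferred[1, , drop = FALSE])
  }
  # Fall back to the first record when no preferred source is present.
  group[1, , drop = FALSE]
}

kept <- pick_preferred(dup_group, keep_source = "pubmed")
kept_default <- pick_preferred(dup_group)
```

Here `kept` holds the record from the preferred source, while `kept_default` falls back to the first record in the group. The same idea would apply to keep_label with a label column.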

Value

A list with components:

unique: A dataframe containing unique citations.
manual_dedup: A dataframe containing citations to be manually checked for duplicates.

Details

The following fields in citations will be used (if provided): record_id, author, year, journal, doi, title, pages, volume, number, abstract, isbn, label, source.
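
Before running the function, it can help to check which of these fields a dataframe actually contains. The following is a small base R sketch of such a check; the toy dataframe and its values are made up, and the check itself is not part of the package.

```r
# Field names taken from the list above.
expected_fields <- c("record_id", "author", "year", "journal", "doi", "title",
                     "pages", "volume", "number", "abstract", "isbn",
                     "label", "source")

# Toy dataframe with only a few of the fields; values are made up.
citations <- data.frame(
  author = c("Smith J.", "Lee K."),
  year   = c("1996", "1997"),
  title  = c("First title", "Second title"),
  stringsAsFactors = FALSE
)

# Which expected fields are present, and which are missing?
present_fields <- intersect(expected_fields, names(citations))
missing_fields <- setdiff(expected_fields, names(citations))
```

With this toy input, `present_fields` lists author, year, and title, and `missing_fields` lists the remaining ten names.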

Examples

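# Perform batch deduplication with at most 250 citations per batch
# (citations_df is assumed to be a dataframe of citations loaded beforehand)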
result <- batch_dedup(citations_df, batch_n = 250)
#> Splitting up dataframe
#> batch 1 complete ✔
#> batch 2 complete ✔
#> batch 3 complete ✔
#> batch 4 complete ✔
#> batch 5 complete ✔
#> identified 611 unique citations

# View unique citations
head(result$unique)
#> # A tibble: 6 × 16
#>   duplicate_id author     year  journal doi   title pages volume number abstract
#>   <chr>        <chr>      <chr> <chr>   <chr> <chr> <chr> <chr>  <chr>  <chr>   
#> 1 1075         Doenst T.… 1996  Am J P… 10.1… Fast… H160… 270    5 Pt 2 We test…
#> 2 1092         Dorheim T… 1991  Surgery NA    Enha… 136-… 110    2      Reversi…
#> 3 1182         Erikson J… 1996  Am Hea… 10.1… Endo… 84-90 132    1 Pt 1 This st…
#> 4 1184         Eskildsen… 1996  Ann N … 10.1… Expl… 210-… 793    NA     NA      
#> 5 1210         Faris B.,… 1997  Ann Th… 10.1… Fail… 1735… 64     6      BACKGRO…
#> 6 1211         Fatehi-Ha… 1997  Eur J … 10.1… Geni… 67-70 338    1      The pos…
#> # ℹ 6 more variables: isbn <chr>, secondary_title <chr>, label <chr>,
#> #   url <chr>, source <chr>, record_ids <chr>