Archivists at three of London's largest public collections confirmed this week that they are midway through a coordinated exercise to identify and replace duplicate images embedded across their digital catalogues, a problem that has quietly inflated storage costs and degraded search accuracy for years. The work, which began in earnest on 30 June, is expected to run through the end of July.
The timing matters. The UK government's Autumn 2025 spending review ring-fenced £4.2 million for digitisation improvement across national and regional archives — but institutions that cannot demonstrate clean, deduplicated holdings risk losing their share of that allocation. For collections already stretched by post-pandemic backlogs, the stakes are unusually high.
Where the Problem Is Worst
The London Metropolitan Archives on Northampton Road in Clerkenwell is understood to be the largest single site involved. Staff there have been working through a batch-processing pipeline that flags near-identical image files — photographs, maps, and scanned documents — where multiple versions were uploaded at different resolutions or file formats over the past decade. The result is that a single Victorian street photograph of, say, Cable Street in Stepney might exist in a catalogue under four or five separate reference numbers, each consuming server space and each potentially appearing as a distinct result when researchers run keyword searches.
The Wellcome Collection on Euston Road is running a parallel exercise focused specifically on its medical history photographic archive, which contains an estimated 250,000 digitised items. A significant proportion of those were ingested during two rapid digitisation drives in 2017 and 2021, when speed rather than deduplication was the priority. Both institutions declined to provide named spokespeople for this piece, but publicly available tender documents published on the government's Contracts Finder portal in May 2026 outline the scope of the work and the vendors involved.
The Museum of London Docklands, which holds major collections tied to the history of trade and migration through the East End, is also participating, according to the same procurement documentation. Its digital team has been using open-source perceptual hashing tools — software that compares images not by filename but by pixel content — to catch duplicates that slipped through earlier quality checks.
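To illustrate what comparing images "by pixel content" can mean in practice, here is a minimal sketch of one simple perceptual hashing variant, often called an average hash. The procurement documents do not name the tools or algorithm in use, so this is an assumption for illustration only. The function names, the 8x8 hash size, and the threshold of 5 differing bits are hypothetical. Images are represented as plain grids of grayscale values rather than decoded from files.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[float]]  # grayscale pixel values, row by row


def downscale(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Shrink a grayscale image to size x size by averaging blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for row in range(size):
        y0 = row * height // size
        y1 = max((row + 1) * height // size, y0 + 1)
        out_row = []
        for col in range(size):
            x0 = col * width // size
            x1 = max((col + 1) * width // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            out_row.append(sum(block) / len(block))
        small.append(out_row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    flat = [value for row in downscale(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, threshold: int = 5) -> bool:
    """Treat two images as near-duplicates if their hashes differ in few bits.

    The threshold is a hypothetical value chosen for illustration.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= threshold


# The same left-to-right gradient at two resolutions, plus an unrelated image.
scan_large = [[x * 4 for x in range(64)] for _ in range(64)]
scan_small = [[x * 8 for x in range(32)] for _ in range(32)]
other_image = [[y * 4 for _ in range(64)] for y in range(64)]

print(is_near_duplicate(scan_large, scan_small))   # True
print(is_near_duplicate(scan_large, other_image))  # False
```

Because the fingerprint is computed after shrinking the image, the same picture uploaded at two resolutions produces matching or nearly matching hashes even though the filenames and file sizes differ.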
What the Data Shows
Pilot audits conducted earlier this year across a sample of roughly 40,000 items at one of the participating institutions found that approximately 12 percent of image records were duplicates or near-duplicates. Extrapolated across a collection of 250,000 items, that rate implies roughly 30,000 redundant files consuming live storage. Cloud storage for archival-grade image files is not cheap: commercial providers typically charge institutions on public-sector frameworks between £18 and £22 per terabyte per month for warm storage, meaning files that remain accessible rather than fully offline.
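The extrapolation in the paragraph above is simple arithmetic, sketched here for readers who want to rerun it with their own figures. The duplicate rate, collection size, and per-terabyte prices come from the article. The function names and the average file size parameter are hypothetical, and the article does not report an average file size.

```python
from typing import Tuple


def estimated_duplicates(total_items: int, duplicate_rate: float) -> int:
    """Scale a pilot duplicate rate up to a whole collection."""
    return round(total_items * duplicate_rate)


def monthly_cost_range_gbp(
    duplicate_files: int,
    avg_file_gb: float,       # hypothetical input: not reported in the article
    low_rate: float = 18.0,   # GBP per terabyte per month, lower bound quoted
    high_rate: float = 22.0,  # GBP per terabyte per month, upper bound quoted
) -> Tuple[float, float]:
    """Return the (low, high) monthly warm-storage cost of the duplicate files."""
    terabytes = duplicate_files * avg_file_gb / 1000  # decimal terabytes
    return terabytes * low_rate, terabytes * high_rate


duplicates = estimated_duplicates(250_000, 0.12)
print(duplicates)  # 30000
```

The resulting monthly cost depends entirely on the average file size supplied, so any figure produced this way is only as reliable as that assumption.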
Beyond cost, duplicates cause practical harm. When a researcher searches the London Metropolitan Archives online catalogue for images of Bermondsey tanneries, duplicate records push genuinely distinct items further down the results pages. Teachers, documentary makers, and academic historians have long complained about the problem informally, though none of the three institutions has published a formal user-satisfaction survey to date.
The deduplication drive also connects to a broader policy pressure. Sadiq Khan's office has signalled interest in a unified London digital heritage portal that would aggregate search results across the capital's major collections — a project that makes no sense if the underlying catalogues are polluted with redundant entries.
Institutions have until 31 July to submit revised asset inventories if they want their deduplication work to count toward the government's spending review compliance requirements. For collections that fall short, the practical consequence is likely a deferral of digitisation grant funding rather than an immediate financial penalty — but in a year when every heritage budget is under pressure, deferral is punishment enough.