ZIM Updates

From openZIM
Revision as of 09:41, 13 June 2024 by Benoit74B (talk | contribs)

This page explains the openZIM convention regarding how we update (and delete) ZIM files.

Context

  • ZIM files are a snapshot of content at a given point in time
  • By nature, when content is updated (which happens mostly daily for most content) we want to update the ZIM file
  • This update is not expected to happen more than once a month
  • There are however rare edge cases where we will need to provide an updated ZIM file within a single month (typically when the scraper broke and the published ZIM is not functional)

How many files do we need to keep?

We need to keep multiple "versions" of the same content for two main reasons.

  • We want to ensure that published files are available for at least 30 days; this is mandatory so that people have sufficient time to start and complete a download before the file is deleted, even if their Internet connection is slow or the file is big
  • We want to be able to quickly restore a probably good previous version of the content, should something bad have happened (typically the scraper broke or the online content has been defaced)
    • The "probably good" aspect is important: we must avoid keeping only the two most recent builds, since the timespan between builds can be very small. We prefer keeping the version from last month (or earlier if the ZIM is updated less often) and the one from the current month.

Original approach

Originally, we decided to:

  • use the YYYY-MM pattern for period in ZIM filename
  • overwrite ZIM files when two ZIMs are produced in the same month
  • keep two versions of each ZIM content available for download, the last one and the one before (the one before being the ZIM produced the month before - or earlier - but not the previous build from the same month)

ZIM Update v2

Before June 2024, we did not take into account the fact that ZIM files can, in rare edge cases, be updated within a single month.

In June 2024 we started a project named ZIM Update v2 to take this into account.

This is still a WIP.

TL;DR

The period of ZIM filenames is changed from YYYY-MM to YYYY-MM[.ll], where .ll is an optional suffix of one or more letters, used only when a ZIM is updated again within the same month.

The retention rule is updated to match the requirements mentioned above (keep two clearly distinct versions and keep files for at least 30 days).

Details

The drivers for the ZIM Update v2 are:

  • v1 approach leads to overwriting the same ZIM file with new content
  • this update is not necessarily caught by download mirrors (this probably works in 99% of situations, but there is no strong guarantee and failures are hard to detect)
  • the user has no easy way to know which version of the ZIM file they are using (which complicates support cases)
  • downloads get corrupted if the ZIM file is updated during the download, which is not so rare given the huge ZIM files we have and the potentially slow Internet connections of our users (getting a corrupted ZIM is both annoying and a waste of time and bandwidth; moreover, most users might not even notice it, depending on the corruption and their usage of the ZIM, and downloaders are not always end users)
  • even the imager service is impacted, potentially building images with bad (old) or corrupted ZIM content

The new approach consists of:

  • keeping the same format for the period of the ZIM filename in the normal case
  • should we publish a new version of a ZIM within a single month, the period of the ZIM filename will use a new pattern: YYYY-MM.ll, where ll is one or more lowercase letters, starting from a (for the first update) up to z, then aa, ab, and so on (should we publish more than 26 updates in a single month, which has never been seen so far)
  • changing the ZIM file retention rules:
    • keep the last version from each of the two most recent distinct months (e.g. if we have `2024-04`, `2024-04.a`, `2024-06`, `2024-06.a`, `2024-06.b`, then we keep `2024-04.a` and `2024-06.b`; hopefully this scenario will never happen, but this is the algorithm)
    • AND keep every version which is 30 days old or less
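The two retention rules above can be sketched in Python. This is only an illustration, not the actual library management code: the function name `files_to_keep` is hypothetical, and the sketch assumes that a file's age is measured from a known publication date, which this page does not specify.

```python
from datetime import date


def sort_key(period: str) -> tuple[str, int, str]:
    # "YYYY-MM" first, then suffix length (so "z" sorts before "aa"), then the text.
    return period[:7], len(period), period


def files_to_keep(builds: dict[str, date], today: date, min_days: int = 30) -> set[str]:
    """Return the periods to keep; `builds` maps each period to its
    publication date (an assumption made for this sketch)."""
    # Rule 1: the last version from each of the two most recent distinct months.
    latest_per_month: dict[str, str] = {}
    for period in sorted(builds, key=sort_key):
        latest_per_month[period[:7]] = period  # later builds overwrite earlier ones
    keep = set(list(latest_per_month.values())[-2:])
    # Rule 2: AND every version which is 30 days old or less.
    keep |= {p for p, published in builds.items() if (today - published).days <= min_days}
    return keep


# The example from the list above, with made-up publication dates.
builds = {
    "2024-04": date(2024, 4, 5),
    "2024-04.a": date(2024, 4, 12),
    "2024-06": date(2024, 6, 3),
    "2024-06.a": date(2024, 6, 10),
    "2024-06.b": date(2024, 6, 20),
}
print(sorted(files_to_keep(builds, today=date(2024, 8, 15))))
# ['2024-04.a', '2024-06.b']
```

When run with a date shortly after the June rebuilds (for instance 25 June 2024), the second rule also retains `2024-06` and `2024-06.a`, since they are less than 30 days old.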

This approach seems the best compromise to:

  • not change the system in the normal case (ZIM updates within a single month are expected to be very rare)
  • make prominent the fact that a ZIM had to be updated (it has a new "thing" in its name)
  • still use something easily understandable by the user, and easy to remember, type, communicate, ...

We considered that we do not want to:

  • add information about the day when the ZIM has been built inside the period of the ZIM filename
    • this information is not easy for the user to grasp, potentially adding more confusion
    • this information is not sufficient to ensure that conflicts no longer happen (we can theoretically rebuild the same ZIM twice in a single day when a problem is quickly detected)
  • add a shortened version of the ZIM UUID instead of a letter suffix
    • this approach could make sense but seems less elegant
    • we considered this might be more difficult to read, and it does not provide an ordering of the ZIMs produced
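The ordering that the letter suffix provides can be made explicit with a small Python sketch. It is not taken from any openZIM tool: the names `parse_period` and `period_sort_key` are hypothetical, and lowercase letters are assumed from the examples on this page. Plain string sorting would place `aa` before `b`, so the key compares the suffix length before the suffix itself.

```python
import re

# Assumed shape of a period: YYYY-MM, optionally followed by ".<lowercase letters>".
PERIOD_RE = re.compile(r"(\d{4})-(\d{2})(?:\.([a-z]+))?")


def parse_period(period: str) -> tuple[int, int, str]:
    """Split a period into (year, month, suffix); suffix is "" when absent."""
    match = PERIOD_RE.fullmatch(period)
    if match is None:
        raise ValueError(f"not a valid period: {period!r}")
    year, month, suffix = match.groups()
    if not 1 <= int(month) <= 12:
        raise ValueError(f"invalid month in period: {period!r}")
    return int(year), int(month), suffix or ""


def period_sort_key(period: str) -> tuple[int, int, int, str]:
    """Order periods by build order: a shorter suffix sorts first, so z < aa."""
    year, month, suffix = parse_period(period)
    return year, month, len(suffix), suffix


periods = ["2024-06.aa", "2024-06", "2024-06.z", "2024-04.a"]
print(sorted(periods, key=period_sort_key))
# ['2024-04.a', '2024-06', '2024-06.z', '2024-06.aa']
```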

Expected impact of ZIM Update v2

For mirror owners, the impact is expected to be minor: we very rarely rebuild the same ZIM within a single month, so the impact on bandwidth and storage is going to be very small.

Mirroring software (mirrorbrain) is not impacted at all.

For scrapers, there is no impact: we will take care of the renaming at upload time, since the system needs to know which ZIMs already exist and scrapers have no knowledge of that.

Zimfarm uploader will be modified to properly rename the uploaded file on-the-fly when a conflict is detected.
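As a rough illustration of this renaming step, the following Python sketch picks the period to use for a new upload. It is not the Zimfarm uploader's code: `next_suffix` and `next_period` are hypothetical names, the way existing periods are listed is left out, and continuing after the highest existing suffix (rather than reusing a freed name) is an assumption.

```python
def next_suffix(suffix: str) -> str:
    """Return the suffix following `suffix`: "" -> a, a -> b, z -> aa, az -> ba."""
    letters = list(suffix)
    i = len(letters) - 1
    # Roll trailing "z" letters over to "a", like a carry in addition.
    while i >= 0 and letters[i] == "z":
        letters[i] = "a"
        i -= 1
    if i < 0:
        # Every letter rolled over (or the suffix was empty): grow by one letter.
        return "a" + "".join(letters)
    letters[i] = chr(ord(letters[i]) + 1)
    return "".join(letters)


def next_period(month: str, existing: set[str]) -> str:
    """Pick the period for a new build of `month` (YYYY-MM), given the
    periods already published for the same ZIM."""
    same_month = [p for p in existing if p == month or p.startswith(month + ".")]
    if not same_month:
        return month  # no conflict: keep the plain YYYY-MM period
    # The latest build has the longest, then alphabetically greatest, name.
    latest = max(same_month, key=lambda p: (len(p), p))
    return f"{month}.{next_suffix(latest[len(month) + 1:])}"


print(next_period("2024-06", {"2024-04", "2024-04.a"}))  # 2024-06
print(next_period("2024-06", {"2024-06"}))               # 2024-06.a
print(next_period("2024-06", {"2024-06", "2024-06.a"}))  # 2024-06.b
```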

Library management tooling will be modified to clean up old ZIMs according to the new retention rule.

For other management software (rest of the Zimfarm, imager service, hotspot tooling, offspot/metrics, ...) and for all ZIM readers, we will need to check if the new naming convention is causing problems and/or if special care should be taken to download the updated version once available.