Difference between revisions of "Content team/ZIM Naming Convention"

From openZIM
(Copy page from Github wiki to openZIM wiki)
 
(better explain the lang + use domain instead of project in ZIM Name format)
Line 1: Line 1:
<blockquote>This page was originally located at https://github.com/openzim/overview/wiki/ZIMs-Naming-Convention </blockquote>
<blockquote>This page was originally located at https://github.com/openzim/overview/wiki/ZIMs-Naming-Convention </blockquote>This page explains the naming convention use both for the ZIM `Name` metadata and the ZIM filename, for ZIMs published by openZIM.


This is an openZIM convention, i.e. other publishers are free to follow the same convention or develop their own.
=== Context ===
=== Context ===


Line 21: Line 22:


=== ZIM <code>Name</code> Metadata ===
=== ZIM <code>Name</code> Metadata ===
Format: '''<code>{project}_{lang}_{selection}</code>'''
Format: '''<code>{domain}_{lang}_{selection}</code>'''


The <code>_</code> character is reserved as separator between the parts.  
The <code>_</code> character is reserved as separator between the parts.  
Line 34: Line 35:
!Example
!Example
|-
|-
|<code>project</code>
|<code>domain</code>
|Domain name (or project) <sup>1</sup>
|Domain name (or project) <sup>1</sup>
|<code>android.stackexchange.com</code>, <code>wikipedia</code>
|<code>android.stackexchange.com</code>, <code>wikipedia</code>
|-
|-
|<code>lang</code>
|<code>lang</code>
|ISO-639-1 (2 chars) language code
|ISO-639 language code or <code>mul</code> <sup>2</sup>
|<code>en</code>, <code>fr</code>, <code>zh</code>, <code>mul</code><sup>2</sup>
|<code>en</code>, <code>fr</code>, <code>zh</code>, <code>mul</code>
|-
|-
|<code>selection</code>
|<code>selection</code>
Line 47: Line 48:
|}
|}


* <sup>1</sup> Domain name by default, project names are exceptions (basically valid only if we at least have a dedicated category for this project); use domain names if unsure, or best, ask on Slack. Should domain name could contains illegal characters for our convention, it will be encoded with Punycode, e.g. https://www.punycoder.com/)
* <sup>1</sup> By default, use the web domain name associated with the content (including for Youtube channels, ...). Project names are exceptions (basically valid only if we at least have a dedicated category for this project); use domain names if unsure, or best, ask on Slack. Should domain name could contains illegal characters for our convention, it will be encoded with Punycode, e.g. https://www.punycoder.com/)
* <sup>2</sup> <code>mul</code> is to be used for multiple-language ZIMs. Note that the ZIM <code>Language</code> metadata lists the languages (ISO-639-3) instead of using <code>mul</code>
*2 Whenever possible, prefer to use the ISO-639-1 (2 chars) language code. When the ISO-639-1 code does not exists or is ambiguous (leading to conflict of ZIM Name between two different ZIMs), using the ISO-639-3 is recommended. When multiple languages are present inside the ZIM, <code>mul</code> is to be used. Note that the ZIM <code>Language</code> metadata lists all the languages (ISO-639-3) instead of using <code>mul</code>


=== ZIM filename ===
=== ZIM filename ===
Line 79: Line 80:
* <sup>1</sup> It doesn't need to be the equal to the `Name` metadata but requirements identical.
* <sup>1</sup> It doesn't need to be the equal to the `Name` metadata but requirements identical.


=== Zimfarm ===
=== Implementation on the Zimfarm ===
Depending on the scraper, setting the <code>Name</code> metadata in the Zimfarm can be mandatory (follow above instructions) or optional. When optional, the scraper usually properly sets it according to the convention. Should it not, open a ticket on the scraper repo and set it manually in the recipe until it is fixed.
Depending on the scraper, setting the <code>Name</code> metadata in the Zimfarm can be mandatory (follow above instructions) or optional. When optional, the scraper usually properly sets it according to the convention. Should it not, open a ticket on the scraper repo and set it manually in the recipe until it is fixed.



Revision as of 08:37, 13 June 2024

This page was originally located at https://github.com/openzim/overview/wiki/ZIMs-Naming-Convention

This page explains the naming convention used for both the ZIM `Name` metadata and the ZIM filename, for ZIMs published by openZIM.

This is an openZIM convention, i.e. other publishers are free to follow the same convention or develop their own.

Context

  • When publishing a ZIM, it's important to pay attention to its metadata as those are the way other people will distinguish it from other content
  • Metadata lists the common and required metadata expected for a ZIM file
  • None of them needs to be unique. ZIMs already include an identifier (called ID, which is a UUID) that is generated automatically during creation. This doesn't diminish the value of the other metadata though. You still want readers to easily and confidently choose ZIMs according to those.
  • We need to ensure collisions will not happen (typically two different websites leading to the same ZIM Name) and that users understand which source content they are downloading / using
  • Choosing good and appropriate metadata can be difficult, but it's not what this document is about.

This document is about setting valid Name metadata and filename for openZIM-created ZIMs (usually via the Zimfarm).

Why do we care?

  • We create thousands of ZIMs every month. Convention is essential to be able to automate some tasks.
  • A convention means applying a pattern, so there is no need to work out what to use: simpler, faster.
  • We use Name metadata to match Zimfarm-produced ZIMs with *Titles* in the CMS
  • We use Name metadata to set the ZIM filename in most scrapers.
  • Many scripts depend on the filenames to maintain the central library: build the XML library, move files to the appropriate folder, evict older files, generate redirects, etc.
  • Offspot YAML catalog uses *Human IDs* that are derived from the filenames.

ZIM Name Metadata

Format: {domain}_{lang}_{selection}

The _ character is reserved as separator between the parts.

The parts must only contain alphanums or - or . characters.

The parts must be all lowercase.

Components of ZIM Name Metadata

Part      | Description                                                         | Example
domain    | Domain name (or project) (1)                                        | android.stackexchange.com, wikipedia
lang      | ISO-639 language code or mul (2)                                    | en, fr, zh, mul
selection | A short, slug-like string indicating the selection over the project | all, top, football

  • (1) By default, use the web domain name associated with the content (including for YouTube channels, etc.). Project names are exceptions (basically valid only if we have at least a dedicated category for this project); use the domain name if unsure or, better, ask on Slack. If the domain name contains characters that are illegal under this convention, it is encoded with Punycode (see e.g. https://www.punycoder.com/).
  • (2) Whenever possible, prefer the ISO-639-1 (2 chars) language code. When the ISO-639-1 code does not exist or is ambiguous (leading to a ZIM Name conflict between two different ZIMs), using the ISO-639-3 code is recommended. When multiple languages are present inside the ZIM, use mul. Note that the ZIM Language metadata lists all the languages (ISO-639-3) instead of using mul.
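
As an illustration of the rules above, here is a minimal Python sketch that assembles and checks a Name value. The helper names (`PART_RE`, `build_name`, `to_ascii_domain`) are hypothetical and not part of any openZIM tool. The regular expression is one reading of "alphanums or - or ." restricted to lowercase ASCII, and the Punycode step uses Python's built-in `idna` codec as one possible encoder.

```python
import re

# Assumption: "alphanums" is read as lowercase ASCII letters and digits.
# "_" is excluded because it is reserved as the separator between parts.
PART_RE = re.compile(r"[a-z0-9.-]+")


def to_ascii_domain(domain: str) -> str:
    """Lowercase a domain and Punycode-encode any non-ASCII labels."""
    return domain.lower().encode("idna").decode("ascii")


def build_name(domain: str, lang: str, selection: str) -> str:
    """Join the three parts as {domain}_{lang}_{selection} after checking them."""
    parts = (domain, lang, selection)
    for part in parts:
        if not PART_RE.fullmatch(part):
            raise ValueError(f"invalid Name part: {part!r}")
    return "_".join(parts)


print(build_name("android.stackexchange.com", "en", "all"))
print(build_name(to_ascii_domain("bücher.example"), "mul", "all"))
```

This sketch does not check that `lang` is a real ISO-639 code or that the resulting Name is free of conflicts with another ZIM; both still need a human decision.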

ZIM filename

Format: {Name}[_{flavour}]_{period}.zim

The _ character is reserved as separator between the parts.

The parts must only contain alphanums or - or . characters.

The filename must be all lowercase.

Components of ZIM filename

Part    | Description                                                                                        | Example
Name    | The Name metadata described above (1)                                                              | wikipedia_fr_top, wikihow_th_all, stackoverflow.com_en_all
flavour | Optional. One of the existing flavours, indicating a modification of the content for size reasons  | mini, nopic, maxi
period  | The period when the ZIM was created, in format YYYY-MM (year-month)                                | 2019-03, 2022-12

  • (1) It doesn't need to be equal to the `Name` metadata, but the same requirements apply.
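
The filename format can be sketched in Python as follows. The names `build_filename` and `period_for` are hypothetical, and the flavour is only checked against the character rules, not against the actual list of existing flavours, which this page does not enumerate in full.

```python
import datetime
import re
from typing import Optional

# Same character rule as for the Name parts (assumed lowercase ASCII).
PART_RE = re.compile(r"[a-z0-9.-]+")
# YYYY-MM with a month between 01 and 12.
PERIOD_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def period_for(day: datetime.date) -> str:
    """Format a date as the YYYY-MM period string."""
    return day.strftime("%Y-%m")


def build_filename(name: str, period: str, flavour: Optional[str] = None) -> str:
    """Build {Name}[_{flavour}]_{period}.zim after checking each part."""
    parts = name.split("_")
    if flavour is not None:
        parts.append(flavour)
    for part in parts:
        if not PART_RE.fullmatch(part):
            raise ValueError(f"invalid filename part: {part!r}")
    if not PERIOD_RE.fullmatch(period):
        raise ValueError(f"invalid period: {period!r}")
    return "_".join(parts + [period]) + ".zim"


print(build_filename("wikipedia_fr_top", "2019-03"))
print(build_filename("wikihow_th_all", period_for(datetime.date(2022, 12, 5)), "nopic"))
```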

Implementation on the Zimfarm

Depending on the scraper, setting the Name metadata in the Zimfarm can be mandatory (follow above instructions) or optional. When optional, the scraper usually properly sets it according to the convention. Should it not, open a ticket on the scraper repo and set it manually in the recipe until it is fixed.

Filenames are also optional in the Zimfarm, but the common behavior is to append the period part (ex: _2022-01) after the value of the Name metadata. If you customized the Name, make sure the filename will remain valid or set it manually.

Important: when setting the filename manually, you are responsible for the whole filename, including the period part. Most scrapers allow inserting a special `{period}` string that will be replaced with the actual year-month value. Ex: supersite.com_en_all_{period}.zim.
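
To make the `{period}` substitution concrete, here is a small Python sketch of how such a placeholder could be expanded. It only illustrates the idea and is not the code any scraper actually uses; `expand_period` and `default_filename` are hypothetical names.

```python
import datetime


def expand_period(template: str, day: datetime.date) -> str:
    """Replace the literal {period} placeholder with the YYYY-MM of the given date."""
    return template.replace("{period}", day.strftime("%Y-%m"))


def default_filename(name: str, day: datetime.date) -> str:
    """Mimic the common default: append the period part after the Name value."""
    return expand_period(name + "_{period}.zim", day)


run_date = datetime.date(2022, 1, 15)
print(expand_period("supersite.com_en_all_{period}.zim", run_date))
print(default_filename("supersite.com_en_all", run_date))
```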