Difference between revisions of "Zimit"

From openZIM
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file into a ZIM file
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]

Revision as of 08:22, 22 May 2023

Zimit is a tool for creating a ZIM file from "any" website.

Context

openZIM provides many scraper software solutions, each dedicated to a specific source of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, ... This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.

Zimit is our approach to scraping a "random" website and getting an acceptable snapshot that can be used offline.

One important point is that specific pieces of embedded JavaScript code, in particular those used to play videos, continue to work.

Principle

The principles of Zimit are:

  • Crawl the remote website to retrieve all the necessary content
  • Save all the retrieved content in a WARC file
  • Convert the WARC file to a ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web App)
  • Read the ZIM file in any Kiwix reader
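
The steps above can be illustrated with a toy model. The sketch below is not Zimit's code and uses none of the Browsertrix or Warc2zim APIs: the in-memory `SITE` dictionary, the `crawl` and `convert` functions, and the `"replay-app"` placeholder are all hypothetical names chosen for illustration. It only shows the shape of the pipeline: follow links from a start page, keep every response as a record, then bundle the records together with a reader.

```python
from collections import deque

# Hypothetical in-memory "website": URL path -> (body, outgoing links).
SITE = {
    "/": ("<h1>Home</h1>", ["/about", "/video"]),
    "/about": ("<p>About</p>", ["/"]),
    "/video": ("<video src='/clip.webm'></video>", ["/clip.webm"]),
    "/clip.webm": ("binary-data", []),
    "/orphan": ("never linked", []),
}


def crawl(start):
    """Breadth-first crawl that follows links from the start page."""
    seen, queue, records = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        body, links = SITE[url]
        # Keep every response as a record (stand-in for a WARC entry).
        records.append({"url": url, "payload": body})
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return records


def convert(records):
    """Index records by URL and bundle a reader (stand-in for a ZIM file)."""
    entries = {r["url"]: r["payload"] for r in records}
    return {"reader": "replay-app", "entries": entries}


archive = convert(crawl("/"))
print(sorted(archive["entries"]))
```

Note that `/orphan` never ends up in the archive: a crawler only captures what it can reach from the start page, which is one reason a snapshot of a "random" website is acceptable rather than guaranteed complete.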

URL rewriting

Source code