Zimit
Zimit is a tool for creating a ZIM file of "any" website.
Context
openZIM provides many scrapers, each dedicated to one source of content such as TED, Wikipedia (MediaWiki) or Project Gutenberg. These produce quality ZIM files, but developing and maintaining each of them is costly.
Zimit is our approach to scraping a "random" website and getting an acceptable snapshot that can be used offline.
One important point is that embedded JavaScript code, in particular the code used to play videos, continues to work.
Principle
The principles of Zimit are:
- Crawl the remote website to retrieve all the necessary content
- Save all the retrieved content in WARC file(s)
- Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web App)
- Read the ZIM file in any Kiwix reader
Player
- The ServiceWorker (SW) is installed on the welcome page. If any page is loaded while the SW is not yet loaded, a redirection to the homepage happens to load the SW, and the browser then automatically comes back to the original page. To achieve that, the HEAD node of each page is modified to insert the appropriate piece of JavaScript at the time of the WARC to ZIM conversion.
- In the reader Wabac.js, there is only one specific part related to ZIM content structure and this is in "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. For the rest the code is the same as before.
- Regarding URL rewriting itself, we have two kinds:
  - The static URL rewriting, which is done with Wombat (mostly code-driven)
  - The fuzzy matching, which is done within the ServiceWorker (mostly data-driven)
- The URL rewriting is done at two levels:
  - When the JavaScript code calls specific browser APIs, these calls are superseded and ultimately call Wombat
  - When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.
Source code
- Browsertrix crawler, the Web crawler which gathers everything in a WARC file
- Wombat, a standalone client-side URL rewriting system
- Warc2zim, a command line tool transforming a WARC file to a ZIM file
- Wabac.js, the ServiceWorker-based reader for the content
- Zimit, the packaging within a Docker image of both Browsertrix and Warc2zim
- Zimit frontend, the Web UI used for the Zimit SaaS solution youzim.it
Current implementation workflow (to be confirmed)
At creation time
- Browsertrix creates (somehow) a WARC file.
- warc2zim converts the WARC file into a ZIM file. To do so it:
  - Loops on all records in the WARC file.
  - For each record:
    - Extracts the url: "urlkey" if present, else "WARC-Target-URI"
    - Adds a `H/<url>` entry, containing the headers of the record
    - Adds a `A/<url>` entry, containing the content (payload) of the record (if the record is not a revisit). If the content is HTML, it also inserts a small JS script which redirects to index.html if the SW is not loaded.
  - Adds the wabac.js replayer (which also "contains" wombat).
  - Adds a "front page" (index.html) which loads the wabac SW when opened.
  - Adds a "top frame" page with an iframe and a small script (mainly in charge of syncing history and icons).
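The per-record part of the creation steps can be sketched as follows. This is only an illustration under stated assumptions, not warc2zim's actual code: records are modelled as plain dicts, and the names `REDIRECT_SNIPPET`, `record_url`, `entries_for_record`, `entries_for_warc` and the dict keys (`headers`, `payload`, `type`, `content_type`) are hypothetical.

```python
# Hypothetical stand-in for the small script that redirects to index.html
# when the ServiceWorker is not loaded yet.
REDIRECT_SNIPPET = b"<script>/* redirect to index.html if no SW */</script>"


def record_url(record: dict) -> str:
    """Return "urlkey" if present, else "WARC-Target-URI"."""
    return record.get("urlkey") or record["WARC-Target-URI"]


def entries_for_record(record: dict) -> dict:
    """Build the ZIM entries (path -> bytes) for one WARC record."""
    url = record_url(record)
    # H/<url> always holds the headers of the record.
    entries = {"H/" + url: record["headers"]}
    # A/<url> holds the payload, except for revisit records.
    if record.get("type") != "revisit":
        payload = record["payload"]
        if record.get("content_type", "").startswith("text/html"):
            # Assumption: the snippet is injected right after <head>.
            payload = payload.replace(b"<head>", b"<head>" + REDIRECT_SNIPPET, 1)
        entries["A/" + url] = payload
    return entries


def entries_for_warc(records) -> dict:
    """Loop on all records and collect their entries."""
    entries = {}
    for record in records:
        entries.update(entries_for_record(record))
    return entries
```

The replayer, front page and top frame entries are left out of the sketch, since they do not depend on the records.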
At reading
- The user goes to a page. If the SW is not loaded, the inserted script redirects to index.html, which loads the SW, registers itself as a new collection (using the "top frame" as top page) and redirects to the requested page once the collection is added.
- The SW handles the URL. It does:
  - Find the right collection (based on the book name)
  - Call coll.handleRequest, which:
    - does `getReplayResponse`, which:
      - does store.getResource(), which:
        - Does a request for `H/url` and, if not found, generates "fuzzy urls" and does a request for `H/fuzzyurl` for each fuzzy url. Once it has found a `H/(fuzzy)url` it stops. If it doesn't find a header, it returns null
        - If the header is a revisit, resolves it (by doing another request to `H/target_url`)
        - At the end, gets the payload by requesting `A/final_url`
        - Builds an `ArchiveResponse` with header and payload
      - inserts the JS script loading wombat in the html content.
      - rewrites the ArchiveResponse content.
      - merges headers from the ArchiveResponse into the SW response (`range`, `date`, `set-cookies`, ...)
      - returns the response to the requester
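The lookup done by store.getResource() can be sketched as below. This is a simplified model and not the wabac.js code: the ZIM content is a plain dict, headers are dicts, and the `revisit_target` key, the `get_resource` name and the `fuzzy_urls` callable are assumptions made for the example.

```python
from typing import Callable, Iterable, Optional


def get_resource(
    store: dict,
    url: str,
    fuzzy_urls: Callable[[str], Iterable[str]],
) -> Optional[tuple]:
    """Return (headers, payload) for url, or None if no header is found."""
    # 1. Try the exact H/url first, then each fuzzy candidate, in order.
    found = None
    for candidate in [url, *fuzzy_urls(url)]:
        if "H/" + candidate in store:
            found = candidate
            break
    if found is None:
        return None
    headers = store["H/" + found]
    # 2. A revisit header points to another record: resolve it with a
    #    second request to H/target_url (assumption: one level is enough).
    final_url = found
    if "revisit_target" in headers:
        final_url = headers["revisit_target"]
        headers = store["H/" + final_url]
    # 3. Get the payload from A/final_url and pair it with the headers.
    return headers, store["A/" + final_url]
```

The returned pair plays the role of the `ArchiveResponse`. Inserting the wombat loader, rewriting the content and merging the headers would happen on top of it.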
Wombat is loaded in all pages as a web worker. JS code is wrapped in a wombat context which rewrites outgoing urls (fetch, location changes, ...) before doing the request itself.
Comparison with pywb.
The workflow of pywb (with a WARC archive) is almost the same, with a small simplification: the rewriting and the fuzzy matching are done by the server itself, without a ServiceWorker.
- The user goes to a specific url (helped by the frontend UI).
- pywb gets the url and searches for the record (potentially with fuzzy matching).
- Once it has the record, it rewrites the payload and returns a response (merging the record's headers into the response).
Rewriting the payload is the same as what is done in the SW (replace html/css links and insert the wombat loader).
In the end, all links are relative (or point to the server).
Rewriting urls
See documentation at https://pywb.readthedocs.io/en/latest/manual/rewriter.html
All(?) the rewriting is the following: `abs_url` -> `<server_host>/<collection>/<timestamp><modifier>/abs_url`.
- `<collection>` is the name for the "set of records" (a WARC? several?). In our case, it is the book name.
- `<timestamp>` is necessary as a collection may contain records from different scrapes. In our case we have one scrape per book (and so per collection).
- `<modifier>` is how we should rewrite the content:
  - `id_` is no modification (identical).
  - `mp_` is main page. As the modification is based on the content type, `mp_` can be applied to all types of content.
  - `js_` and `cs_` force a modification as JS or CSS even if the content type is something else (html).
  - `im_`, `oe_`, `if_`, `fr_` are historical modifiers, same as `mp_`.
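As a small illustration of the scheme described above, the function below builds a rewritten url from its parts. It is a sketch only: the function name, the default values and the example host are hypothetical, and it does not reproduce pywb's own rewriter classes.

```python
# Modifiers listed in this section.
KNOWN_MODIFIERS = {"id_", "mp_", "js_", "cs_", "im_", "oe_", "if_", "fr_"}


def rewrite_url(abs_url: str, server_host: str, collection: str,
                timestamp: str = "", modifier: str = "mp_") -> str:
    """Build <server_host>/<collection>/<timestamp><modifier>/abs_url."""
    if modifier not in KNOWN_MODIFIERS:
        raise ValueError(f"unknown modifier: {modifier}")
    return f"{server_host}/{collection}/{timestamp}{modifier}/{abs_url}"


print(rewrite_url("https://example.com/a.js", "http://localhost:8080",
                  "mybook", "20200101000000", "js_"))
```

With one scrape per book, the timestamp can be left empty, which is what the default value models.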
Rewriting the content
CSS rewriter: rewrites links.
JS rewriter: rewrites a few links but mostly wraps the code in a "wombat context".
HTML rewriter: rewrites the html and uses the CSS/JS rewriters as sub-rewriters for `<style>`/`<script>` tags.
JSONP rewriter: may rewrite the content based on the request's querystring (!!!!!)
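The JSONP case is the one that depends on the request, so a sketch may help. It assumes that a JSONP body starts with the callback name followed by `(` and that the requested url carries a `callback` querystring parameter. The regular expression and the function name are illustrative and are not taken from pywb or wabac.js.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the leading function name of a JSONP body, e.g. `cb123({...})`.
JSONP_CALL = re.compile(r"^\s*([\w.]+)\(")


def rewrite_jsonp(content: str, requested_url: str) -> str:
    """Replace the recorded callback name with the one from the request."""
    callback = parse_qs(urlsplit(requested_url).query).get("callback")
    match = JSONP_CALL.match(content)
    if not callback or not match:
        # Nothing to rewrite: no callback requested, or not a JSONP body.
        return content
    return content[: match.start(1)] + callback[0] + content[match.end(1):]
```

Because the new name comes from the request and not from the record, this rewrite cannot be fully done at zim creation time.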
Proposed solution
At creation
Use the pywb rewriter module (https://github.com/webrecorder/pywb/tree/main/pywb/rewrite) to statically rewrite the content (record payload) at zim creation time.
A few things can be done statically:
- `<timestamp>`: we could remove it (or we know it)
- `<modifier>`: depends on the content type, and we know it
- `<url>`: is in the record's header
A few things may not be possible to do statically:
- `<server_host>`: depends on the production environment (host name, root prefix)
- `<collection>`: depends on the zim filename (we may change to base ourselves on the zimid?)
- `<requested_url>`: in case of a "revisit", pywb and wabac return the content of another record. Is the content rewritten based on the requested url or on the record url? In the same way, in case of fuzzy matching, the request url is different from the record url.
- jsonp needs access to the "callback" querystring value of the request.
We could do the static rewriting by setting placeholders (`${RW_SERVER_HOST}`, `${RW_URL}`, ...) for things that need to be rewritten dynamically.
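A minimal sketch of the dynamic step, assuming placeholders are written `${RW_...}` as suggested above. Only `${RW_SERVER_HOST}` and `${RW_URL}` come from this page; the function name and the choice to leave unknown placeholders untouched are assumptions.

```python
import re

# Only ${RW_...} names are treated as placeholders, so unrelated "$" or
# "${...}" sequences in JS/CSS content are left alone.
PLACEHOLDER = re.compile(r"\$\{(RW_[A-Z_]+)\}")


def fill_placeholders(content: str, values: dict) -> str:
    """Replace known placeholders; unknown ones are kept as-is."""
    return PLACEHOLDER.sub(
        lambda m: values.get(m.group(1), m.group(0)), content
    )


html = '<a href="${RW_SERVER_HOST}/book/${RW_URL}">link</a>'
print(fill_placeholders(html, {
    "RW_SERVER_HOST": "http://localhost:8080",
    "RW_URL": "example.com/",
}))
```

A regular expression is used here instead of `string.Template` because `Template` also collapses `$$` into `$`, which could alter stored JavaScript.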
At reading
- Make libzim/libkiwix understand WARC headers (or should we define our own format and rewrite WARC headers into it?)
- Make libzim/libkiwix do fuzzy matching using the headers info and fuzzy matching rules (defined in https://github.com/webrecorder/pywb/blob/main/pywb/rules.yaml, https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js or https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/fuzzymatcher.py)
- Once a payload is found, dynamically replace placeholders
- Return content
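To make "data-driven fuzzy matching" concrete, here is one possible shape for it. The rule format (a list of `match`/`replace` pairs) and the two rules themselves are invented for this sketch. They are not copied from pywb's rules.yaml or from the wabac.js fuzzymatcher.

```python
import re

# Hypothetical rule set: each rule is a regex and its replacement.
FUZZY_RULES = [
    # Drop a cache-busting "_=<digits>" query parameter.
    {"match": r"[?&]_=\d+", "replace": ""},
    # Drop the whole query string.
    {"match": r"\?.*$", "replace": ""},
]


def fuzzy_urls(url, rules=FUZZY_RULES):
    """Yield candidate urls, one per rule that actually changes the url."""
    seen = {url}
    for rule in rules:
        candidate = re.sub(rule["match"], rule["replace"], url)
        if candidate not in seen:
            seen.add(candidate)
            yield candidate


print(list(fuzzy_urls("example.com/v?id=3&_=1699")))
```

Since the rules are plain data, they could be stored in the zim file and interpreted by libzim/libkiwix without embedding any scraper-specific code.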
General workflow on kiwix-serve (WIP):
For a given requested `/content/<book_name>/<url>`:
1. Search for the zim file corresponding to `<book_name>`.
2. Search for `C/<url>`:
   - If found => answer with the content of `C/url` (with dynamic rewrite). If `H/url` exists, set the HTTP response headers with `H/url`'s headers.
   - If not found, search for `H/url` as it may be a revisit.
     - If found, replace `url` by the revisit target and do 2.
3. If no answer by 2.:
   - If a fuzzy rules definition is present in the zim file (`W/fuzzy_rules`?), generate fuzzy urls and do 2. with each fuzzy url
4. If no fuzzy rule matches, answer 404
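The workflow above can be modelled in a few lines. This sketch is not kiwix-serve code: the zim file is a plain dict, headers are dicts with a hypothetical `revisit_target` key, the fuzzy url generator is passed in as a callable, and the dynamic rewrite of the content is left out.

```python
def resolve(zim: dict, url: str, _depth: int = 0):
    """Step 2: return (headers, content) for url, or None."""
    if "C/" + url in zim:
        # H/url is optional: use its headers only when it exists.
        return zim.get("H/" + url, {}), zim["C/" + url]
    header = zim.get("H/" + url)
    # Assumption: a small depth limit protects against revisit loops.
    if header and "revisit_target" in header and _depth < 5:
        return resolve(zim, header["revisit_target"], _depth + 1)
    return None


def serve(zim: dict, url: str, fuzzy_urls=None):
    """Return (status, headers, content) for a requested url."""
    answer = resolve(zim, url)
    # Step 3: fuzzy matching only when the zim file provides rules.
    if answer is None and "W/fuzzy_rules" in zim and fuzzy_urls is not None:
        for candidate in fuzzy_urls(url, zim["W/fuzzy_rules"]):
            answer = resolve(zim, candidate)
            if answer is not None:
                break
    # Step 4: nothing matched.
    if answer is None:
        return 404, {}, b""
    headers, content = answer
    return 200, headers, content
```

A zim file without any `H/` or `W/fuzzy_rules` entry goes through the first branch of `resolve` only, which is the case for existing zim files.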
This workflow should be compatible with existing zim files (no `H` nor `W/fuzzy_rules`).
Searching by `C/url` first avoids having to add a `H/url` for the common case, even for warc2zim files.
This allows potential fuzzy matching for other zim files (specific scrapers).
Should be pretty "easy" to implement if we define well:
- The possible placeholders (`${RW_SERVER_HOST}`, ...) and their values
- The header `H/url` format (just a subset of headers to apply?)
- The fuzzy rules (how to generate fuzzy urls from the data-driven fuzzy_rules). https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js seems to be a good "specification"
Notes:
- Revisit and redirect are different: a redirect makes kiwix-serve return a 302 to the target; a revisit makes kiwix-serve answer a 2xx with the content of the revisit target.
- We may anyway store a `H` revisit as a redirect entry in the zim file.
Questions
Kelson
- How well maintained is the Python server pywb? Who uses it?
- Do we have other places, on top of "RemoteWARCProxy", where we have JavaScript code dedicated to Kiwix in Wabac/Wombat?
- Is URL rewriting really data-driven? Same question for fuzzy matching?
- Can we easily use Wombat without the rest of Wabac?
Matthieu
- What information is needed to rewrite html/css/js content? To what extent is it linked to the current request? I have identified the `callback` querystring. Others?
- Do we rewrite content using the url of the record or the requested url?
- pywb can work framed or frameless (https://pywb.readthedocs.io/en/latest/manual/configuring.html#framed-vs-frameless-replay). We are using a framed system with SW. Why? Is it necessary?
- The pywb rewriter (https://github.com/webrecorder/pywb/tree/main/pywb/rewrite) and the wabac.js rewriter (https://github.com/webrecorder/wabac.js/tree/main/src/rewrite) seem to do the same things. What are the differences (apart from implementation languages)?
- Same question for pywb fuzzymatcher (https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/fuzzymatcher.py) and wabac fuzzymatcher (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js)
renaud
- What makes the SW mandatory to replay? What is the constraint that requires it?
- If not restricted to the sole browser (i.e. kiwix-serve or any kiwix reader serving as a dynamic backend), what key information is required for wombat? Just the serving URL? Is the timestamp important?
- Fuzzy matching rules are found in wabac, wombat, pywb and warc2zim. Is this redundancy or are there multiple layers?
- What's the extent of wombat's role? How far does it go and how required is it?
- What are “prefix queries”? “prefix search”?
- How does the replayer cache system work? What's its main purpose? Can it be turned off?
- What's the difference between a page (as in pages.jsonl) and a `text/html` entry?