Create documentation #32

Open
thisistaimur opened this issue May 18, 2021 · 15 comments

Comments

@thisistaimur

No description provided.

@ikreymer
Member

Yes, I would like to, just haven't had time thus far :)

Are there particular areas that you are most interested in? How are you using wabac.js (if you are)?

Would also welcome contributions in this area if you have time to help :)

@thisistaimur
Author

Cheers for the quick reply!
I am testing tools for WARC replay because ReplayWeb.page relies heavily on Shadow DOM elements, which makes it difficult to work with JS scripts and WARC replays in the browser. I have not used wabac.js before; I came across it but don't really know where to start testing it 🔢
And yes, I will try to contribute as much as I can, it's a neat project. Is there any place you think I should start? :)

@Jaifroid

Jaifroid commented Sep 6, 2022

@ikreymer Is there any possible timeline on documenting the API for wabac.js? Over at Kiwix, we are particularly interested in URL rewriting functions... I understand how busy you are with various aspects of the Replay.Web project, so this is just a gentle nudge. Many thanks.

@mattbbc

mattbbc commented Jan 26, 2023

@ikreymer I would like to add my own interest to this issue, we have terabytes of WARC files which we would like to present to our users and we would like to develop a playback frontend. I'd like to know how to use this to determine if it suits our needs, but like @thisistaimur mentioned I don't know where to start with it.

@Jaifroid

Maybe a simple test case / demo showing how to use the main entry points of the API that appear to be exposed in https://github.com/webrecorder/wabac.js/blob/main/src/api.js ?

@ikreymer
Member

@mattbbc are there reasons that replayweb.page won't work for your use case for a frontend? We've built that specifically to act as a customizable frontend for replay, and have documentation on how to use it and embed it in other sites. If the main goal is just to replay WARC files, we recommend using replayweb.page, or building on top of that, rather than working with wabac.js directly. We also recommend packaging WARC files to WACZ, so you can benefit from efficient loading. I would recommend starting with replayweb.page and see from there if there are gaps in what you want it to do.

Still, I agree that we should have more docs, especially for customizations that don't work with WARC/WACZ files, such as @Jaifroid's use case, which uses ZIM files. We'll see what we can do in creating a simple example that sets up replay and uses the API to get some basic info about the archive.

@Jaifroid

If you get a chance to do something along those lines, it would be really helpful! 😊

@mattbbc

mattbbc commented Jan 27, 2023

Thanks @ikreymer, we just started looking at replayweb.page yesterday and it's easy to get a basic embed working, thanks for your work on that. It works beautifully.

Some of our historical warc files are quite large, and we were considering an application that would load parts of warcs via range requests rather than entire warc files from a url. This might be an Express API that would load the relevant parts of the warc, inject any of our code, then send the rendered document to the user. I wasn't sure if this package would help with that idea or isn't the right thing.

We were also considering something like a React application that could load in the warc url dynamically. Maybe the npm package for replayweb.page might be more appropriate for that but I haven't gotten very far just yet - we're just starting to take stock of our options. I did try using the npm package in a React application to test but I haven't figured that out yet. I get a few build errors with it e.g. an error in pages.js in the npm package where there are imports from flexsearch that aren't being used, and __VERSION__ being undefined in misc.js, but it's entirely possible that I'm not using the package correctly.

@thisistaimur
Author

thisistaimur commented Jan 27, 2023

Hey @mattbbc, I was able to set up my own WARC server with PyWb, which has decent documentation. I was able to bypass the Shadow DOM issue by hosting the PyWb server and the client page on the same domain (different subdomains). That worked as far as accessing the WARC content via the browser went. After much digging, I figured out that WARCs with Shadow DOMs can be accessed from a parent page if the domain of the parent page and the Shadow DOM WARC match.

@mattbbc

mattbbc commented Jan 27, 2023

Where do you host your WARC files? Ours are all in S3.

@thisistaimur
Author

@mattbbc I keep them within the Docker container where I host PyWb. PyWb serves the files. PyWb can also serve WARC files from S3 buckets.

@ikreymer
Member

ikreymer commented Feb 1, 2023

We have two different tools for hosting web archives. The ReplayWeb.page system provides a 'serverless' replay system, where web archives can be loaded directly from static storage/S3 with a web-based system. The idea is that web archives are replayed in the browser, the way you'd replay a video or a PDF file. This library is a component of that system, with replayweb.page being the main user-facing tool.

We also have our older tool, pywb, which is a more traditional web replay system, where you need to run it as a server and users access the web archives through the server. We are continuing to develop both.

If you want users to access web archives directly from S3 without having to maintain a custom server, then replayweb.page should be able to do the job. There are other trade-offs between the tools and the formats needed:

- To view archives in replayweb.page, WARCs are generally not sufficient because they don't have a built-in index, so we have the WACZ format that allows you to pre-package archives in a way that can be accessed more quickly.
- If you have mostly a fixed amount of data already stored on S3, converting to WACZ may be a reasonable and cost-effective approach.
- If you are doing continuous crawling and need fast updates, pywb may be a better option, or some combination.

@mattbbc Happy to chat more if you'd like additional guidance for your use case!
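
To illustrate the point above that a plain WARC has no built-in index, here is a minimal sketch of what an index could look like: a list of record offsets and lengths built by scanning the file once. It assumes an uncompressed WARC held in a Node.js Buffer. The function name `indexWarc` and the shape of the index entries are made up for this illustration; this is not how wabac.js, pywb, or WACZ actually store their indexes.

```javascript
// Illustrative only: scan an uncompressed WARC (in a Buffer) and record where
// each record starts and how long it is. `indexWarc` is a hypothetical helper,
// not part of wabac.js or any other library.
function indexWarc(buf) {
  const index = [];
  let pos = 0;
  while (pos < buf.length) {
    // WARC headers end at the first blank line (CRLF CRLF).
    const headerEnd = buf.indexOf("\r\n\r\n", pos);
    if (headerEnd === -1) break;

    const headers = {};
    const lines = buf.toString("utf8", pos, headerEnd).split("\r\n");
    // lines[0] is the version line (e.g. "WARC/1.0"); the rest are "Name: value".
    for (const line of lines.slice(1)) {
      const sep = line.indexOf(":");
      if (sep > 0) {
        headers[line.slice(0, sep).trim().toLowerCase()] = line.slice(sep + 1).trim();
      }
    }

    const contentLength = Number(headers["content-length"]);
    if (!Number.isFinite(contentLength)) break;

    // headers + blank line + content block + the two CRLFs that end a record
    const length = headerEnd + 4 - pos + contentLength + 4;
    index.push({
      type: headers["warc-type"],
      url: headers["warc-target-uri"],
      offset: pos,
      length,
    });
    pos += length;
  }
  return index;
}
```

Without such an index, finding one URL means scanning the whole file, which is why large bare WARCs load slowly. Real .warc.gz files are usually compressed record by record, so a practical indexer would record offsets into the compressed file instead.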

ReplayWeb.page also uses web components, so you can place a tag on a page and have a full web archive load. @thisistaimur I don't quite understand your concern with web components/shadow DOM. What are you trying to do that you are having issues with? Are you injecting custom code into the web archive replay?

@ikreymer
Member

ikreymer commented Feb 1, 2023

Some of our historical warc files are quite large, and we were considering an application that would load parts of warcs via range requests rather than entire warc files from a url. This might be an Express API that would load the relevant parts of the warc, inject any of our code, then send the rendered document to the user. I wasn't sure if this package would help with that idea or isn't the right thing.

We have created the WACZ format to address this particular issue. With WACZ, the WARC files are packaged into a ZIP file which is then read via range requests. You can package existing WARCs into WACZ files using the py-wacz tool, and then load the resulting WACZ in replayweb.page instead of WARCs.
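
As a rough illustration of the range-request idea (not the actual loader code in wabac.js or replayweb.page), the sketch below reads a slice of a remote file with an HTTP `Range` header. It assumes Node.js 18+ or a browser, where `fetch` is available globally, and a server that supports range requests, as S3 does. The helper names `rangeHeader` and `fetchRange` are hypothetical.

```javascript
// Illustrative only: build an HTTP Range header value for `length` bytes
// starting at `offset`. HTTP byte ranges are inclusive at both ends.
function rangeHeader(offset, length) {
  if (!Number.isInteger(offset) || !Number.isInteger(length) || offset < 0 || length <= 0) {
    throw new RangeError("offset must be >= 0 and length must be > 0");
  }
  return `bytes=${offset}-${offset + length - 1}`;
}

// Fetch just that slice of the file instead of downloading all of it.
async function fetchRange(url, offset, length) {
  const res = await fetch(url, { headers: { Range: rangeHeader(offset, length) } });
  // A server that honours the Range header answers 206 Partial Content.
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

A reader of a ZIP-based format would typically fetch the directory at the end of the file first, then use the offsets it finds there to request individual entries.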

Are the WARCs that you have mostly fixed, or are you doing continuous crawling?
It sounds like this can work for your use case?
We are also working on other ways of doing this, more integration between wabac.js and pywb in the future (if you need to add WARCs dynamically for example).

We were also considering something like a React application that could load in the warc url dynamically. Maybe the npm package for replayweb.page might be more appropriate for that but I haven't gotten very far just yet - we're just starting to take stock of our options. I did try using the npm package in a React application to test but I haven't figured that out yet. I get a few build errors with it e.g. an error in pages.js in the npm package where there are imports from flexsearch that aren't being used, and __VERSION__ being undefined in misc.js, but it's entirely possible that I'm not using the package correctly.

We are using webpack and it currently predefines __VERSION__, so it needs to be defined to build.
But it's likely that you don't need to do that if your main goal is to provide standard replay for WARC files - you should be able to create WACZ files and then use the standard replayweb.page embed in your React application.
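
For anyone who does want to build from the npm package, one way to define such a build-time constant is webpack's `DefinePlugin`. The fragment below is a sketch for a consuming project's own `webpack.config.js`, not a copy of the project's build config; the version string is a placeholder, and other constants may need defining as well.

```javascript
// webpack.config.js (fragment) - illustrative only
const webpack = require("webpack");

module.exports = {
  // ...the rest of your existing configuration...
  plugins: [
    new webpack.DefinePlugin({
      // DefinePlugin substitutes the expression textually, so a string
      // value has to be quoted, hence JSON.stringify. Placeholder value.
      __VERSION__: JSON.stringify("0.0.0-local"),
    }),
  ],
};
```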

I realize this is a gap in documentation for replayweb.page - we should mention what to do if you're starting with a large set of WARC files.

As this issue is about documentation in general, perhaps we can continue over email (info [at] webrecorder.net) or also on our forum at https://forum.webrecorder.net/ to discuss this specific use case.

@mattbbc

mattbbc commented Feb 1, 2023

We have created the WACZ format to address this particular issue. With WACZ, the WARC files are packaged into a ZIP file which is then read via range requests. You can package existing WARCs into WACZ files using the py-wacz tool, and then load the resulting WACZ in replayweb.page instead of WARCs.

Are the WARCs that you have mostly fixed, or are you doing continuous crawling? It sounds like this can work for your use case? We are also working on other ways of doing this, more integration between wabac.js and pywb in the future (if you need to add WARCs dynamically for example).

At the moment we have inherited several terabytes of historical WARC files that are currently in a couple of formats, as in .warc or .warc.gz as well as some .warc.gz.cdx files. We intend to start performing our own crawling and capture too, and we're playing with a Lambda prototype that can capture WARCs using node-warc, which is getting on a bit now, but it works at the moment. We may also want to present a calendar of captures for a given URL similar to the Wayback Machine, and I was investigating pywb for that purpose too but it's early days.
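
As a small sketch of the calendar idea, capture timestamps can be grouped by day before rendering. This assumes the timestamps are 14-digit strings in the `YYYYMMDDhhmmss` form commonly used in CDX indexes. The function name `groupCapturesByDay` is hypothetical, and how the timestamps are obtained (for example, from an index lookup for one URL) is left out.

```javascript
// Illustrative only: group 14-digit capture timestamps (YYYYMMDDhhmmss)
// by calendar day. Returns a Map of "YYYY-MM-DD" -> ["hh:mm:ss", ...].
function groupCapturesByDay(timestamps) {
  const byDay = new Map();
  for (const ts of timestamps) {
    // Skip anything that is not exactly 14 digits.
    if (!/^\d{14}$/.test(ts)) continue;
    const day = `${ts.slice(0, 4)}-${ts.slice(4, 6)}-${ts.slice(6, 8)}`;
    const time = `${ts.slice(8, 10)}:${ts.slice(10, 12)}:${ts.slice(12, 14)}`;
    if (!byDay.has(day)) byDay.set(day, []);
    byDay.get(day).push(time);
  }
  return byDay;
}
```

Each key of the resulting Map corresponds to one highlighted day in a calendar view, and the array under it lists the captures made on that day.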

I have a simple harness / HTML page with the example embedding from the replayweb.page docs. I tested it with one of our larger WARC files after using your tool to wrap it up into a .wacz file, and it loads significantly faster. Thanks for the information on that!

We are using webpack and it currently predefines __VERSION__, so it needs to be defined to build. But it's likely that you don't need to do that if your main goal is to provide standard replay for WARC files - you should be able to create WACZ files and then use the standard replayweb.page embed in your React application.

We're not keen on using a CDN for anything important, which is why I was looking into the npm module. I wasn't able to get the standard embedding code to work, as React complains when you try to use the <replay-web-page> element in a JSX file, and I couldn't see anything I could import from the npm module without some errors coming up in the console. I imagine there's a gap in my knowledge there. We can always just download the ui.js and sw.js files from the CDN and self-host them, and React isn't strictly necessary unless we want to build something more complicated.
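
One framework-neutral way to sidestep JSX here is to generate the embed markup as a string and insert it into a container element once the self-hosted ui.js has loaded. The sketch below assumes the `source` and `url` attributes of `<replay-web-page>` as described in the replayweb.page embedding docs, so check them against the version you self-host. The helper names `escapeAttr` and `replayEmbedMarkup` are hypothetical.

```javascript
// Illustrative only: escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build the embed tag for a given archive (`source`) and starting page (`url`).
// The attribute names are assumed from the replayweb.page embedding docs.
function replayEmbedMarkup({ source, url }) {
  return `<replay-web-page source="${escapeAttr(source)}" url="${escapeAttr(url)}"></replay-web-page>`;
}
```

In a browser, the result could be assigned to `container.innerHTML` for a container element on the page, which also makes it straightforward to switch the archive URL dynamically.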

It might be more appropriate to take my queries to the forum instead of constantly harassing this GH issue, though!

@thisistaimur
Author

ReplayWeb.page also uses web components, so you can place a tag on a page and have a full web archive load. @thisistaimur I don't quite understand your concern with web components/shadow DOM. What are you trying to do that you are having issues with? Are you injecting custom code into the web archive replay?

@ikreymer I use the postMessage method on the window object of the shadow DOM. So yeah I inject custom code.
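
For readers unfamiliar with the pattern, here is a minimal sketch of the receiving side of `window.postMessage`, with the origin check that cross-window messaging normally needs. It is a generic illustration, not the code used in this setup. The helper name `makeMessageHandler` and the origin shown in the usage comment are hypothetical.

```javascript
// Illustrative only: wrap a callback so that it only sees messages sent
// from one trusted origin. Returns true if the message was accepted.
function makeMessageHandler(allowedOrigin, onMessage) {
  return function handle(event) {
    // Ignore messages from any other origin.
    if (event.origin !== allowedOrigin) return false;
    onMessage(event.data);
    return true;
  };
}

// Browser usage (hypothetical origin):
//   window.addEventListener(
//     "message",
//     makeMessageHandler("https://archive.example.org", (data) => console.log(data))
//   );
```

The sending side would call `targetWindow.postMessage(data, targetOrigin)` on the replay frame's window, passing the same origin rather than `"*"`.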
