akamhy/waybackpy


Python package & CLI tool that interfaces with the Wayback Machine APIs


Introduction

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.

The Internet Archive's Wayback Machine has three useful public APIs.

  • SavePageNow or Save API
  • CDX Server API
  • Availability API

All three APIs can be accessed through waybackpy, either by importing it in a Python file or module or by using the command-line interface.

Installation

Using pip, from PyPI (recommended):

pip install waybackpy -U

Using conda, from conda-forge (recommended):

conda install -c conda-forge waybackpy

See also the waybackpy feedstock; its maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.

Install directly from this git repository (NOT recommended):

pip install git+https://github.com/akamhy/waybackpy.git

Docker Image

Docker Hub: hub.docker.com/r/secsi/waybackpy

The Docker image is automatically updated on every release by Regularly and Automatically Updated Docker Images (RAUDI).

RAUDI is a tool by SecSI, an Italian cybersecurity startup.

Usage

As a Python package

Save API aka SavePageNow

>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
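
The datetime returned by timestamp() corresponds to the 14-digit YYYYMMDDhhmmss field embedded in the archive URL. As a minimal sketch, the same value can be recovered from the URL string alone. The helper archive_url_to_datetime is hypothetical and not part of waybackpy.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of waybackpy: extract the 14-digit
# timestamp from a Wayback Machine archive URL and parse it.
_ARCHIVE_URL = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/")


def archive_url_to_datetime(archive_url: str) -> datetime:
    match = _ARCHIVE_URL.match(archive_url)
    if match is None:
        raise ValueError(f"not a Wayback Machine archive URL: {archive_url!r}")
    # Wayback timestamps use the layout YYYYMMDDhhmmss.
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


saved = archive_url_to_datetime(
    "https://web.archive.org/web/20220118125249/https://github.com/"
)
print(saved)  # 2022-01-18 12:52:49
```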

CDX API aka CDXServerAPI

>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
oldest
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
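
A snapshot prints as one CDX record with seven space-separated fields in the CDX server's default order: urlkey, timestamp, original, mimetype, statuscode, digest and length. The sketch below shows how such a line maps onto the attributes used above and how archive_url follows from timestamp and original. CdxRecord and parse_cdx_line are hypothetical names for illustration, not waybackpy's own classes.

```python
from dataclasses import dataclass


@dataclass
class CdxRecord:
    # Hypothetical container; field order follows the default CDX output.
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str

    @property
    def archive_url(self) -> str:
        # Archive URLs have the form /web/<timestamp>/<original URL>.
        return f"https://web.archive.org/web/{self.timestamp}/{self.original}"


def parse_cdx_line(line: str) -> CdxRecord:
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return CdxRecord(*fields)


record = parse_cdx_line(
    "com,google)/ 19981111184551 http://google.com:80/ text/html 200 "
    "HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381"
)
print(record.archive_url)
```
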
newest
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
near
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>>
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>>
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
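
The unix_timestamp call returns the same snapshot as the year/month/day/hour/minute call because 1286705410 is 2010-10-10 10:10:10 UTC. A minimal sketch of that conversion follows. The helper unix_to_wayback_timestamp is hypothetical, and waybackpy's internal conversion may differ.

```python
from datetime import datetime, timezone


def unix_to_wayback_timestamp(unix_timestamp: int) -> str:
    # Hypothetical helper: interpret the Unix time as UTC and format it
    # in the 14-digit YYYYMMDDhhmmss layout used by the Wayback Machine.
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


print(unix_to_wayback_timestamp(1286705410))  # 20101010101010
```
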
snapshots
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
.
# URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
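
The output above includes captures from December 2017, which suggests that a partial end_timestamp such as 2017 covers the whole of that year. The sketch below illustrates that inclusive behaviour with a client-side prefix comparison. It is an assumption made for illustration, not the CDX server's or waybackpy's implementation; in_timestamp_range is a hypothetical helper, and the first and last sample timestamps are made-up boundary values.

```python
def in_timestamp_range(timestamp: str, start: str, end: str) -> bool:
    # Compare only as many leading digits as each (possibly partial)
    # bound provides, so "2017" matches every capture made in 2017.
    return timestamp[: len(start)] >= start and timestamp[: len(end)] <= end


captures = [
    "20151231235959",  # made-up value just before the range
    "20160110011047",  # from the example output
    "20171206002737",  # from the example output
    "20180101000000",  # made-up value just after the range
]
selected = [ts for ts in captures if in_timestamp_range(ts, "2016", "2017")]
print(selected)  # ['20160110011047', '20171206002737']
```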

Availability API

Using the Availability API is not recommended because of performance issues. All the methods of the Availability API interface class, WaybackMachineAvailabilityAPI, are also implemented in the CDX server API interface class, WaybackMachineCDXServerAPI. Note also that the snapshot returned by the newest() method of WaybackMachineAvailabilityAPI can be more recent than the one returned by the same method of WaybackMachineCDXServerAPI.

>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
newest
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
near
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
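
For context, the Availability API answers with a small JSON document whose archived_snapshots.closest object carries the snapshot URL and timestamp. The sketch below reads such a payload without any network access. The payload is a hand-written sample (its URL and timestamp are copied from the near() example above), and closest_archive_url is a hypothetical helper, not a waybackpy function.

```python
import json
from typing import Optional

# Hand-written sample payload, an assumption for illustration only.
SAMPLE_RESPONSE = """
{
  "url": "google.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "https://web.archive.org/web/20101010101708/http://www.google.com/",
      "timestamp": "20101010101708"
    }
  }
}
"""


def closest_archive_url(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL, or None when nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


print(closest_archive_url(json.loads(SAMPLE_RESPONSE)))
# An empty archived_snapshots object means no snapshot was found.
print(closest_archive_url({"url": "example.invalid", "archived_snapshots": {}}))
```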

Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.

As a CLI tool

A demo video is available on asciinema.org; text can be copied directly from the video.

CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.

CONTRIBUTORS

AUTHORS

ACKNOWLEDGEMENTS