Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.
Internet Archive's Wayback Machine has 3 useful public APIs.
- SavePageNow or Save API
- CDX Server API
- Availability API
All three APIs can be accessed through waybackpy, either by importing it in a Python file/module or from the command-line interface.
Using pip, from PyPI (recommended):
pip install waybackpy -U
Using conda, from conda-forge (recommended):
conda install -c conda-forge waybackpy
See also the waybackpy feedstock; its maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.
Install directly from this git repository (NOT recommended):
pip install git+https://github.com/akamhy/waybackpy.git
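Whichever install method was used, a quick sanity check is to ask Python which version it can see. The snippet below is a minimal sketch that relies only on the standard library; the helper name `installed_version` is made up for this example and is not part of waybackpy.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string if the install succeeded, otherwise None.
    print(installed_version("waybackpy"))
```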
Docker Hub: hub.docker.com/r/secsi/waybackpy
The Docker image is automatically updated on every release by Regularly and Automatically Updated Docker Images (RAUDI).
RAUDI is a tool by SecSI, an Italian cybersecurity startup.
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
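The archive URL returned by the save call embeds a 14-digit capture timestamp followed by the original URL, which is why the timestamp in the session above matches the digits in the URL. As a rough illustration, the sketch below splits such a URL using only the standard library. The helper `split_archive_url` and the regular expression are assumptions made for this example; they are not waybackpy functions.

```python
import re
from datetime import datetime
from typing import Tuple

# Assumed URL shape, taken from the save() output shown above:
# https://web.archive.org/web/<14-digit timestamp>/<original URL>
ARCHIVE_URL_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_archive_url(archive_url: str) -> Tuple[datetime, str]:
    """Return (capture time, original URL) for a Wayback Machine archive URL."""
    match = ARCHIVE_URL_RE.match(archive_url)
    if match is None:
        raise ValueError(f"not a Wayback Machine archive URL: {archive_url!r}")
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original


captured_at, original = split_archive_url(
    "https://web.archive.org/web/20220118125249/https://github.com/"
)
print(captured_at)  # 2022-01-18 12:52:49
print(original)     # https://github.com/
```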
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
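Each snapshot printed above is a space-separated CDX record with seven fields: URL key, timestamp, original URL, MIME type, status code, digest and length. To make that mapping explicit, here is a minimal, standard-library sketch that parses one such line and derives the same attributes. `CdxRecord` and `parse_cdx_line` are hypothetical names used only for illustration; this is not waybackpy's own snapshot class.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdxRecord:
    """Hypothetical container for one space-separated CDX line."""

    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str

    @property
    def datetime_timestamp(self) -> datetime:
        # The 14-digit timestamp has the form YYYYMMDDhhmmss.
        return datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")

    @property
    def archive_url(self) -> str:
        return f"https://web.archive.org/web/{self.timestamp}/{self.original}"


def parse_cdx_line(line: str) -> CdxRecord:
    """Split a CDX line into its seven fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return CdxRecord(*fields)


SAMPLE_LINE = (
    "com,google)/ 19981111184551 http://google.com:80/ "
    "text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381"
)
record = parse_cdx_line(SAMPLE_LINE)
print(record.archive_url)
# https://web.archive.org/web/19981111184551/http://google.com:80/
print(record.datetime_timestamp)  # 1998-11-11 18:45:51
```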
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
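The keyword-argument call and the Unix-timestamp call above return the same snapshot because both describe nearly the same instant. The sketch below shows the conversion with the standard library, assuming Wayback timestamps are UTC and use the YYYYMMDDhhmmss layout. How waybackpy performs this conversion internally is not shown here, and both helper names are invented for the example.

```python
from datetime import datetime, timezone


def unix_to_wayback_timestamp(unix_timestamp: int) -> str:
    """Format a Unix timestamp as a 14-digit UTC Wayback-style timestamp."""
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


def parts_to_wayback_timestamp(
    year: int, month: int, day: int, hour: int = 0, minute: int = 0, second: int = 0
) -> str:
    """Format individual date parts as a 14-digit Wayback-style timestamp."""
    return datetime(year, month, day, hour, minute, second).strftime("%Y%m%d%H%M%S")


print(unix_to_wayback_timestamp(1286705410))            # 20101010101010
print(parts_to_wayback_timestamp(2010, 10, 10, 10, 10))  # 20101010101000
```

The two requested instants differ by ten seconds, and the closest capture to both is the one taken at 20101010101435.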
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
.  # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
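For orientation, a date-bounded listing like the one above corresponds to a CDX Server query with `from` and `to` parameters. The following sketch only builds such a query string with the standard library and performs no network request. The exact parameters waybackpy sends are an assumption here, and `build_cdx_query` is a hypothetical helper, not part of the package.

```python
from typing import Optional, Union
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(
    url: str,
    start_timestamp: Optional[Union[int, str]] = None,
    end_timestamp: Optional[Union[int, str]] = None,
) -> str:
    """Build a CDX Server query URL restricted to an optional time range."""
    params = {"url": url}
    if start_timestamp is not None:
        params["from"] = str(start_timestamp)  # lower bound, e.g. 2016
    if end_timestamp is not None:
        params["to"] = str(end_timestamp)      # upper bound, e.g. 2017
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_query("https://pypi.org", start_timestamp=2016, end_timestamp=2017))
# https://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fpypi.org&from=2016&to=2017
```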
Using the availability API is not recommended because of performance issues. All the methods of the availability API interface class, WaybackMachineAvailabilityAPI, are also implemented in the CDX server API interface class, WaybackMachineCDXServerAPI. Also note that the archive returned by the newest() method of WaybackMachineAvailabilityAPI can be more recent than the one returned by the same method of WaybackMachineCDXServerAPI.
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
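Under the hood, the Availability API is a single JSON endpoint that reports the closest snapshot to a requested timestamp. The sketch below builds a query URL and reads the closest snapshot out of a response body, using only the standard library and no network access. The helper names are hypothetical and `SAMPLE_RESPONSE` is a hand-written illustration of the response shape, not captured output.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build an Availability API query URL, optionally targeting a timestamp."""
    params = {"url": url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_archive_url(response_text: str) -> Optional[str]:
    """Return the closest snapshot URL from a response body, or None."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


# Hand-written sample for illustration only (not real captured output).
SAMPLE_RESPONSE = json.dumps(
    {
        "archived_snapshots": {
            "closest": {
                "available": True,
                "url": "https://web.archive.org/web/20101010101708/http://www.google.com/",
                "timestamp": "20101010101708",
                "status": "200",
            }
        }
    }
)

print(build_availability_query("google.com", "20101010100000"))
# https://archive.org/wayback/available?url=google.com&timestamp=20101010100000
print(closest_archive_url(SAMPLE_RESPONSE))
# https://web.archive.org/web/20101010101708/http://www.google.com/
```

An empty `archived_snapshots` object means no snapshot was found, which the sketch maps to `None`.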
Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.
A demo video is available on asciinema.org; you can copy the text from the video.
CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.
- akamhy (https://github.com/akamhy)
- eggplants (https://github.com/eggplants)
- danvalen1 (https://github.com/danvalen1)
- AntiCompositeNumber (https://github.com/AntiCompositeNumber)
- rafaelrdealmeida (https://github.com/rafaelrdealmeida)
- jonasjancarik (https://github.com/jonasjancarik)
- jfinkhaeuser (https://github.com/jfinkhaeuser)
- mhmdiaa (https://github.com/mhmdiaa): --known-urls is based on this gist.
- dequeued0 (https://github.com/dequeued0) for reporting bugs and useful feature requests.