
Amazon-Scraper-and-Collector-Using-Go-REST-API

⚠️ This project is intended to demonstrate the interaction between multiple REST APIs developed in Go, and basic CRUD operations on MongoDB. The application's output may not be useful or real-time.

What does this project do? 💡

This is a simple Go project that scrapes product data (item name, image URL, price, description and total number of reviews posted so far) from Amazon.com and stores it in a database. The project consists of the following modules:

Scraper-API takes an Amazon page URL as input via a POST request and scrapes the required details. It then makes an internal call to the next module, Collector-API, using another POST request with the scraped details as form data.

Collector-API is triggered internally by Scraper-API to collect the fetched details and place them in the database. It internally calls the MongoDB module for storage.

MongoDB module is a container created from the official mongo image available on Docker Hub. Collector-API calls the Mongo functions to insert or retrieve data from the database.
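
The hand-off between the two APIs can be pictured with the minimal sketch below. It is an illustration only, not the repository's code: Product, scrapePage and collectorURL are assumed names, and the collector host assumes Docker Compose's default network (from the host machine it would be localhost:8081).

```go
// Illustrative sketch of the Scraper-API -> Collector-API hand-off.
// All identifiers here are assumptions, not the repository's actual code.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type Product struct {
	Name         string `json:"name"`
	ImageURL     string `json:"imageURL"`
	Description  string `json:"description"`
	Price        string `json:"price"`
	TotalReviews int    `json:"totalReviews"`
}

// Assumes the Compose service name resolves on the default network;
// from the host machine this would be http://localhost:8081/collector.
const collectorURL = "http://collector-api:8081/collector"

func scraperHandler(w http.ResponseWriter, r *http.Request) {
	var in struct {
		URL string `json:"url"`
	}
	if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}

	product, err := scrapePage(in.URL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}

	// Internal POST to Collector-API with the scraped details.
	body, _ := json.Marshal(map[string]interface{}{"url": in.URL, "product": product})
	resp, err := http.Post(collectorURL, "application/json", bytes.NewReader(body))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	fmt.Fprintf(w, "For URL: %s\nProduct details forwarded to Collector-API\n", in.URL)
}

// scrapePage is a placeholder; the real project extracts the item name,
// image URL, price, description and review count from the Amazon page.
func scrapePage(url string) (Product, error) {
	return Product{}, nil
}

func main() {
	http.HandleFunc("/scraper", scraperHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```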

Knowledge (or) Technologies Used 📚

| Sno. | Name | Usage |
| ---- | ---- | ----- |
| 1 | Go (Golang) | Go is a statically typed, compiled programming language designed at Google. It helps with speed and concurrency. More info at golang.org. |
| 2 | REST API | Representational State Transfer (REST) is a software architectural style that defines a set of constraints for creating web services. Rule zero of using it: send everything as JSON objects over the application endpoints (API socket ports). More info at restfulapi.net. |
| 3 | MongoDB | MongoDB is a cross-platform, document-oriented database program. Classified as a NoSQL database, MongoDB uses JSON-like documents with optional schemas. More info at mongodb.com. |
| 4 | Docker | Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. Containers are lightweight, virtual-machine-like environments that share the host OS kernel, yet are very scalable and efficient. More info at docker.com. |

Base System Configurations 🔧

| Sno. | Name | Version/Config. |
| ---- | ---- | --------------- |
| 1 | Operating System | Windows 10 x64 + WSL2 Ubuntu-20.04 |
| 2 | Language | Go version 1.14.7 windows/amd64 |
| 3 | IDE | Visual Studio Code version 1.52.1 |
| 4 | Containerization | Docker version 20.10.0, Docker-Compose version 1.27.4 |
| 5 | Database | MongoDB version 4.4.2 |

Having different versions probably makes no difference in usage, as the application runs in Docker containers. The development steps, of course, may differ depending on your configuration. The required software and configurations are listed under the Prerequisites section.

Prerequisites 📁

| Sno. | Software | Detail | Download Links/Steps |
| ---- | -------- | ------ | -------------------- |
| 1 | Docker version 20.10.0 or higher | Containerizes the application modules so they can be used as services. It also creates the Golang and MongoDB containers, avoiding standalone downloads of each. | docker.com/products/docker-desktop |
| 2 | Docker-Compose version 1.27.4 or higher | A Docker CLI tool that runs docker-compose.yml, which helps build/start/stop all the containers at once and with ease. On Windows or Mac, Docker-Compose is installed automatically along with Docker. | docs.docker.com/compose/install/ |
| 3 | Postman (or any equivalent) | Makes the GET requests, and most importantly the POST requests, to the APIs easy. | postman.com/downloads/ |

If you're using Postman, to avoid proxy issues while running this application, go to Postman -> File -> Settings -> Proxy and uncheck the "Use the system proxy" option.

Useful Socket-Ports: 🤝

| Sno. | Port Number | Endpoint | Defined Calls |
| ---- | ----------- | -------- | ------------- |
| 1 | 8080 | Scraper-API | http://localhost:8080/scraper |
| 2 | 8081 | Collector-API | http://localhost:8081/collector |
| 3 | 27017 | MongoDB | mongodb://localhost:27017 |

These ports are hard-coded for now, but might be dynamically bound in future developments.

Setup Application in Local 📑

Follow the steps below to recreate the application locally (in Docker containers) to scrape and save to the database.

  1. Download the whole repo and place it anywhere on your local machine. Go's workspace location constraint doesn't apply, as the application runs in containers.
  2. Open a terminal in the same location and build with docker-compose:
docker-compose build
  3. Run the application (all module containers along with the network) at once using the command:
docker-compose up -d

The -d option (short for --detach) runs the containers in the background.

Using docker-compose here has the following advantages:

  • It builds the services from our modules (Scraper-API & Collector-API) and skips building mongo, since it can pull the official MongoDB image directly.
  • It runs all the services with a single command.
  • It creates a network by default, which enables interaction between all the containers of the same app.
  • It resolves relative paths and project-root access by default, so the project location is not a constraint, unlike the native Go workspace.
  • It downloads the images mentioned in the Dockerfile(s) to create containers from them. In this app, that means the official mongo and golang images from Docker Hub.
  • We've also included the :latest tag for the mongo and golang images in the Dockerfiles, which keeps the containers up-to-date.

The last lines of the output should be:

Creating Collector-API ... done
Creating Scraper-API ... done
Creating mongodb ... done
  4. To verify the running services/containers, run the following commands:
docker-compose ps
docker ps

The first command, docker-compose ps, gives the status of the services defined in the current directory, and the second, docker ps, shows all the Docker containers running on the machine.

Output:

Amazon-Scraper-and-Collector-Using-Go-REST-API>docker-compose ps
    Name                 Command              State            Ports
--------------------------------------------------------------------------------------
Collector-API   ./collector-api               Up      0.0.0.0:8081->8081/tcp
Scraper-API     ./scraper-api                 Up      0.0.0.0:8080->8080/tcp, 8081/tcp
mongodb         docker-entrypoint.sh mongod   Up      0.0.0.0:27017->27017/tcp

Amazon-Scraper-and-Collector-Using-Go-REST-API>docker ps
CONTAINER ID   IMAGE           COMMAND                  CREATED          STATUS          PORTS                              NAMES
44fc708b30cc   scraper-api     "./scraper-api"          14 minutes ago   Up 13 minutes   0.0.0.0:8080->8080/tcp, 8081/tcp   Scraper-API
ab8af239575a   collector-api   "./collector-api"        14 minutes ago   Up 13 minutes   0.0.0.0:8081->8081/tcp             Collector-API
d28f9f2d0338   mongo:latest    "docker-entrypoint.s…"   14 minutes ago   Up 13 minutes   0.0.0.0:27017->27017/tcp           mongodb

The application is now ready to use. How to use it is explained under the Making the Calls and Using Postman sections below.

  5. Close the services/application:
docker-compose down

This stops all the services gracefully (Docker's term for a non-forced shutdown) and removes the containers and network that were created.

Note: The automatic removal wipes out only the containers and the network that were created; the 2 built images (Scraper-API & Collector-API) and, of course, the downloaded images (mongo & golang) remain. This is good: if the old containers were not removed, they might collide with newly built containers and could even lead to build failures.

Output:

Stopping Scraper-API ... done
Stopping Collector-API ... done
Stopping mongodb ... done
Removing Scraper-API ... done
Removing Collector-API ... done
Removing mongodb ... done
Removing network amazon-scraper-and-collector-using-go-rest-api_default

Making the Calls 📲

The 4 possible calls to this application are as follows:

| Sno. | Port | Method | URL | Form Data | Details | Importance |
| ---- | ---- | ------ | --- | --------- | ------- | ---------- |
| 1 | 8080 | POST | localhost:8080/scraper | Yes, the Amazon page URL | The main call of the application: scrapes the data from the Amazon page and saves it to the database. All other calls in this operation are internal. | HIGH |
| 2 | 8081 | GET | localhost:8081/collector | Not applicable | Returns all the records (documents, in Mongo terminology) in the database onto the page body, for cross-checking the data loading; can also serve future data-retrieval endpoints. | MEDIUM |
| 3 | 8080 | GET | localhost:8080/scraper | Not applicable | Only for debugging the endpoint. | LOW |
| 4 | 8081 | POST | localhost:8081/collector | Yes, all the product details | Only for debugging the endpoint. | LOW |

i. POST request to Scraper-API

URL: localhost:8080/scraper

Form Data:

{ "url":"https://www.amazon.com/PlayStation-4-Pro-1TB-Console/dp/B01LOP8EZC/" } 

That is just a sample URL from the time this project was developed. Use any Amazon page that contains only one product, so that prices and details won't collide.
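
If you'd rather make this call from code than from Postman, a small Go client could look like the following sketch (it assumes the stack is already up via docker-compose, and that the raw JSON body maps to the application/json content type):

```go
// Minimal client for the POST call above; illustrative only.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

func main() {
	payload := `{"url": "https://www.amazon.com/PlayStation-4-Pro-1TB-Console/dp/B01LOP8EZC/"}`

	resp, err := http.Post("http://localhost:8080/scraper", "application/json",
		strings.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // e.g. "Product details scraped and stored ..."
}
```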

Sample Output:

For URL: https://www.amazon.com/PlayStation-4-Pro-1TB-Console/dp/B01LOP8EZC/
Product details scraped and stored in database with ID: 5fc40bdc092ea8b4c0c2c094

There can be 3 possible types of output for this:

  1. New Record: inserts the data and returns the ID.
  2. Existing Record: confirms the existence of the record.
  3. Existing Record, but updated data on the Amazon page: confirms the existence, updates the record in the database and confirms that too.
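
A rough sketch of how this three-way decision could be made with the official mongo-go-driver is shown below. The database, collection and field names are assumptions for illustration, not necessarily the repository's schema; note that $ne on an embedded document is field-order-sensitive in MongoDB, which is fine for a sketch.

```go
// Sketch of the insert / exists / update decision; identifiers assumed.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func saveProduct(ctx context.Context, coll *mongo.Collection, url string, product bson.M) (string, error) {
	err := coll.FindOne(ctx, bson.M{"url": url}).Err()
	switch {
	case err == mongo.ErrNoDocuments:
		// Outcome 1: new record -- insert and return the generated ID.
		res, err := coll.InsertOne(ctx, bson.M{"url": url, "product": product, "last_update": time.Now()})
		if err != nil {
			return "", err
		}
		return fmt.Sprintf("stored in database with ID: %v", res.InsertedID), nil
	case err != nil:
		return "", err
	default:
		// Outcomes 2 and 3: record exists -- update only if the scraped data changed.
		res, err := coll.UpdateOne(ctx,
			bson.M{"url": url, "product": bson.M{"$ne": product}},
			bson.M{"$set": bson.M{"product": product, "last_update": time.Now()}})
		if err != nil {
			return "", err
		}
		if res.MatchedCount == 0 {
			return "record already exists", nil
		}
		return "record existed; updated with the latest page data", nil
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("scraper").Collection("products") // assumed names
	msg, err := saveProduct(ctx, coll, "https://www.amazon.com/PlayStation-4-Pro-1TB-Console/dp/B01LOP8EZC/",
		bson.M{"name": "PlayStation 4 Pro 1TB Console", "price": "$339.00"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(msg)
}
```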

ii. GET request to Collector-API

URL: localhost:8081/collector

Form Data: NA

Sample Output:

[ { "_id": "5fc40bdc092ea8b4c0c2c094", "url": "https://www.amazon.com/PlayStation-4-Pro-1TB-Console/dp/B01LOP8EZC/", "product": { "name": "PlayStation 4 Pro 1TB Console", "imageURL": "https://images-na.ssl-images-amazon.com/images/I/41GGPRqTZtL._AC_SX355_.jpg", "description": "Heighten your experiences. Enrich your adventures. Let the super charged PS4 Pro lead the way. 4K TV Gaming : PS4 Pro outputs gameplay to your 4K TV. More HD Power: Turn on Boost Mode to give PS4 games access to the increased power of PS4 Pro. HDR Technology : With an HDR TV, compatible PS4 games display an unbelievably vibrant and life like range of colors. ", "price": "$339.00", "totalReviews": 8725 }, "last_update": "2020-11-29T21:00:12.958Z" } ] 

Using Postman 📧

As explained above, Postman helps in making the calls, especially the POST calls. Disable the system proxy as explained under the Prerequisites section. The steps are pretty simple.

  1. Download Postman from official site postman.com/downloads/.
  2. Install and launch the application in local.
  3. Select the method (GET/POST) from the drop-down and enter the above-mentioned URLs (one at a time) in the address bar.
  4. For POST requests, you'll find the Body option just below the URL. Click it and proceed as follows:
    Body (sub-menu) -> Raw (radio button) -> JSON (from drop-down)
    This enables a text area where you can paste the JSON data, which is quicker than entering individual key-value pairs. Use the URLs, methods and form data mentioned under the Making the Calls section above, and compare the outputs with those listed there.

This concludes everything required to check out and use the Amazon-Scraper-Collector. Code walk-throughs will be added in future developments. For any issues, queries or discussions, please open an entry in the Issues menu or write to vaguecoder0to.n@gmail.com.

Happy Coding !! 🤘
