I started practicing web scraping a few days ago. I wrote this code to extract data from a Wikipedia page. The page has several tables that classify mountains by height, but the tables are not all the same width: some contain 5 columns while others contain 4. So I wrote this algorithm to extract all the names and attributes of the mountains into separate lists. My approach was to build a `lengths` list that holds the number of `<td>` tags within the `<tr>` tags of each table. The algorithm detects which tables contain four columns and fills the missing column (relative to the 5-column case) with `None`. However, I believe there is a more efficient and more Pythonic way to do this, especially in the part where I call `find_next()` repeatedly. Any suggestions are welcome.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
content = requests.get(URL).content
soup = BeautifulSoup(content, 'html.parser')

all_tables = soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})

mountain_names = []
metres_KM = []
metres_FT = []
range_Mnt = []
location = []
lengths = []

for table in range(len(all_tables)):
    x = all_tables[table].find("tr").find_next("tr")
    y = x.find_all("td")
    lengths.append(len(y))
    for row in all_tables[table].find_all("tr"):
        try:
            mountain_names.append(row.find("td").text)
            metres_KM.append(row.find("td").find_next("td").text)
            metres_FT.append(row.find("td").find_next("td").find_next("td").text)
            if lengths[table] == 5:
                range_Mnt.append(row.find("td").find_next("td").find_next("td").find_next("td").text)
            else:
                range_Mnt.append(None)
            location.append(row.find("td").find_next("td").find_next("td").find_next("td").find_next("td").text)
        except:
            pass
```
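To show the kind of alternative I had in mind: instead of chaining `find_next()`, each row's cells can be collected at once with `row.find_all("td")` and the short (4-column) rows padded with `None`, so every row yields the same five fields. The HTML below is a simplified stand-in for the Wikipedia tables (the real page's markup differs, e.g. it may use `<th>` header cells and extra attributes), so treat this as a sketch of the pattern, not a tested drop-in replacement:

```python
from bs4 import BeautifulSoup

# Simplified sample HTML: one 5-column table and one 4-column table,
# standing in for the Wikipedia page's mixed-width tables.
html = """
<table>
  <tr><th>Mountain</th><th>Metres</th><th>Feet</th><th>Range</th><th>Location</th></tr>
  <tr><td>Everest</td><td>8848</td><td>29029</td><td>Himalayas</td><td>Nepal/China</td></tr>
</table>
<table>
  <tr><th>Mountain</th><th>Metres</th><th>Feet</th><th>Location</th></tr>
  <tr><td>SmallPeak</td><td>1000</td><td>3281</td><td>Somewhere</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for table in soup.find_all("table"):
    for tr in table.find_all("tr"):
        # Grab every data cell of the row in one call.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if not cells:          # header rows use <th>, so they produce no <td> cells
            continue
        if len(cells) == 4:    # 4-column tables lack the "Range" column
            cells.insert(3, None)
        rows.append(cells)

for name, metres, feet, mountain_range, loc in rows:
    print(name, metres, feet, mountain_range, loc)
```

This also avoids the bare `try`/`except: pass`, which in my version can silently leave the lists with different lengths when a lookup fails partway through a row.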