
I started practicing web scraping a few days ago. I wrote this code to extract data from a Wikipedia page. The page has several tables that classify mountains by height, but the tables differ in size: some have 5 columns while others have 4. So I wrote this algorithm to extract all the names and attributes of the mountains into separate lists. My approach was to build a lengths list containing the number of <td> tags within the <tr> tags. The algorithm detects which tables have four columns and fills the missing column (relative to the 5-column tables) with None. However, I believe there is a more efficient and more Pythonic way to do this, especially in the part where I call find_next() repeatedly. Any suggestions are welcome.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    URL = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
    content = requests.get(URL).content
    soup = BeautifulSoup(content, 'html.parser')

    all_tables = soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})

    mountain_names = []
    metres_KM = []
    metres_FT = []
    range_Mnt = []
    location = []
    lengths = []

    for table in range(len(all_tables)):
        x = all_tables[table].find("tr").find_next("tr")
        y = x.find_all("td")
        lengths.append(len(y))
        for row in all_tables[table].find_all("tr"):
            try:
                mountain_names.append(row.find("td").text)
                metres_KM.append(row.find("td").find_next("td").text)
                metres_FT.append(row.find("td").find_next("td").find_next("td").text)
                if lengths[table] == 5:
                    range_Mnt.append(row.find("td").find_next("td").find_next("td").find_next("td").text)
                else:
                    range_Mnt.append(None)
                location.append(row.find("td").find_next("td").find_next("td").find_next("td").find_next("td").text)
            except:
                pass
  • Is the code working as expected? (Jun 25, 2018 at 23:17)
  • Yes, totally. However, I want to find out a better way to scrape tables rather than using find_next() all the time. (Jun 25, 2018 at 23:18)
  • Alright; by the way, welcome to Code Review. Hopefully you receive good answers! (Jun 25, 2018 at 23:19)
  • Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. (Jun 26, 2018 at 14:59)

1 Answer


You're just looping on the rows, but not on the cells:

 for row in all_tables[table].find_all("tr"): 

Rather than chaining multiple find_next("td") calls one after the other, add an inner loop using row.find_all('td') and append each row's cells to a 2D list.

Manipulating a 2D array is much easier and will make your code look much cleaner than row.find("td").find_next("td").find_next("td").

Good luck!


To be more specific, this code snippet from @shaktimaan shows the pattern:

    data = []
    table = soup.find('table', attrs={'class': 'lineItemsTable'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
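Applied to the page from the question, that pattern might look like the sketch below. Note the snippet above cannot be used verbatim: its 'lineItemsTable' class filter belongs to a different page, so the filter here is swapped for the sortable/plainrowheaders classes used in the question, and four-column rows are padded with None in the "Range" slot to mirror the original code's behaviour:

```python
from bs4 import BeautifulSoup

def parse_mountain_tables(html):
    """Collect every data row of the page's sortable tables into a 2D list,
    padding four-column rows with None so each row has five cells."""
    soup = BeautifulSoup(html, "html.parser")
    data = []
    for table in soup.find_all("table", {"class": ["sortable", "plainrowheaders"]}):
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
            if not cells:        # header rows use <th>, so they yield no <td> cells
                continue
            if len(cells) == 4:  # tables that omit the "Range" column
                cells.insert(3, None)
            data.append(cells)
    return data

# html = requests.get("https://en.wikipedia.org/wiki/List_of_mountains_by_elevation").content
# rows = parse_mountain_tables(html)
```

Once all rows live in one 2D list, the separate name/height/range/location lists fall out with a single loop or zip(*data). And since pandas is already imported in the question, pandas.read_html(URL) is another option worth knowing: it parses every table on the page into DataFrames in one call.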
  • Thank you for your reply. Since I am new to scraping and to Python in general, do you mean that I should replace the try part of my code with this loop? I did that, by the way, but the data list is empty. (Jun 26, 2018 at 11:57)
