I am currently working on a program that scrapes article metadata from a psychology database. Since not every record is formatted perfectly, I have accepted that some of the info may not be found. That said, I still want to fill in whatever I do find. As a result, I've wrapped basically every lookup in its own try block, which is really ugly. Any ideas about how to fix it? I'm also open to other cleanliness advice.

    # Assumption: BS refers to BeautifulSoup from the bs4 package.
    from bs4 import BeautifulSoup as BS


    def get_authors(author_html):
        """Takes an author list and returns the APA citation."""
        author_num = 0
        authors_list = []
        et_al = ''
        for author in author_html:
            # Stop after seven authors and mark the rest as "et al."
            if author_num == 7:
                et_al = '., et al.'
                break
            author = author.string
            # Skip e-mail links and anything that is not "Last, First"
            if '.' in author or '@' in author or len(author.split(',')) != 2:
                continue
            last, first = author.split(',')
            # first[0] is the space after the comma, first[1] the initial
            authors_list.append(last + ', ' + first[1])
            author_num += 1
        return '., '.join(authors_list) + et_al


    def scrape(html):
        """Returns as much info as possible about the html."""
        soup = BS(html, 'html.parser').find(id='citationFields')
        if soup is None:
            raise RuntimeError("Could not find the page")
        output = {}
        try:
            output['Title'] = "".join(
                child.string
                for child in soup.find(class_='citation-title').span.children
                if child.string is not None)
        except Exception:
            pass
        try:
            output['Authors'] = get_authors(
                soup.find(string='Authors:').parent.next_sibling('a'))
        except Exception:
            pass
        try:
            output['Journal'] = soup.find(
                string='Source:').parent.next_sibling.find('a').string
        except Exception:
            pass
        try:
            output['Abstract'] = "".join("".join(
                child.string
                for child in soup.find(string='Abstract:').parent.next_sibling.children
                if child.string is not None).split(' (PsycINFO')[:-1])
        except Exception:
            pass
        try:
            output['Year'] = soup.find(
                string='Release Date:').parent.next_sibling.string[0:4]
        except Exception:
            pass
        return output

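One direction that might reduce the repetition is to keep a single try block in a loop and describe each field as a small function. The following is only a minimal sketch under assumptions: `extract_fields`, `extractors`, and `record` are hypothetical names, and a plain dict stands in for the parsed page so the example runs without any third-party package.

```python
def extract_fields(extractors, source):
    """Run each extractor on `source` and keep only the ones that succeed."""
    output = {}
    for name, extractor in extractors.items():
        try:
            output[name] = extractor(source)
        except (AttributeError, TypeError, IndexError, KeyError):
            # Typical failures when a lookup returns None or a field is absent.
            continue
    return output


# Stand-in data: a plain dict plays the role of the parsed page here.
record = {"title": "A pilot study", "release": "20150706"}

# Hypothetical extractors; with the real page these would wrap the soup lookups.
extractors = {
    "Title": lambda r: r["title"],
    "Year": lambda r: r["release"][0:4],
    "Journal": lambda r: r["source"].strip(),  # "source" is missing, so skipped
}

result = extract_fields(extractors, record)
print(result)
```

Catching a short list of expected exception types instead of `Exception` is a design choice in this sketch: it still skips missing fields, but a genuine programming error such as a misspelled name would surface instead of being silently swallowed.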
Here is the HTML I'm passing in. Sorry it's such a mess; the website I pulled it from gave it to me that way.
<div class="citation-wrapping-div" data-auto="citation"><h2 class="hidden" data-auto="citation_heading_hidden" xmlns:viewExtensions="http://www.ebscohost.com/schema/viewExtensions">Detailed Record</h2><dl id="citationFields" class="citation-fields" data-auto="citation_fields" xmlns:viewExtensions="http://www.ebscohost.com/schema/viewExtensions"><dt data-auto="citation_field_label" class="title-label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Title:</dt><dd class="citation-title color-s4" data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController"><a name="citation" data-auto="citation_title"><span>Near-infrared spectroscopy (NIRS) neurofeedback as a treatment for children with attention deficit hyperactivity disorder (ADHD)—A pilot study.<img src="http://imageserver.ebscohost.com.ezproxy.oberlin.edu/branding/j_st/icon_OpenAccess_PLOS.jpg" alt="Open Access" align="right" /></span></a></dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Authors:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController"><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Marx%2C%20Anna-Maria%22%7C%7Csl~~rl','');" title="Search for Marx, Anna-Maria" id="linkMarxAnna-Maria">Marx, Anna-Maria</a>. Institute for Medical Psychology and Behavioral Neurobiology, University of Tuebingen, Tuebingen, Germany, <a data-auto="ep_link" href="mailto:[email protected]" id="[email protected]" title="[email protected]" data-title="[email protected]">[email protected]</a> <br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Ehlis%2C%20Ann-Christine%22%7C%7Csl~~rl','');" title="Search for Ehlis, Ann-Christine" id="linkEhlisAnn-Christine">Ehlis, Ann-Christine</a>. 
Department of Psychiatry and Psychotherapy, Psychophysiology and Optical Imaging, University of Tuebingen, Tuebingen, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Furdea%2C%20Adrian%22%7C%7Csl~~rl','');" title="Search for Furdea, Adrian" id="linkFurdeaAdrian">Furdea, Adrian</a>. Institute for Medical Psychology and Behavioral Neurobiology, University of Tuebingen, Tuebingen, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Holtmann%2C%20Martin%22%7C%7Csl~~rl','');" title="Search for Holtmann, Martin" id="linkHoltmannMartin">Holtmann, Martin</a>. LWL-University Hospital for Child and Adolescent Psychiatry, Ruhr-University Bochum, Hamm, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Banaschewski%2C%20Tobias%22%7C%7Csl~~rl','');" title="Search for Banaschewski, Tobias" id="linkBanaschewskiTobias">Banaschewski, Tobias</a>. Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Brandeis%2C%20Daniel%22%7C%7Csl~~rl','');" title="Search for Brandeis, Daniel" id="linkBrandeisDaniel">Brandeis, Daniel</a>. Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Rothenberger%2C%20Aribert%22%7C%7Csl~~rl','');" title="Search for Rothenberger, Aribert" id="linkRothenbergerAribert">Rothenberger, Aribert</a>. 
Clinic for Child and Adolescent Psychiatry, University Medical Center of Goettingen, Goettingen, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Gevensleben%2C%20Holger%22%7C%7Csl~~rl','');" title="Search for Gevensleben, Holger" id="linkGevenslebenHolger">Gevensleben, Holger</a>. Clinic for Child and Adolescent Psychiatry, University Medical Center of Goettingen, Goettingen, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Freitag%2C%20Christine%20M.%22%7C%7Csl~~rl','');" title="Search for Freitag, Christine M." id="linkFreitagChristineM.">Freitag, Christine M.</a>. Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, Goethe-University Frankfurt am Main, Frankfurt am Main, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Fuchsenberger%2C%20Yvonne%22%7C%7Csl~~rl','');" title="Search for Fuchsenberger, Yvonne" id="linkFuchsenbergerYvonne">Fuchsenberger, Yvonne</a>. Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, Goethe-University Frankfurt am Main, Frankfurt am Main, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Fallgatter%2C%20Andreas%20J.%22%7C%7Csl~~rl','');" title="Search for Fallgatter, Andreas J." id="linkFallgatterAndreasJ.">Fallgatter, Andreas J.</a>. Department of Psychiatry and Psychotherapy, Psychophysiology and Optical Imaging, University of Tuebingen, Tuebingen, Germany<br /><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~AR%20%22Strehl%2C%20Ute%22%7C%7Csl~~rl','');" title="Search for Strehl, Ute" id="linkStrehlUte">Strehl, Ute</a>. 
Institute for Medical Psychology and Behavioral Neurobiology, University of Tuebingen, Tuebingen, Germany, <a data-auto="ep_link" href="mailto:[email protected]" id="[email protected]" title="[email protected]" data-title="[email protected]">[email protected]</a> </dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Address:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Marx, Anna-Maria, Institute for Medical Psychology and Behavioral Neurobiology, University of Tuebingen, Silcherstr. 5, 72076, Tuebingen, Germany, <a data-auto="ep_link" href="mailto:[email protected]" id="[email protected]" title="[email protected]" data-title="[email protected]">[email protected]</a> </dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Source:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController"><a data-auto="link" href="javascript:__doLinkPostBack('','ss~~JN%20%22Frontiers%20in%20Human%20Neuroscience%22%7C%7Csl~~rl','');" title="Search for Frontiers in Human Neuroscience" id="linkFrontiersinHumanNeuroscience">Frontiers in Human Neuroscience</a>, Vol 8, Jan 7, 2015. 
ArtID: 1038</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">NLM Title Abbreviation:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Front Hum Neurosci</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Publisher:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Switzerland : Frontiers Media S.A.</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Other Publishers:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Switzerland : Frontiers Research Foundation</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">ISSN:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">1662-5161 (Electronic)</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Language:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">English</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Keywords:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">near-infrared spectroscopy (NIRS), fNIRS, neurofeedback, attention deficit hyperactivity disorder (ADHD), children, prefrontal cortex</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Abstract:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">In this pilot study near-infrared spectroscopy (NIRS) neurofeedback was investigated as a new method for the treatment of Attention Deficit-/Hyperactivity Disorder (ADHD). 
Oxygenated hemoglobin in the prefrontal cortex of children with ADHD was measured and fed back. 12 sessions of NIRS-neurofeedback were compared to the intermediate outcome after 12 sessions of EEG-neurofeedback (slow cortical potentials, SCP) and 12 sessions of EMG-feedback (muscular activity of left and right musculus supraspinatus). The task was either to increase or decrease hemodynamic activity in the prefrontal cortex (NIRS), to produce positive or negative shifts of SCP (EEG) or to increase or decrease muscular activity (EMG). In each group nine children with ADHD, aged 7–10 years, took part. Changes in parents’ ratings of ADHD symptoms were assessed before and after the 12 sessions and compared within and between groups. For the NIRS-group additional teachers’ ratings of ADHD symptoms, parents’ and teachers’ ratings of associated behavioral symptoms, childrens’ self reports on quality of life and a computer based attention task were conducted before, 4 weeks and 6 months after training. As primary outcome, ADHD symptoms decreased significantly 4 weeks and 6 months after the NIRS training, according to parents’ ratings. In teachers’ ratings of ADHD symptoms there was a significant reduction 4 weeks after the training. The performance in the computer based attention test improved significantly. Within-group comparisons after 12 sessions of NIRS-, EEG- and EMG-training revealed a significant reduction in ADHD symptoms in the NIRS-group and a trend for EEG- and EMG-groups. No significant differences for symptom reduction were found between the groups. Despite the limitations of small groups and the comparison of a completed with two uncompleted interventions, the results of this pilot study are promising. NIRS-neurofeedback could be a time-effective treatment for ADHD and an interesting new option to consider in the treatment of ADHD. 
(PsycINFO Database Record (c) 2016 APA, all rights reserved)</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Document Type:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Journal Article</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Subjects:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">*<a data-auto="link" href="javascript:__doLinkPostBack('','ss~~DE%20%22Attention%20Deficit%20Disorder%20with%20Hyperactivity%22%7C%7Csl~~rl','');" title="Search for Attention Deficit Disorder with Hyperactivity" id="linkAttentionDeficitDisorderwithHyperactivity">Attention Deficit Disorder with Hyperactivity</a>; *<a data-auto="link" href="javascript:__doLinkPostBack('','ss~~DE%20%22Neurotherapy%22%7C%7Csl~~rl','');" title="Search for Neurotherapy" id="linkNeurotherapy">Neurotherapy</a>; <a data-auto="link" href="javascript:__doLinkPostBack('','ss~~DE%20%22Childhood%20Development%22%7C%7Csl~~rl','');" title="Search for Childhood Development" id="linkChildhoodDevelopment">Childhood Development</a>; <a data-auto="link" href="javascript:__doLinkPostBack('','ss~~DE%20%22Hyperkinesis%22%7C%7Csl~~rl','');" title="Search for Hyperkinesis" id="linkHyperkinesis">Hyperkinesis</a>; <a data-auto="link" href="javascript:__doLinkPostBack('','ss~~DE%20%22Prefrontal%20Cortex%22%7C%7Csl~~rl','');" title="Search for Prefrontal Cortex" id="linkPrefrontalCortex">Prefrontal Cortex</a>; <a data-auto="link" href="javascript:__doLinkPostBack('','ss~~DE%20%22Spectroscopy%22%7C%7Csl~~rl','');" title="Search for Spectroscopy" id="linkSpectroscopy">Spectroscopy</a></dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">PsycINFO Classification:</dt><dd data-auto="citation_field_value" 
xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Developmental Disorders & Autism (3250)</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Population:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Human<br />Male<br />Female</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Location:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Germany</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Age Group:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Childhood (birth-12 yrs)<br />School Age (6-12 yrs)</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Tests & Measures:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Semi-Structured Interview Kiddie-Sads-Present and Lifetime Version<br />Rating scale for ADHD<br />Test Battery for Attentional Performance<br />Kindl-Questionnaire for Health-Related Quality of Life<br />Child Behavior Checklist<br />Health-related Quality of Life Scale DOI: 10.1037/t31130-000<br />Raven Coloured Progressive Matrices<br />Clinical Global Impression Scale<br />Strengths and Difficulties Questionnaire DOI: 10.1037/t00540-000</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Grant Sponsorship:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Sponsor: German Federal Ministry for Education and Research, Bernstein Computational Neuroscience Program, Germany<br />Grant Number: 01GQ0831<br />Recipients: No recipient indicated<br /><br />Sponsor: Deutsche Forschungsgemeinschaft, Germany<br />Grant Number: 
HO 2503/4-1; BI 195/69-1<br />Other Details: SCP and EMG-feedback groups<br />Recipients: No recipient indicated<br /><br />Sponsor: Deutsche Forschungsgemeinschaft, Germany<br />Recipients: No recipient indicated<br /><br />Sponsor: University of Tuebingen, Germany<br />Other Details: Open Access Publishing Fund<br />Recipients: No recipient indicated</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Methodology:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Empirical Study; Interview; Quantitative Study</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Format Covered:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Electronic</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Publication Type:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Journal; Peer Reviewed Journal</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Publication History:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">First Posted: Jan 7, 2015; Accepted: Dec 11, 2014; First Submitted: Sep 30, 2014</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Release Date:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">20150706</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Correction Date:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">20160919</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Copyright:</dt><dd 
data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">This is an openaccess article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.. Marx, Ehlis, Furdea, Holtmann, Banaschewski, Brandeis, Rothenberger, Gevensleben, Freitag, Fuchsenberger, Fallgatter and Strehl. 2015</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Digital Object Identifier:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController"><a data-auto="ep_link" href="http://dx.doi.org.ezproxy.oberlin.edu/10.3389/fnhum.2014.01038" target="_blank" id="linkhttp:dx.doi.org10.3389fnhum.2014.01038" title="http://dx.doi.org/10.3389/fnhum.2014.01038" data-title="http://dx.doi.org/10.3389/fnhum.2014.01038">http://dx.doi.org/10.3389/fnhum.2014.01038</a> </dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">PMID:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">25610390</dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Accession Number:</dt><dd data-auto="citation_field_value" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController"><strong data-auto="strong_text" xmlns:Translation="urn:EBSCO-Translation">2015-26061-001</strong></dd><dt data-auto="citation_field_label" xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">Number of Citations in Source:</dt><dd data-auto="citation_field_value" 
xmlns:ExtendedMarkupController="urn:ExtendedMarkupController">52</dd></dl></div>
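Since the markup above is a definition list where each `<dt>` label is followed by its `<dd>` value, the pairs could in principle be collected into a dictionary first and looked up with `dict.get`, so that a missing field is just a missing key. The sketch below is an illustration under assumptions, not the scraper itself: it swaps in the standard library's `html.parser` so that it is self-contained, `DefinitionListParser` is a hypothetical name, and `sample` is a heavily shortened stand-in for the real page.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect the text of each <dt> label and the <dd> value that follows it."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._label = None    # text of the most recent <dt>, without the colon
        self._current = None  # "dt" or "dd" while inside one, otherwise None
        self._chunks = []     # text pieces gathered for the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Closing tags of nested elements (<a>, <span>, <br />) are ignored.
        if tag != self._current:
            return
        text = "".join(self._chunks).strip()
        if tag == "dt":
            self._label = text.rstrip(":")
        elif self._label is not None:
            self.fields[self._label] = text
        self._current = None


# Shortened stand-in for the real page; only two fields are present.
sample = (
    '<dl id="citationFields">'
    '<dt>Title:</dt><dd><a><span>Near-infrared spectroscopy</span></a></dd>'
    '<dt>Release Date:</dt><dd>20150706</dd>'
    '</dl>'
)

parser = DefinitionListParser()
parser.feed(sample)
parser.close()

fields = parser.fields
year = fields.get("Release Date", "")[:4]
journal = fields.get("Journal")  # absent in the sample, so this is None
print(fields, year, journal)
```

With the fields in a dictionary, per-field cleanup (such as trimming the PsycINFO notice from the abstract or formatting the authors) would operate on plain strings, which may need fewer guards than chained attribute lookups.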