
A non-well-formed HTML Parser and CSS Resolver


2.86/5 (14 votes)

Mar 21, 2007


A parser for non-well-formed HTML and a CSS resolver, built in pure .NET C#

Download DOLS_HTML.zip - 364.6 KB (10:52, 07/21/2007, GMT +8)

demo:
Screenshot - demo.jpg

The demo program is deliberately simple: it only demonstrates what the library does.
It is similar to the demo program of MIL HTML Parser (http://www.codeproject.com/dotnet/apmilhtml.asp).

Introduction

This library produces a tree, similar to a DOM tree, from a given non-well-formed HTML document,
allowing the developer to read, compose, and modify the tree in a methodical way.
The library is based on MIL HTML Parser. I have tried to improve its handling of codepage
encoding, its tolerance of missing tags, its CSS resolver, and its efficiency.
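
The tolerance of missing tags mentioned above can be illustrated with a minimal sketch. It is not the library's actual algorithm; it only shows one common approach, assuming a stack of open elements: a closing tag closes its nearest matching open element together with anything left open inside it, and a closing tag with no match is ignored. The names SketchNode and TolerantTreeSketch are hypothetical, and the sketch handles only plain tags and text (attributes, comments, and void elements such as <br> are not treated specially).

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical node type for illustration; not the library's DHtmlNode.
class SketchNode
{
    public string Tag;   // null for text nodes
    public string Text;  // null for element nodes
    public List<SketchNode> Children = new List<SketchNode>();
}

static class TolerantTreeSketch
{
    // Group 1: "/" for a closing tag, group 2: tag name, group 3: a run of text.
    static readonly Regex Token =
        new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)");

    public static SketchNode Parse(string html)
    {
        var root = new SketchNode { Tag = "#root" };
        var open = new List<SketchNode> { root };  // stack of open elements

        foreach (Match m in Token.Matches(html))
        {
            SketchNode current = open[open.Count - 1];
            if (m.Groups[3].Success)
            {
                // Text goes under the innermost open element.
                current.Children.Add(new SketchNode { Text = m.Groups[3].Value });
            }
            else if (m.Groups[1].Value == "")
            {
                // Opening tag: attach it and make it the innermost open element.
                var element = new SketchNode { Tag = m.Groups[2].Value.ToLowerInvariant() };
                current.Children.Add(element);
                open.Add(element);
            }
            else
            {
                // Closing tag: find the nearest open element with the same name.
                string name = m.Groups[2].Value.ToLowerInvariant();
                int i = open.FindLastIndex(n => n.Tag == name);
                // Found: close it and everything still open inside it.
                // Not found: a stray closing tag, which is simply ignored.
                if (i > 0) open.RemoveRange(i, open.Count - i);
            }
        }
        return root;
    }

    // Serializes the tree back to markup with every element properly closed.
    public static string Render(SketchNode node)
    {
        if (node.Text != null) return node.Text;
        var sb = new StringBuilder();
        foreach (SketchNode child in node.Children) sb.Append(Render(child));
        if (node.Tag == "#root") return sb.ToString();
        return "<" + node.Tag + ">" + sb + "</" + node.Tag + ">";
    }

    static void Main()
    {
        // The <i> element is never closed in the input.
        Console.WriteLine(Render(Parse("<b><i>text</b>tail")));
        // prints: <b><i>text</i></b>tail
    }
}
```

Keeping the open elements in a list rather than a Stack<T> makes it easy to search from the innermost element outwards and to drop several entries at once.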

Background

This library was written to avoid having to convert non-well-formed HTML
into XML before reading it, while preserving its distinct HTML qualities.

Using the code

// Requires: using System.Text; (for StringBuilder)

// Open HTML file "Google News.htm"
DOL.DHtml.DHtmlParser.DHtmlGeneralParser parser = new DOL.DHtml.DHtmlParser.DHtmlGeneralParser();
DOL.DHtml.DHtmlParser.DHtmlDocument htmlDoc = new DOL.DHtml.DHtmlParser.DHtmlDocument(parser);
htmlDoc.Load(@"..\Google News.htm");

// You can modify the HTML tree with htmlDoc.Nodes
htmlDoc.Save(@"..\Rebuild.htm");

// Dump the information about the HTML tree to the IDE debug output window
StringBuilder builder = new StringBuilder();
htmlDoc.Dump(builder, "");
System.Diagnostics.Debug.Write("\n" + builder.ToString());

Debug Output information
├Object DHtmlDocument Dump :
│ DHtmlNode number: 6
│ Deep dump in the following:
│ │
│ ├Object DHtmlComment Dump :
│ │ Node ID: 1
│ │ Comment content:
================================================
DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
================================================
│ │
│ ├Object DHtmlText Dump :
│ │ Node ID: 2
│ │ Text content is white space
│ │
│ ├Object DHtmlComment Dump :
│ │ Node ID: 3
│ │ Comment content:
================================================
saved from url=(0033)http://www.google.com/news?ned=us
================================================
│ │
│ ├Object DHtmlText Dump :
│ │ Node ID: 4
│ │ Text content is white space
│ │
│ ├Object DHtmlElement Dump :
│ │ Node ID: 5
│ │ HTML Tag: <html>
│ │ DHtmlNode number: 3
│ │ Child Object deep dump in the following:
│ │ │
│ │ ├Object DHtmlElement Dump :
│ │ │ Node ID: 6
│ │ │ HTML Tag: <head>
│ │ │ DHtmlNode number: 30
│ │ │ Child Object deep dump in the following:
│ │ │ │
│ │ │ ├Object DHtmlElement Dump :
│ │ │ │ Node ID: 7
│ │ │ │ HTML Tag: <title>
│ │ │ │ DHtmlNode number: 1
│ │ │ │ Child Object deep dump in the following:
│ │ │ │ │
│ │ │ │ ├Object DHtmlText Dump :
│ │ │ │ │ Node ID: 8
│ │ │ │ │ Text content: "Google News"
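
The indented layout of the dump above, together with the (builder, prefix) arguments passed to Dump, suggests a recursive walk in which each level extends the prefix before visiting its children. The following is a minimal sketch of that idea only; DumpNode and DumpSketch are hypothetical names, and the library's own Dump implementation may work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical node type for illustration; not the library's DHtmlNode.
class DumpNode
{
    public string Label;
    public List<DumpNode> Children = new List<DumpNode>();

    public DumpNode(string label, params DumpNode[] children)
    {
        Label = label;
        Children.AddRange(children);
    }

    // Appends this node and its whole subtree to builder, one line per node.
    // Each level of depth adds one "│ " segment to the prefix.
    public void Dump(StringBuilder builder, string prefix)
    {
        builder.Append(prefix).Append("├").Append(Label).Append('\n');
        foreach (DumpNode child in Children)
            child.Dump(builder, prefix + "│ ");
    }
}

static class DumpSketch
{
    public static string Build()
    {
        var tree = new DumpNode("html",
            new DumpNode("head",
                new DumpNode("title",
                    new DumpNode("\"Google News\""))));
        var builder = new StringBuilder();
        tree.Dump(builder, "");  // start with an empty prefix at the root
        return builder.ToString();
    }

    static void Main()
    {
        Console.Write(Build());
    }
}
```

Passing one StringBuilder down the recursion avoids building and concatenating a separate string for every subtree.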

Structural diagram

HTML Parser

CSS Resolver

History

  • 2007/07/21 Modified DHtmlTextProcessor to create a new StringBuilder instance in each method that needs one
  • 2007/05/13 Added structural diagram
  • 2007/05/01 Improved tolerance of attribute structure errors
  • 2007/04/29 Fixed a bug related to missing tags
  • 2007/03/28 Updated demo program (added CSS Resolver demo)
  • 2007/03/27 Fixed a bug in the initialization of DHtmlElement
  • 2007/03/26
    1. New demo program
    2. Added support for the Visitor pattern in the node hierarchy
  • 2007/03/22 Initial release