A non-well-formed HTML Parser and CSS Resolver

James S.F. Hsieh

Mar 21, 2007

A parser for non-well-formed HTML and a CSS resolver, built in pure .NET C#

Download DOLS_HTML.zip - 364.6 KB (10:52, 07/21/2007, GMT +8)

demo:
Screenshot - demo.jpg

The demo program is deliberately simple: it only demonstrates what the library can do.
It is similar to the demo program of the MIL HTML Parser (http://www.codeproject.com/dotnet/apmilhtml.asp).

Introduction

This library produces a DOM-like tree from a given non-well-formed HTML document,
allowing the developer to read, compose, and modify the tree in a methodical way.
The library is based on the MIL HTML Parser, and I have tried to improve on it in four areas:
codepage encoding, tolerance of missing tags, the CSS resolver, and efficiency.
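To make "tolerance of missing tags" concrete, here is a minimal sketch of one common way to build a tree from markup whose tags do not balance. It is not the library's implementation: SketchNode and TolerantParserSketch are hypothetical names invented for this illustration, attributes are ignored, and comments or scripts that contain '>' are not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for illustration; not the library's DHtmlNode.
public class SketchNode
{
    public string Tag;   // null for text nodes
    public string Text;  // null for element nodes
    public List<SketchNode> Children = new List<SketchNode>();

    public SketchNode(string tag, string text) { Tag = tag; Text = text; }
}

public static class TolerantParserSketch
{
    // Elements that never take a closing tag (abbreviated list, an assumption).
    static readonly List<string> VoidTags =
        new List<string>(new string[] { "br", "hr", "img", "input", "meta", "link" });

    public static SketchNode Parse(string html)
    {
        SketchNode root = new SketchNode("#root", null);
        List<SketchNode> open = new List<SketchNode>(); // stack of open elements
        open.Add(root);
        int pos = 0;
        while (pos < html.Length)
        {
            int lt = html.IndexOf('<', pos);
            int gt = lt < 0 ? -1 : html.IndexOf('>', lt);
            if (lt < 0 || gt < 0)
            {
                // No further complete tag: everything left is text.
                AddText(open, html.Substring(pos));
                break;
            }
            if (lt > pos) AddText(open, html.Substring(pos, lt - pos));
            string inner = html.Substring(lt + 1, gt - lt - 1).Trim();
            pos = gt + 1;
            // Skip empty tags, declarations, and comments in this sketch.
            if (inner.Length == 0 || inner[0] == '!' || inner[0] == '?') continue;

            if (inner[0] == '/')
            {
                string closeName = inner.Substring(1).Trim().ToLowerInvariant();
                // Look down the stack for the nearest matching open element.
                int i = open.Count - 1;
                while (i > 0 && open[i].Tag != closeName) i--;
                // Found: pop it together with any unclosed descendants.
                // Not found (i == 0): a stray closing tag, silently ignored.
                if (i > 0) open.RemoveRange(i, open.Count - i);
            }
            else
            {
                // The tag name ends at the first whitespace or '/'.
                int end = 0;
                while (end < inner.Length && !char.IsWhiteSpace(inner[end]) && inner[end] != '/') end++;
                string name = inner.Substring(0, end).ToLowerInvariant();
                if (name.Length == 0) continue;
                SketchNode node = new SketchNode(name, null);
                open[open.Count - 1].Children.Add(node);
                bool selfClosed = inner[inner.Length - 1] == '/';
                if (!selfClosed && !VoidTags.Contains(name)) open.Add(node);
            }
        }
        // Elements still open at the end of input are closed implicitly.
        return root;
    }

    static void AddText(List<SketchNode> open, string text)
    {
        if (text.Length == 0) return;
        open[open.Count - 1].Children.Add(new SketchNode(null, text));
    }
}
```

With the input `<p>one<b>two</p><p>three`, the missing `</b>` is closed implicitly when `</p>` arrives, and the final `<p>` is closed at the end of input, so the root ends up with two `p` children. A real parser would also need per-element rules (for example, a new `<p>` closing an open `<p>`), which this sketch leaves out.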

Background

This library was written to avoid having to convert non-well-formed HTML
into XML before reading it, while preserving the qualities that are distinct to HTML.

Using the code

// Requires: using System.Text; (for StringBuilder)

// Open the HTML file "Google News.htm"
DOL.DHtml.DHtmlParser.DHtmlGeneralParser parser = new DOL.DHtml.DHtmlParser.DHtmlGeneralParser();
DOL.DHtml.DHtmlParser.DHtmlDocument htmlDoc = new DOL.DHtml.DHtmlParser.DHtmlDocument(parser);
htmlDoc.Load(@"..\Google News.htm");

// You can modify the HTML tree through htmlDoc.Nodes
htmlDoc.Save(@"..\Rebuild.htm");

// Dump information about the HTML tree to the IDE's debug output window
StringBuilder builder = new StringBuilder();
htmlDoc.Dump(builder, "");
System.Diagnostics.Debug.Write("\n" + builder.ToString());

Debug Output information
├Object DHtmlDocument Dump :
│　DHtmlNode number: 6
│　Deep dump in the following:
│　│
│　├Object DHtmlComment Dump :
│　│　Node ID: 1
│　│　Comment content:
================================================
DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
================================================
│　│
│　├Object DHtmlText Dump :
│　│　Node ID: 2
│　│　Text content is white space
│　│
│　├Object DHtmlComment Dump :
│　│　Node ID: 3
│　│　Comment content:
================================================
saved from url=(0033)http://www.google.com/news?ned=us
================================================
│　│
│　├Object DHtmlText Dump :
│　│　Node ID: 4
│　│　Text content is white space
│　│
│　├Object DHtmlElement Dump :
│　│　Node ID: 5
│　│　HTML Tag: <html>
│　│　DHtmlNode number: 3
│　│　Child Object deep dump in the following:
│　│　│
│　│　├Object DHtmlElement Dump :
│　│　│　Node ID: 6
│　│　│　HTML Tag: <head>
│　│　│　DHtmlNode number: 30
│　│　│　Child Object deep dump in the following:
│　│　│　│
│　│　│　├Object DHtmlElement Dump :
│　│　│　│　Node ID: 7
│　│　│　│　HTML Tag: <title>
│　│　│　│　DHtmlNode number: 1
│　│　│　│　Child Object deep dump in the following:
│　│　│　│　│
│　│　│　│　├Object DHtmlText Dump :
│　│　│　│　│　Node ID: 8
│　│　│　│　│　Text content: "Google News"
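The nested layout of this output suggests a recursive dump in which each level extends a prefix string, which would match the (builder, prefix) arguments passed to htmlDoc.Dump above. The following is only a sketch of that idea under that assumption: DumpNode is a hypothetical stand-in, not one of the library's node classes, and its output is far terser than the real dump.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical node type; stands in for the library's node classes.
public class DumpNode
{
    public string Label;
    public List<DumpNode> Children = new List<DumpNode>();

    public DumpNode(string label) { Label = label; }

    // Appends one line for this node, then recurses into the children
    // with the prefix extended by one tree level.
    public void Dump(StringBuilder builder, string prefix)
    {
        builder.Append(prefix).Append("├").Append(Label).Append('\n');
        foreach (DumpNode child in Children)
        {
            child.Dump(builder, prefix + "│ ");
        }
    }
}
```

For a root labelled html with a single child head, which in turn has a single child title, calling Dump(builder, "") on the root appends three lines: `├html`, `│ ├head`, and `│ │ ├title`.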

Structural diagram

HTML Parser

CSS Resolver

History

2007/07/21 Modified DHtmlTextProcessor so that each method that needs a StringBuilder creates its own instance
2007/05/13 Added structural diagram
2007/05/01 Improved tolerance of attribute structure errors
2007/04/29 Fixed a bug related to missing tags
2007/03/28 Updated demo program (added CSS Resolver demo)
2007/03/27 Fixed a bug in the initialization of DHtmlElement
2007/03/26
1. New demo program
2. Added support for the Visitor pattern in the node hierarchy
2007/03/22 Initial release