\$\begingroup\$

I am building a PowerShell class-based tool that works with XML files of up to a few hundred KB each, and up to a hundred of these files. My validation code is extensive and time consuming, so I want to validate the XML and, if it is valid, hash it and write that hash as an attribute on the root element. My code can then hash on load, compare against that stored value, and skip validation when the two match. This is a pretty major speed improvement. To that end, I have this static method in my XML class.

    static [String] GetHash ([Xml.XmlDocument]$xmlDocument) {
        $stringBuilder = [Text.StringBuilder]::new()
        $hashBuilder = [Security.Cryptography.MD5CryptoServiceProvider]::new() #[Security.Cryptography.SHA384CryptoServiceProvider]
        $encoding = [Text.Encoding]::UTF8

        $xmlToHash = [Xml.XmlDocument]::new()
        $xmlToHash.AppendChild($xmlToHash.ImportNode($xmlDocument.DocumentElement, $true))
        $string = [pxXML]::ToString($xmlToHash)

        $hashIntermediate = $hashBuilder.ComputeHash($encoding.GetBytes($string))
        foreach ($intermediate in $hashIntermediate) {
            $stringBuilder.Append($intermediate.ToString('x2'))
        }
        return $stringBuilder.ToString()
    }

As well as this static .ToString() method that is used by the hasher and also used for writing results to the console in testing.

    static [String] ToString ([Xml]$Xml) {
        $toString = [System.Collections.Generic.List[String]]::new()

        $stringWriter = [IO.StringWriter]::new()
        $xmlWriter = [Xml.XmlTextWriter]::new($stringWriter)
        $xmlWriter.Formatting = "indented"
        $xmlWriter.Indentation = 4
        $Xml.WriteTo($xmlWriter)
        $xmlWriter.Flush()
        $stringWriter.Flush()

        $toString.Add($stringWriter.ToString())
        $toString.Add("`r" * 2)
        return [string]::Join("`n", $toString)
    }

As you might notice from the commented part of the $hashBuilder line, I have tested this with different algorithms, and there is little difference in performance, which is a bit of a surprise. So I wonder whether there is something else here that I am missing that is the (or an) actual bottleneck. Or is this code already reasonably close to optimal performance? Given that I am totally new to classes and to digging this deep into .NET, I suspect there is a bit to learn here beyond just optimizing this code.

\$\endgroup\$

    1 Answer

    \$\begingroup\$

    In general I don't really see the value in not defining an XSD or DTD and validating against it when loading the document with an XmlReader. You should be able to create one XmlSchemaSet and compile it once. That instance can then be reused to validate every document without reloading the schema, and that should perform well. That feels like the most correct solution: if you want valid data, the most correct thing to do is to validate it, and that covers every situation where your goal is to have valid data.
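A minimal sketch of that approach follows, assuming a schema can actually be written for the files in question. The inline XSD, the `settings`/`item` element names, and the `Test-XmlAgainstSchema` helper are hypothetical and exist only to make the example self-contained. The schema set is compiled once and then shared by every validating reader.

```powershell
# Hypothetical schema, inlined so the sketch runs on its own.
$xsd = @'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="settings">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:int" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
'@

# Read and compile the schema once; the same set is reused for every document.
$schema = [System.Xml.Schema.XmlSchema]::Read([System.IO.StringReader]::new($xsd), $null)
$schemaSet = [System.Xml.Schema.XmlSchemaSet]::new()
[void]$schemaSet.Add($schema)
$schemaSet.Compile()

# Hypothetical helper: returns $true when the XML text validates against the set.
function Test-XmlAgainstSchema {
    param(
        [Parameter(Mandatory)][string]$XmlText,
        [Parameter(Mandatory)][System.Xml.Schema.XmlSchemaSet]$SchemaSet
    )
    $settings = [System.Xml.XmlReaderSettings]::new()
    $settings.ValidationType = [System.Xml.ValidationType]::Schema
    $settings.Schemas = $SchemaSet

    $reader = [System.Xml.XmlReader]::Create([System.IO.StringReader]::new($XmlText), $settings)
    try {
        # With no ValidationEventHandler attached, the first validation error throws.
        while ($reader.Read()) { }
        return $true
    }
    catch {
        # Schema violations and malformed XML are both treated as invalid here.
        return $false
    }
    finally {
        $reader.Dispose()
    }
}

Test-XmlAgainstSchema -XmlText '<settings><item>1</item></settings>' -SchemaSet $schemaSet
Test-XmlAgainstSchema -XmlText '<settings><item>abc</item></settings>' -SchemaSet $schemaSet
```

For files on disk, the same settings object can be passed to `[System.Xml.XmlReader]::Create($path, $settings)`. Whether this beats custom validation code depends on how much of that validation a schema can express.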

    What you're doing here feels like a home-grown XML digital signature, so the method raises conceptual concerns for me. If I came across this code, I'd be confused about what it was trying to accomplish.

    If you just want to be able to verify what files you've previously validated, I'd prefer just maintaining a CSV file or database with the output of Get-FileHash and some file metadata. That seems much less complicated and doesn't require modifying the XML files.
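As a rough sketch of that idea, the helpers below keep a path-to-hash table in memory and round-trip it through a CSV file. All four function names, the `Path`/`Hash` column names, and the choice of SHA256 are assumptions made for illustration; only `Get-FileHash`, `Export-Csv`, and `Import-Csv` are built-in cmdlets.

```powershell
# Hypothetical helper: record the current hash of a file that just passed validation.
function Add-ValidatedFile {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][hashtable]$Cache
    )
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $Cache[$fullPath] = (Get-FileHash -LiteralPath $fullPath -Algorithm SHA256).Hash
}

# Hypothetical helper: $true when the file is unchanged since it was recorded.
function Test-FileAlreadyValidated {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][hashtable]$Cache
    )
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $hash = (Get-FileHash -LiteralPath $fullPath -Algorithm SHA256).Hash
    return ($Cache.ContainsKey($fullPath) -and $Cache[$fullPath] -eq $hash)
}

# Hypothetical helpers: persist the cache as a two-column CSV and load it back.
function Export-ValidationCache {
    param([Parameter(Mandatory)][hashtable]$Cache, [Parameter(Mandatory)][string]$CsvPath)
    $Cache.GetEnumerator() |
        ForEach-Object { [pscustomobject]@{ Path = $_.Key; Hash = $_.Value } } |
        Export-Csv -LiteralPath $CsvPath -NoTypeInformation
}

function Import-ValidationCache {
    param([Parameter(Mandatory)][string]$CsvPath)
    $cache = @{}
    if (Test-Path -LiteralPath $CsvPath) {
        Import-Csv -LiteralPath $CsvPath | ForEach-Object { $cache[$_.Path] = $_.Hash }
    }
    return $cache
}
```

On load, a file would be validated only when `Test-FileAlreadyValidated` returns `$false`, and `Add-ValidatedFile` would be called once validation succeeds. The XML files themselves are never modified.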


    However, taking the code at face value, you may see some performance improvement from changes like these to the first method:

        static [String] GetHash ([Xml.XmlDocument]$xmlDocument) {
            $stringBuilder = [System.Text.StringBuilder]::new(32)
            $hashBuilder = [System.Security.Cryptography.HashAlgorithm]::Create('md5')

            $xmlToHash = [Xml.XmlDocument]::new()
            $xmlToHash.AppendChild($xmlToHash.ImportNode($xmlDocument.DocumentElement, $true))

            # Hash the UTF-8 bytes of the serialized XML, then append each byte as two hex digits.
            $hashBuilder.ComputeHash([System.Text.Encoding]::UTF8.GetBytes([pxXML]::ToString($xmlToHash))).ForEach({
                $stringBuilder.Append($_.ToString('x2'))
            })
            return $stringBuilder.ToString()
        }

    When you know the final length of the StringBuilder's contents, you can create it with that initial capacity. An MD5 hash is 16 bytes, so its hex representation is always 32 characters. That saves having to expand capacity.

    The ComputeHash() method is an inherited method from the HashAlgorithm class. You don't gain anything by instancing a more complex child class because you don't use it for anything else. You're just calling a hash function; there's nothing inherently cryptographic about what you're doing.
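To illustrate that only the base-class `ComputeHash()` call matters, here is a small standalone sketch. It swaps in the `[System.Security.Cryptography.MD5]::Create()` factory rather than the string-based `HashAlgorithm.Create('md5')` used above, and the `Get-Md5Hex` function name is hypothetical.

```powershell
# Hypothetical helper: MD5 of a string's UTF-8 bytes, as 32 lowercase hex characters.
function Get-Md5Hex {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$Text)

    # The variable only needs ComputeHash(), which is declared on HashAlgorithm.
    $md5 = [System.Security.Cryptography.MD5]::Create()
    try {
        $digest = $md5.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($Text))
    }
    finally {
        $md5.Dispose()
    }

    # 16 digest bytes become 32 hex characters, so the capacity is known up front.
    $stringBuilder = [System.Text.StringBuilder]::new(32)
    foreach ($byte in $digest) {
        [void]$stringBuilder.Append($byte.ToString('x2'))
    }
    return $stringBuilder.ToString()
}

Get-Md5Hex -Text 'abc'   # 900150983cd24fb0d6963f7d28e17f72
```

Changing the algorithm is then a one-line edit to the factory call, which makes it easy to confirm that the hash itself is not where the time goes.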

    The .ForEach({}) collection method can be somewhat faster in some cases, although it makes the code less readable and more obscure, so it is worth measuring against the plain foreach statement before committing to it.

    Eliminating variables that are only used once also tends to improve performance. Indeed, you could also just call [System.Security.Cryptography.HashAlgorithm]::Create('md5').ComputeHash(...), but it's getting a bit hard to read on StackOverflow.

    Offhand, I'm not sure if $stringBuilder.AppendFormat('{0:x2}',$_) is better or worse than $stringBuilder.Append($_.ToString('x2')), but I believe the latter is better here.
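One way to settle that question is to measure it. The sketch below times both variants with `Measure-Command`, plus a `BitConverter`-based alternative that is not in the code above. The three `ConvertTo-Hex*` function names, the stand-in 16-byte digest, and the iteration count are assumptions; the timings will vary by machine and PowerShell version, so no expected numbers are given.

```powershell
# Variant from the answer: Append() with a per-byte ToString('x2').
function ConvertTo-HexAppend {
    param([byte[]]$Bytes)
    $sb = [System.Text.StringBuilder]::new($Bytes.Length * 2)
    foreach ($b in $Bytes) { [void]$sb.Append($b.ToString('x2')) }
    return $sb.ToString()
}

# Variant under question: AppendFormat() with a composite format string.
function ConvertTo-HexAppendFormat {
    param([byte[]]$Bytes)
    $sb = [System.Text.StringBuilder]::new($Bytes.Length * 2)
    foreach ($b in $Bytes) { [void]$sb.AppendFormat('{0:x2}', $b) }
    return $sb.ToString()
}

# Alternative (not from the original code): let BitConverter build the hex string.
function ConvertTo-HexBitConverter {
    param([byte[]]$Bytes)
    return [System.BitConverter]::ToString($Bytes).Replace('-', '').ToLowerInvariant()
}

$digest = [byte[]](0..15)   # stand-in for a 16-byte MD5 digest
$iterations = 2000          # arbitrary; raise it for steadier numbers

$candidates = [ordered]@{
    'Append + ToString' = { ConvertTo-HexAppend -Bytes $digest }
    'AppendFormat'      = { ConvertTo-HexAppendFormat -Bytes $digest }
    'BitConverter'      = { ConvertTo-HexBitConverter -Bytes $digest }
}

foreach ($name in $candidates.Keys) {
    $block = $candidates[$name]
    $elapsed = Measure-Command {
        for ($i = 0; $i -lt $iterations; $i++) { $null = & $block }
    }
    '{0,-20} {1,10:N1} ms' -f $name, $elapsed.TotalMilliseconds
}
```

For a 16-byte digest the differences are likely to be tiny compared with serializing and hashing a few hundred KB of XML, so timing the ToString serialization step the same way may be more informative.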

    \$\endgroup\$
