VBScript Complex Regex Replace

A few days ago I posted a blog entry on simple regular expression replacements in VBScript. Let me show you a more complex example. It helps to have a purpose, even for a demonstration, so my goal here is to convert an HTML table to CSV output using regular expressions. We’re going to need the functions I’ve written about before, but I’ll post them again so you don’t have to go looking for them.

Function RegExReplace(strString,strPattern,strReplace)
    On Error Resume Next                ' Note: this also hides errors such as an invalid pattern.
    Dim RegEx
    Set RegEx = New RegExp              ' Create regular expression.
    RegEx.IgnoreCase = True             ' Make case insensitive.
    RegEx.Global = True                 ' Replace every match, not just the first.
    RegEx.Pattern = strPattern

    If RegEx.Test(strString) Then       ' Test if a match is made.
        RegExReplace = RegEx.Replace(strString, strReplace) ' Make replacement.
    Else
        RegExReplace = strString        ' No match: return the original string.
    End If
End Function
 
Function RegExMatch(strString,strPattern)
    Dim RegEx
    RegExMatch=False
    
    Set RegEx = New RegExp              
    RegEx.IgnoreCase = True             
    RegEx.Global=True                   
    RegEx.Pattern=strPattern
    
    If RegEx.Test(strString) Then RegExMatch=True
 
End Function 
 
Function GetMatch(strString,strPattern)
    Dim RegEx,colMatches
    Set RegEx = New RegExp              
    RegEx.IgnoreCase = True             
    RegEx.Global=True                   
    RegEx.Pattern=strPattern
    Set colMatches=RegEx.Execute(strString)
    Set GetMatch=colMatches
End Function

Let’s dig in. Here’s the table I want to parse.

DeviceID  Size          FreeSpace     Volumename   SystemName
C:        80024170496   8556191744                 CHAOS
F:        80023715840   71938871296   EDGE_DISKGO  CHAOS
L:        80024170496   8556191744                 CHAOS
Q:        250057060352  136200368128  New Volume   CHAOS

I’ll begin by reading in the contents of the html file and saving it to a variable.

Set objfso=CreateObject("Scripting.FileSystemObject")
Set objFile=objfso.OpenTextFile("c:\test\drives.html")
 
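'ReadAll reads the whole file in one call, so no loop is needed.
'The check avoids an error when the file is empty.
If Not objFile.AtEndOfStream Then
    html=objFile.ReadAll()
End If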
 
objFile.Close

The next step is to strip out just the table. Using a regular expression, I find the text that matches everything between and including the table tags.

'get just the table
If RegExMatch(html,"<table[^>]*>([\S\s]*?)</table>") Then
    Set matches=GetMatch(html,"<table[^>]*>([\S\s]*?)</table>")
    For Each match In matches
        tableText=Trim(match.value)
    Next
End If
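
The pattern above already contains a capture group, so the markup inside the table tags could also be read from SubMatches rather than from the whole match. Here is a minimal sketch of that idea. It assumes a made-up html string standing in for drives.html and uses the RegExp object directly instead of the helper functions.

```vbscript
' Sketch: use the capture group to get the table body without the <table> tags.
Option Explicit
Dim re, matches, html, inner

' Made-up sample input standing in for the contents of drives.html
html = "<html><body><table border=1>" & vbCrLf & _
       "<tr><td>C:</td><td>CHAOS</td></tr>" & vbCrLf & _
       "</table></body></html>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = False                             ' the first table is enough here
re.Pattern = "<table[^>]*>([\S\s]*?)</table>"

Set matches = re.Execute(html)
If matches.Count > 0 Then
    ' SubMatches(0) holds what the parenthesized group captured
    inner = matches(0).SubMatches(0)
    WScript.Echo inner
End If
```

Whether you want the table tags included (match.Value) or excluded (SubMatches) depends on what the later replacements expect. In this entry they are stripped later anyway.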

All that’s left at this point is to get rid of unnecessary tags like TR and convert the TH and TD tags into quotes and commas. Each call to my RegexReplace function replaces every match of its pattern in the string.

'strip off <tr> tags
tableText = RegexReplace(tableText,"</?tr[^>]*>","")
'convert </th><th> to ","
tableText = RegexReplace(tableText,"</th><th>",CHR(34) & "," & CHR(34))
'convert <th> or </th> to "
tableText = RegexReplace(tableText,"<th>|</th>",CHR(34))
'convert </td><td> to ","
tableText = RegexReplace(tableText,"</td><td>",CHR(34) & "," & CHR(34))
'convert <td> or </td> to "
tableText = RegexReplace(tableText,"<td>|</td>",CHR(34))
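
The literal patterns above only match when the cell tags carry no attributes and sit right next to each other. As a hedged sketch (a variation, not what I used for my file), the patterns can be loosened with \s* between the tags and [^>]* inside them. The sample row is made up.

```vbscript
' Sketch: cell conversion that tolerates attributes and whitespace between tags.
Option Explicit
Dim re, row, q
q = Chr(34)                                   ' a double-quote character

' Made-up sample row with an attribute and a space between the cells
row = "<td class=""drive"">C:</td> <td>CHAOS</td>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = True

' Closing cell tag, optional whitespace, opening cell tag  ->  ","
' \b keeps <thead> and similar tags from being treated as <th>
re.Pattern = "</t[dh]>\s*<t[dh]\b[^>]*>"
row = re.Replace(row, q & "," & q)

' Any cell tag still left (start or end of the row)  ->  "
re.Pattern = "</?t[dh]\b[^>]*>"
row = re.Replace(row, q)

WScript.Echo row
```

For the sample row this prints "C:","CHAOS". Real-world tables can still defeat it, for example with nested tables or quotes inside the cell text.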

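It’s possible there might be some tags still in my tableText variable, so I’ll process it one more time, looking for any remaining HTML tag and replacing it with a blank (“”). Note that the negative lookahead in this pattern deliberately leaves A, B, I and U tags in place.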

'strip off any remaining tags
tableText = RegexReplace(tableText,"<(?![!/]?[ABIU][>\s])[^>]*>","")
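
To see what that pattern keeps and what it removes, here is a small sketch run against a made-up string. It uses the RegExp object directly with the same settings as my RegexReplace function.

```vbscript
' Sketch: see which tags the negative lookahead leaves in place.
Option Explicit
Dim re, sample

' Made-up sample mixing formatting tags with other tags
sample = "<p><b>New</b> <a href=""#"">Volume</a> <span>CHAOS</span></p>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = True
' (?!...) rejects a tag whose name is exactly A, B, I or U (opening or closing),
' so those tags are not matched and therefore not removed
re.Pattern = "<(?![!/]?[ABIU][>\s])[^>]*>"

WScript.Echo re.Replace(sample, "")
```

This prints <b>New</b> <a href="#">Volume</a> CHAOS: the p and span tags are gone, while the b and a tags survive. If you want every tag removed, a plain "<[^>]*>" pattern would do that instead.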

Now the tricky part. If I look at tableText, there will be blank lines wherever a line held nothing but tags that I removed. Plus, if I want to save the output to a text file, I need some way to process this variable line by line. My solution was to turn it into an array and enumerate it, only displaying lines with a length greater than 0.

'turn remaining text into an array
arrText=Split(tabletext,VbCrLf)
 
'strip out blank lines
For i=0 To UBound(arrText)
    if Len(arrText(i)) >0 Then
        'or send output to a text file
        WScript.Echo arrText(i)
    End if
Next    
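
If you would rather save the lines than echo them, the same loop can write to a file. This is a minimal sketch. The output path c:\test\drives.csv is hypothetical and the tableText value is a made-up stand-in for the converted table.

```vbscript
' Sketch: write the non-blank lines to a CSV file instead of echoing them.
Option Explicit
Const OUT_PATH = "c:\test\drives.csv"         ' hypothetical output path
Dim fso, outFile, arrText, textLine, tableText

' Made-up stand-in for the converted table text, including blank lines
tableText = vbCrLf & """DeviceID"",""SystemName""" & vbCrLf & vbCrLf & _
            """C:"",""CHAOS""" & vbCrLf

arrText = Split(tableText, vbCrLf)

Set fso = CreateObject("Scripting.FileSystemObject")
Set outFile = fso.CreateTextFile(OUT_PATH, True)   ' True = overwrite if present

For Each textLine In arrText
    ' Skip lines that are empty or contain only spaces
    If Len(Trim(textLine)) > 0 Then outFile.WriteLine textLine
Next

outFile.Close
```

The folder in OUT_PATH has to exist already, because CreateTextFile does not create missing folders.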

When I run my script I get output like this:

"DeviceID","Size","FreeSpace","Volumename","SystemName"
"C:","80024170496","8556191744","","CHAOS"
"F:","80023715840","71938871296","EDGE_DISKGO","CHAOS"
"L:","80024170496","8556191744","","CHAOS"
"Q:","250057060352","136200368128","New Volume","CHAOS"

Now before you think I’m some Regex guru (I’m not, by any means), I didn’t come up with any of the more complex regular expression patterns. Instead I went to my favorite site for this sort of thing, RegexLib.com. Fortunately many people have already done the hard work of developing regular expression patterns for all sorts of things. A little search and copy/paste and I’m in business. Because regular expressions work much the same just about everywhere, you can usually reuse these expressions in VBScript, PowerShell, PHP, Perl or probably anything you happen to be working in, although each engine has its own quirks, so test a borrowed pattern before you rely on it.

Download a text file with code from this entry here.

As always, if you need help with regular expression scripts or any other scripting problem please join me in the forums at ScriptingAnswers.com.  Oh…don’t forget there is an entire chapter on using the REGEX object in VBScript in WSH and VBScript Core: TFM.


3 Responses to “VBScript Complex Regex Replace”

  1. Jim V Says:

    Excellent, Jeff. I hope you keep going and don’t give up on explaining this. We scripters should become proficient in regular expressions as they can save a tremendous amount of coding time and can do some things that are not really possible with linear code.

    There is a small bug in your code. It will only work on the version of the HTML you tested with, for a couple of reasons. First, Regex is not set up for multiline, so the line terminators will break the match logic if table elements cross line boundaries. My first attempt to run this showed that it was missing the match because it was spread across two lines in my copy of the table. In many cases the td and th pairs were separated by one or more spaces. In HTML it is permissible for spaces and line formatters to appear anywhere and in any number without breaking the HTML. This is the same as the “C”, “C#” and “C++” specifications, which allow us to make the code look any way that suits our needs.

    is the same as or or

    All line format characters are ignored by the HTML parsers but may not be ignored by RegEx match logic.

    From years of parsing HTML and C code for various reasons I have learned that we need to normalize the line formatters. The easiest way to do this is to strip all of them. There is RegEx code in RegexLib.com that will do this in one transform, but I have done it in two to show what is happening.

    First remove all line feeds and carriage returns. This will prevent future multiline issues, although you could also turn on multiline mode (?m) (Regex.Global=True). You have this enabled. This will also require a rethink of your match logic in some cases, as matches will now cross line boundaries. Your code already accounts for MultiLine mode but I prefer stripping newlines anyway and you will see why shortly.

    Remove all “tab” characters and space characters except space characters in the middle of words that occur in the text area of the tags.

    Example: My Empty Value needs to be My Empty Value. There are many Regexes to do this in various ways depending on application.

    Replace all elements with vbCrLf which will always work correctly if you do the above first.

    After this the remaining conversion steps will work correctly most of the time. Remember that this will only work for simple tables. Tables with style or other formatting will have to be stripped further first. There are numerous match scripts that will remove all attributes or convert them in some way as needed. The above will help with ensuring that they will work correctly, as many fail to factor in the line formatter issues.

    Here is my adjusted version of your code which will work with a few more variations but still not all. Notice that only one output statement is needed as all formatting necessary is already in the text.

    ' remove all line enders to prevent cross line match failures
    tableText = RegexReplace(tableText,vbCrLf,"")
    ' remove all blocks of spaces to single space.
    tableText = RegexReplace(tableText," ","")
    'strip off tags
    tableText = RegexReplace(tableText,"","")
    ' convert to vbCrLf newline chars
    tableText = RegexReplace(tableText,"",vbCrLf)
    'convert to ","
    tableText = RegexReplace(tableText,"",CHR(34) & "," & CHR(34))
    'convert or to "
    tableText = RegexReplace(tableText,"|",CHR(34))
    'convert to ","
    tableText = RegexReplace(tableText,"",CHR(34) & "," & CHR(34))
    'convert or to "
    tableText = RegexReplace(tableText,"|",CHR(34))
    'strip off any remaining tags
    tableText = RegexReplace(tableText,"\s])[^>]*>","")

    ' text should now dump as properly formatted CSV.
    WScript.Echo tableText

    Have you made any headway with the use of callbacks to do group replacement?

  2. Jim V Says:

    Jeff

    In case you haven’t already seen this, take a look here: the dot

    It is the best explanation for some behaviors that I have found so far.

  3. Jeffery Hicks Says:

    I’m not surprised there are problems. I should have been clearer that my example wasn’t intended as a ready to roll solution. It worked for me with the particular HTML file I was using. Thanks for the clarifications and suggestions.

