Ramblings and Thoughts Related to Software and Website Development. I Moderate This Blog With An Iron Fist. There Will Be No Mercy!
Published on August 28, 2008 By andrew_ In Software Development

Every day I spend a little time (a.k.a. lunch) tinkering with side projects or questions I've raised to myself while browsing the net, or thinking about ideas for apps I would like to write, have already written, or wrote and sent to the graveyard. Lately I've been toying with HTML parsing and came upon something I haven't dealt with much in web development: CSS selectors. CSS selectors are a really neat way of selecting nodes/elements in an HTML document based on their position in the document tree and on criteria such as element name, id, class, and attributes.

Some examples of CSS selectors:

Code: css
  DIV P *[href] {}
  DIV OL > LI P {}
  DIV > P:first-child { text-indent: 0; }
  H1.opener + H2 { margin-top: -5mm; }
  SPAN[hello="Cleveland"][goodbye="Columbus"] { color: blue; }
  .body .forum { width: 800px; }

Now, there's nothing native (outside of a web browser) for .NET that will parse or handle these selectors. That makes sense, because there isn't a native HTML DOM parser either. However, XML documents are well represented in the CLR. After all, most of the sites web developers build now are XHTML anyhow, which is just strict HTML based on XML (>> loosely translated <<). XPath is a means of selecting nodes in an XML document in much the same way that CSS selectors work. So I went on a mission to find a method of translating, or converting, CSS selectors to an XPath statement. Lo and behold, Joe Hewitt came up with some JavaScript a few years ago, and it has propagated across the web thoroughly like any good little script.
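To make the CSS/XPath parallel concrete, here is a minimal sketch using the standard System.Xml classes. The class name XPathDemo, the helper CountMatches, and the sample markup are made up for illustration, and the XPath expressions are written by hand rather than generated.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Hypothetical helper: loads an XML string and counts the nodes
    // that an XPath expression selects.
    public static int CountMatches(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Sample markup (an assumption, not taken from a real page).
        string xml = "<div><p>one</p><ol><li><p>two</p></li></ol></div>";

        // CSS "DIV > P" (child combinator) corresponds to "/" in XPath.
        Console.WriteLine(CountMatches(xml, "//div/p"));   // prints 1

        // CSS "DIV P" (descendant combinator) corresponds to "//" in XPath.
        Console.WriteLine(CountMatches(xml, "//div//p"));  // prints 2
    }
}
```

One thing to watch for: XPath name tests are case-sensitive, and a real XHTML page that declares the XHTML default namespace would need an XmlNamespaceManager and prefixed names before expressions like these match anything.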

So today I translated it into something usable for .NET. The conversion is to C#. If you're a VB.NET developer and you can't read this, and/or can't translate between C# and VB.NET with ease, then shame on you. Go make yourself SMRT and learn the C# (read: C) syntax already.

Code: c#
  // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
  // Place this method inside a class of your choice.
  //Rules verified from http://plasmasturm.org/log/444/
  //Converted from http://www.joehewitt.com/blog/files/getElementsBySelector.js
  public static string CssToXPath(string rule)
  {
     Regex rElement = new Regex(@"^([#.]?)([a-z0-9\\*_-]*)((\|)([a-z0-9\\*_-]*))?", RegexOptions.IgnoreCase | RegexOptions.ECMAScript);
     Regex rAttr1 = new Regex(@"^\[([^\]]*)\]", RegexOptions.IgnoreCase | RegexOptions.ECMAScript);
     Regex rAttr2 = new Regex(@"^\[\s*([^~=\s]+)\s*(~?=)\s*""([^""]+)""\s*\]", RegexOptions.IgnoreCase | RegexOptions.ECMAScript);
     Regex rPseudo = new Regex(@"^:([a-z_-])+", RegexOptions.IgnoreCase | RegexOptions.ECMAScript);
     Regex rCombinator = new Regex(@"^(\s*[>+\s])?", RegexOptions.IgnoreCase | RegexOptions.ECMAScript);
     Regex rComma = new Regex(@"^\s*,", RegexOptions.IgnoreCase | RegexOptions.ECMAScript);
     // index points at the element-name slot of the step currently being built
     int index = 1;
     List<string> parts = new List<string>();
     parts.Add("//");
     parts.Add("*");
     string lastRule = null;
     // Consume the selector from the left; stop when a pass makes no progress
     while (rule.Length > 0 && rule != lastRule)
     {
        lastRule = rule;
        // Trim leading and trailing whitespace
        rule = Regex.Replace(rule, @"^\s*|\s*$", "");
        if (rule.Length == 0)
           break;
        // Match the element identifier
        Match m = rElement.Match(rule);
        if (m.Success)
        {
           if (m.Groups[1].Length == 0)
           {
              //XXXjoe Namespace ignored for now
              // Only overwrite the slot when a name was actually matched;
              // rElement also succeeds on an empty match (e.g. before "[attr]")
              if (m.Groups[5].Length > 0)
                 parts[index] = m.Groups[5].Value; //"ns:" + m.Groups[5].Value;
              else if (m.Groups[2].Length > 0)
                 parts[index] = m.Groups[2].Value; //"ns:" + m.Groups[2].Value;
           }
           else if (m.Groups[1].Value == "#")
              parts.Add("[@id='" + m.Groups[2].Value + "']");
           else if (m.Groups[1].Value == ".")
              parts.Add("[contains(@class, '" + m.Groups[2].Value + "')]");
           rule = rule.Substring(m.Groups[0].Value.Length);
        }
        // Match attribute selectors: [name="value"] or [name~="value"] first,
        // then the bare [name] form
        m = rAttr2.Match(rule);
        if (m.Success)
        {
           if (m.Groups[2].Value == "~=")
              parts.Add("[contains(@" + m.Groups[1].Value + ", '" + m.Groups[3].Value + "')]");
           else
              parts.Add("[@" + m.Groups[1].Value + "='" + m.Groups[3].Value + "']");
           rule = rule.Substring(m.Groups[0].Value.Length);
        }
        else
        {
           m = rAttr1.Match(rule);
           if (m.Success)
           {
              parts.Add("[@" + m.Groups[1].Value + "]");
              rule = rule.Substring(m.Groups[0].Value.Length);
           }
        }
        // Skip over pseudo-classes and pseudo-elements, which are of no use to us
        m = rPseudo.Match(rule);
        while (m.Success)
        {
           rule = rule.Substring(m.Groups[0].Value.Length);
           // Re-match against the shortened rule; NextMatch() would keep
           // scanning the old string, where the ^ anchor can never match again
           m = rPseudo.Match(rule);
        }
        // Match combinators
        m = rCombinator.Match(rule);
        if (m.Success && m.Groups[0].Value.Length > 0)
        {
           if (m.Groups[0].Value.IndexOf(">") != -1)
              parts.Add("/");
           else if (m.Groups[0].Value.IndexOf("+") != -1)
              // Note: this selects every following sibling, whereas CSS "+"
              // means only the immediately adjacent one
              parts.Add("/following-sibling::");
           else
              parts.Add("//");
           index = parts.Count;
           parts.Add("*");
           rule = rule.Substring(m.Groups[0].Value.Length);
        }
        // A comma starts a new selector; join the alternatives with the XPath union operator
        m = rComma.Match(rule);
        if (m.Success)
        {
           parts.Add(" | ");
           parts.Add("//");
           parts.Add("*");
           index = parts.Count - 1;
           rule = rule.Substring(m.Groups[0].Value.Length);
        }
     }
     string xpath = string.Join("", parts.ToArray());
     return xpath;
  }
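
One caveat with the class handling above: contains(@class, 'forum') is a substring test, so it would also match class="forums". A commonly used XPath 1.0 idiom pads the attribute with spaces and compares whole tokens instead. The sketch below is not part of Joe Hewitt's script or of the conversion above; the names ClassSelectorDemo, ClassPredicate, and Count, as well as the sample markup, are hypothetical.

```csharp
using System;
using System.Xml;

public static class ClassSelectorDemo
{
    // Hypothetical helper: builds a predicate that matches a whole class
    // token rather than any substring of the class attribute.
    public static string ClassPredicate(string className)
    {
        return "[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
    }

    // Hypothetical helper: counts the nodes an XPath expression selects.
    public static int Count(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Sample markup (an assumption): two real "forum" elements and one "forums".
        string xml = "<div><p class='forum'/><p class='big forum'/><p class='forums'/></div>";

        // Substring test, as emitted for ".forum" by the code above: also hits "forums".
        Console.WriteLine(Count(xml, "//*[contains(@class, 'forum')]"));  // prints 3

        // Token test: only elements that really carry the class "forum".
        Console.WriteLine(Count(xml, "//*" + ClassPredicate("forum")));   // prints 2
    }
}
```

If that precision matters for your documents, the "." branch of CssToXPath could emit this predicate in place of the plain contains() call.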

Anyhow, hopefully this will help someone stumbling around teh intartubes looking for a solution to this.

