BgRva

software, data, designs, …

NUnit Tricks for Parameterized Tests

Having spent a fair amount of time using unit test frameworks, you inevitably come up with a few tricks to make things more efficient. This post simply describes some of those tricks that I’ve picked up while testing code. The examples use NUnit version 2.6, but I’ve applied these same techniques to MbUnit and xUnit, and am currently using them with the NUnit 3.0 beta.

Test Ids to Increase Readability

Often, some units under test have a wide variation of input parameters and require many test cases to effectively beat on the code. Take the following example. Though it is contrived, it demonstrates situations I’ve had to handle in almost every project I’ve been involved with. The class below has a single method to verify some input string based on a defined rule set.

Verifier.cs class
public class Verifier
{
    /// <summary>
    /// Returns true if input contains the substring of criteria.ToUpper()
    /// or criteria.ToLower() and the qualifier occurs after the criteria.
    /// </summary>
    /// <param name="input">The input string being validated</param>
    /// <param name="criteria">The criteria for validating the input string</param>
    /// <returns>bool</returns>
    public bool Verify(string input, string criteria, string qualifier)
    {
        var criteriaIndex = input.IndexOf(criteria.ToLower());
        if(criteriaIndex < 0)
        {
            criteriaIndex = input.IndexOf(criteria.ToUpper());
        }

        if (criteriaIndex >= 0 && input.IndexOf(qualifier, criteriaIndex) > criteriaIndex)
            return true;

        return false;
    }
}

Effective unit tests for the code above would need to cover cases returning true, cases returning false, and error cases (I would normally generate the tests before the method, but for simplicity I will dispense with preaching TDD). For the true cases, here are several approaches using NUnit (for brevity’s sake, I have not included all possible variations of input parameters that return true).

VerifierFixture.cs class
[TestFixture]
public class VerifierFixture
{
    [TestCase(0, "blahA blah", "blah", "A")]
    [TestCase(1, "blah Ablah", "blah", "A")]
    [TestCase(2, "blah blahA", "blah", "A")]
    [TestCase(3, "AblahA blah", "blah", "A")]
    [TestCase(4, "Ablah Ablah", "blah", "A")]
    [TestCase(5, "Ablah blahA", "blah", "A")]
    // more ...
    public void IsValid_Returns_True_By_TestCase(int id, string input, string criteria, string qualifier)
    {
        //Arrange
        var validator = new Verifier();

        //Act
        var result = validator.Verify(input, criteria, qualifier);

        //Assert
        Assert.True(result);
    }

    [Test, TestCaseSource("VerifyCasesThatAreTrue")]
    public void IsValid_Returns_True_By_TestCaseSource(int id, string input, string criteria, string qualifier)
    {
        //Arrange
        var validator = new Verifier();

        //Act
        var result = validator.Verify(input, criteria, qualifier);

        //Assert
        Assert.True(result);
    }

    static object[] VerifyCasesThatAreTrue =
    {
        new object[] {0, "blahA blah", "blah", "A"},
        new object[] {1, "blah Ablah", "blah", "A"},
        new object[] {2, "blah blahA", "blah", "A"},
        new object[] {3, "AblahA blah", "blah", "A"},
        new object[] {4, "Ablah Ablah", "blah", "A"},
        new object[] {5, "Ablah blahA", "blah", "A"},
        // more ...
    };
}

Both test methods in the fixture above perform the same tests, but each handles the multiple test inputs using different features of NUnit. You will notice a unique id associated with each test input; this is one of those ‘tricks’ I picked up. By assigning a unique id to test methods with multiple inputs, you can quickly find a specific test case. For example, with the ids included in the NUnit runner output it is easy to identify in the code the specific test cases that fail.

When should you use unique ids with your parameterized test cases? This is completely up to your style of test kung-fu. If it helps you more efficiently beat on your code then by all means do it.
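
Before moving on to the next trick: the false and error cases mentioned at the start of the post can be covered with the same approach. Here is a short sketch; the methods and inputs below are my own illustration (they could be added to the VerifierFixture above), and the error cases simply document the current behavior of Verify(), which throws on null arguments.

additional cases (sketch)
    [TestCase(0, "blah blah", "blah", "A")]   // qualifier never appears
    [TestCase(1, "Ablah blah", "blah", "A")]  // qualifier only before the criteria
    public void IsValid_Returns_False_By_TestCase(int id, string input, string criteria, string qualifier)
    {
        //Arrange
        var validator = new Verifier();

        //Act
        var result = validator.Verify(input, criteria, qualifier);

        //Assert
        Assert.False(result);
    }

    [TestCase(0, null, "blah", "A")]
    [TestCase(1, "blah blah", null, "A")]
    public void Verify_Throws_On_Null_Arguments(int id, string input, string criteria, string qualifier)
    {
        //Arrange
        var validator = new Verifier();

        //Act + Assert
        Assert.Throws<System.NullReferenceException>(() => validator.Verify(input, criteria, qualifier));
    }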

Overriding ToString() in Complex Parameters

While NUnit recognizes the individual test parameters and consequently each individual test case, it does not effectively present this information to the user. For example, let’s say there is an additional Verify() method that takes in a custom object called UserItem as a parameter. UserItem is a stand-in for any POCO that you may need to validate.

Verifier.cs and UserItem.cs
public class Verifier
{
    public bool Verify(string input, string criteria, string qualifier)
    {
        var criteriaIndex = input.IndexOf(criteria.ToLower());
        if(criteriaIndex < 0)
        {
            criteriaIndex = input.IndexOf(criteria.ToUpper());
        }

        if (criteriaIndex >= 0 && input.IndexOf(qualifier, criteriaIndex) > criteriaIndex)
            return true;

        return false;
    }

    //New Method
    public bool Verify(UserItem input, string criteria, string qualifier)
    {
        return Verify(input.Name, criteria, qualifier);
    }
}

public class UserItem
{
    public DateTime TimeStamp { get; set; }
    public string Name { get; set; }
    public string SomeThing { get; set; }
}

In order to set up multiple test cases for this new method, we are unable to use the TestCaseAttribute out of the box since we cannot embed a UserItem constructor in an attribute. But, we can still use the TestCaseSourceAttribute like so:

VerifyByUserItemFixture.cs
[TestFixture]
public class VerifyByUserItemFixture
{
    [Test, TestCaseSource("VerifyCasesThatAreTrue")]
    public void IsValid_Returns_True_By_TestCaseSource(int id, UserItem input, string criteria, string qualifier)
    {
        //Arrange
        var validator = new Verifier();

        //Act
        var result = validator.Verify(input, criteria, qualifier);

        //Assert
        Assert.True(result);
    }

    static object[] VerifyCasesThatAreTrue =
    {
        new object[] {0, new UserItem() { Name = "blahA blah" }, "blah", "A"},
        new object[] {1, new UserItem() { Name = "blah Ablah" }, "blah", "A"},
        new object[] {2, new UserItem() { Name = "blah blahA" }, "blah", "A"},
        new object[] {3, new UserItem() { Name = "AblahA blah" }, "blah", "A"},
        new object[] {4, new UserItem() { Name = "Ablah Ablah" }, "blah", "A"},
        new object[] {5, new UserItem() { Name = "Ablah blahA" }, "blah", "A"},
        // more ...
    };
}

When you look at the output you will notice that although NUnit identifies the individual test cases, it displays each UserItem using the default ToString() implementation, which just shows the type name:

To improve the readability of each test, you can override the ToString() method:

UserItem.cs with ToString() override
public class UserItem
{
    public DateTime TimeStamp { get; set; }
    public string Name { get; set; }
    public string SomeThing { get; set; }
    public override string ToString()
    {
        return string.Format("{0};{1};{2}", Name, TimeStamp, SomeThing);
    }
}

NUnit uses the parameter values to differentiate individual test cases. If you do override ToString(), make sure it returns a unique value for each object so that NUnit does not skip test cases it considers duplicates, or use the separate id parameter technique mentioned earlier. Using the id will guarantee each input results in an individual test even if the ToString() override does not provide unique strings for each object.

Just a few tricks that may help you in your code writing endeavors. Test early & test often: the more you beat on your code the stronger is the force in you.

Using NUnit Theory for Integration Tests

NUnit TheoryAttributes (v2.5, v3.0b) can work well for integration tests and for testing generic methods. The integration tests I’m referring to are those created by a developer during development:

Integration Tests are developer-generated tests (e.g. from TDD) that are applied to units of code that have dependencies on other units

Unit Tests are small, highly focused tests which are applied to a unit of code in isolation

The precise boundary of when a developer-generated test is considered unit vs integration is squishy at best, and depending on who you ask or where you look you will find different answers (see M. Fowler). Even with this distinction between the test types, both may be implemented with the same framework (i.e. NUnit). Ultimately, test early and test often.

Using the TheoryAttribute allows testing a single unit for which the inputs are a bit more complex than simple parameters, and the behavior under test may have a broader range of edge cases. Integration tests can be used to exercise a method for ‘typical’ scenarios, and I’ve often found that trying to cover every possible scenario can be unrealistic. From the NUnit documentation:

A Theory is a special type of test, used to verify a general statement about the system under development. Normal tests are example-based.

(*) as of this post, this statement is applicable to NUnit version 2.5 through the 3.0 beta.

The code samples below are part of a project to generate Scenario objects that contain the input data for testing the functionality of a graph library (yes, I’m testing a library for testing). The Scenarios start out as YAML files and are parsed to an object. Data (of varying types) can be associated with each scenario and is read in from separate YAML data files. All of the parsing and hard work is handled by YamlDotNet by Antoine Aubry. You do not need to worry about YAML or graph functionality in the code samples below; this is just a little background.

YamlScenarioReader: This class reads a scenario file and returns a Scenario object

YamlScenarioReader class
public class YamlScenarioReader
{
    public Scenario ReadFile(string fileName)
    {
        Scenario scenario = null;

        var deserializer = new Deserializer(namingConvention: new CamelCaseNamingConvention());

        using (var sr = new StreamReader(fileName))
        {
            scenario = deserializer.Deserialize<Scenario>(sr);
        }

        return scenario;
    }
}

Even though YamlDotNet will be doing the heavy lifting, I still want to run some tests that read in every file as a cursory inspection of the file formats, since the test scenario files tend to change often (see the post on YAML further down). These are an example of what I consider integration tests.
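
As a sketch of what such a cursory test might look like for the scenario reader: the file paths below are placeholders, and the asserted counts assume the Scenario class properties shown in the YAML post further down.

YamlScenarioReaderFixture (sketch)
[TestFixture]
public class YamlScenarioReaderFixture
{
    // Placeholder paths; one TestCase per scenario file in the project.
    [TestCase(@"F:\GraphProject\DataFiles\Scenarios\scenario.001.yaml")]
    [TestCase(@"F:\GraphProject\DataFiles\Scenarios\scenario.004.yaml")]
    public void ReadFile_Returns_Scenario(string fileName)
    {
        //Arrange
        var reader = new YamlScenarioReader();

        //Act
        var scenario = reader.ReadFile(fileName);

        //Assert
        Assert.NotNull(scenario);
        Assert.AreEqual(scenario.NodeCountRaw, scenario.Nodes.Count);
        Assert.AreEqual(scenario.EdgeCountRaw, scenario.Edges.Count);
    }
}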

For reading the data, there is a corresponding reader with a T type parameter indicating the type of the underlying data to be read in. Again, YamlDotNet is doing the hard work; all we need to do is provide a way of validating the files.

YamlNodeDataReader class
public class YamlNodeDataReader<T>
{
    public Dictionary<int, T> ReadFile(string fileName)
    {
        var data = new Dictionary<int, T>();

        var deserializer = new Deserializer(namingConvention: new CamelCaseNamingConvention());

        using (var sr = new StreamReader(fileName))
        {
            data = deserializer.Deserialize<Dictionary<int, T>>(sr);
        }

        return data;
    }
}

The test fixture is as follows:

YamlNodeDataReaderFixture class
public class YamlNodeDataReaderFixture
{
    [TestFixture(typeof(double))]
    [TestFixture(typeof(DummyItem))]
    public class ReadFile<T>
    {
        [Datapoints]
        public DataFileWrapper<double>[] DataFilesOfDouble = new[]
                    {
                    new DataFileWrapper<double>(@"F:\GraphProject\DataFiles\Data\edge.A.001.double.txt"),
                    new DataFileWrapper<double>(@"F:\GraphProject\DataFiles\Data\edge.A.002.double.txt"),
                    };

        [Datapoints]
        public DataFileWrapper<DummyItem>[] DataFilesOfDummyItem = new[]
                    {
                    new DataFileWrapper<DummyItem>(@"F:\GraphProject\DataFiles\Data\edge.A.001.DummyItem.txt"),
                    };

        [Theory]
        public void Returns_Dictionary(DataFileWrapper<T> dataFile)
        {
            //Arrange
            var reader = dataFile.GetNodeReader();

            //Act    
            var data = reader.ReadFile(dataFile.FileName);

            //Assert
            Assert.NotNull(data);

            var maxKey = (data.Keys.Select(k => k)).Max();

            Assert.AreEqual(data.Count, maxKey + 1);
        }
    }
}

public class DataFileWrapper<T>
{
    public DataFileWrapper() { }

    public DataFileWrapper(string fileName)
    {
        FileName = fileName;
    }

    public INodeDataReader<T> GetNodeReader()
    {
        return new YamlNodeDataReader<T>();
    }

    public IEdgeDataReader<T> GetEdgeReader()
    {
        return new YamlEdgeDataReader<T>();
    }

    public string FileName { get; set; }
}

The test method annotated with the Theory attribute uses the fields annotated with Datapoints to provide the input data. Each Datapoints field should correspond to one of the types indicated in the TestFixture attributes atop the ReadFile class definition.

YamlNodeDataReaderFixture has “fixture” in its name, but it does not have a unit test attribute. I use the convention of a class for each class under test, with nested classes for each method under test (see Structuring Unit Tests by Phil Haack).

The ReadFile class is the test fixture for testing the ReadFile() method of YamlNodeDataReader. Each TestFixture attribute indicates the type argument to use for T.

DataFilesOfDouble is a field holding an array of DataFileWrapper objects, one for each data file containing data of type double

DataFilesOfDummyItem is a field holding an array of DataFileWrapper objects, one for each data file containing data of type DummyItem

Returns_Dictionary() is the test method that applies the input data to the ReadFile() method under test.

DummyItem is simply a class used to represent more complex data in the data files.

DataFileWrapper is a helper class needed in the example to provide a single generic input parameter to the Returns_Dictionary() test method; it also provides methods that return the appropriate reader based on the type of the data. While not absolutely necessary, it makes the tests cleaner and easier to maintain.
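
For completeness, the fixture above references a few supporting types that are not shown in the listing (INodeDataReader<T>, IEdgeDataReader<T>, and DummyItem). The definitions below are a minimal sketch of what they might look like; they are assumptions for illustration, not the original code.

supporting types (sketch)
using System.Collections.Generic;

// Reader abstractions assumed by DataFileWrapper; YamlNodeDataReader<T> (shown
// earlier) would implement INodeDataReader<T>, and a corresponding
// YamlEdgeDataReader<T> would implement IEdgeDataReader<T>.
public interface INodeDataReader<T>
{
    Dictionary<int, T> ReadFile(string fileName);
}

public interface IEdgeDataReader<T>
{
    Dictionary<int, T> ReadFile(string fileName);
}

// Stand-in for more complex data in the data files; any POCO that
// YamlDotNet can deserialize would do.
public class DummyItem
{
    public string Name { get; set; }
    public double Value { get; set; }
}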

Here is an example of the output (in ReSharper) showing how the nested test classes for each method show up:

One annoyance I have with parameterized Theory tests is that there is no way to distinguish the individual tests by their respective parameter values. This can cause a little grief when trying to identify which parameters failed and isolate them, but it is a minor inconvenience for the ability to beat on your code with automated tests.

A New Graph Library for C#

You are probably thinking: Really? Similar libraries exist and they are just fine, so why a new one? Before I dive in to the pros and cons of this adventure, let me clarify what I mean by “Graph” Library. I’m talking about a graph as a set of nodes and edges where each edge connects two nodes. There are several libraries available in C# that fall within this scope and they are all quite awesome. A few are listed below, but excluded are graph libraries for visualization or for specific purposes such as ontologies.

The thing about implementing a graph in code is that it is all about compromises. Some implementations may emphasize speed, extensibility, customizability, etc., but each will have its strengths and weaknesses. For example, here is a simple case of a compromise that may arise when considering a graph implementation as an adjacency list. Assume we are implementing a graph and want to allow consumers of the code to access the nodes by index from 0 to n-1 (where n is the number of nodes in the graph) in O(1) time. The code samples that follow are simply interface definitions for discussion purposes; implementation details are all hand-waving.

public interface INode
{
    int Index { get; }
}

public interface IGraph
{
    IList<INode> Nodes { get;}

    int NodeCount { get; }
}

// Given an instance of IGraph g, access nodes by index
var node1 = g.Nodes[1];
var node2 = g.Nodes[2];

How nodes are accessed is coupled with how nodes are added and removed. Should we allow the user to create a node object and add the node to the graph or should we allow the graph to handle creation of the node? Each choice has implications for the rest of the implementation. Consider the following two approaches for adding a node:

public interface IGraph
{
    List<INode> Nodes { get;}

    int NodeCount { get; }

    /// <summary>
    /// #1 The user creates the node and adds it to the graph object and returns
    /// the graph assigned index of the node.
    /// </summary>
    int AddNode( INode newNode);

    /// <summary>
    /// #2 The graph instance creates and returns the node object, with an assigned index
    /// </summary>
    INode AddNode();
}

// Given an instance of IGraph g:
// #1
var node1 = new Node();
var index = g.AddNode(node1);
var node = g.Nodes[index];

// #2
var g = new Graph();
var node = g.AddNode();

Whereas #1 requires the user to instantiate a node, in #2 the graph maintains more control over instantiation but reduces the flexibility of where and how the user can create a node as the graph object must be instantiated first. Both approaches still allow for accessing any node by index, but things get sticky when we consider removing a node:

var g = new Graph();
// create 10 nodes:
for(int i = 0; i< 10; i++)
    g.AddNode();

// 10 nodes - nice; now iterate through them
for(int i = 0; i< g.NodeCount; i++)
    Console.WriteLine(g.Nodes[i].ToString());

//  *********: Here is the issue *********:
g.RemoveNodeAt(4);

// 9 nodes; now iterate through them
for(int i = 0; i< g.NodeCount; i++)
    Console.WriteLine(g.Nodes[i].ToString());

If the IGraph implementation provides the ability to access nodes by index from 0 to n-1 and we remove a node, the implementation needs to guarantee the indexing behavior. This may require re-indexing all the nodes, which is an O(n) operation. If the graph is being implemented specifically for cases where users are just creating graphs from static data and do not require many removal operations, then this could be an acceptable solution. Realistically, an O(n) operation is not that painful, since iterating over collections happens all the time anyway. But for large graphs with significant amounts of add and remove operations, this could impact the user experience. At the very least, such behavior and its implications should be clearly documented (in sample code as well) for the consumer.
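
To make the trade-off concrete, here is a minimal sketch of what a removal that preserves the 0..n-1 indexing might look like. Graph, Node, and RemoveNodeAt are hypothetical names used only for this discussion, not part of an existing library.

re-indexing sketch
using System.Collections.Generic;

public class Node
{
    public int Index { get; set; }
}

public class Graph
{
    private readonly List<Node> _nodes = new List<Node>();

    public int NodeCount { get { return _nodes.Count; } }

    public Node AddNode()
    {
        var node = new Node { Index = _nodes.Count };
        _nodes.Add(node);
        return node;
    }

    // Removing a node by index is O(n): the list shifts its elements, and every
    // node after the removed one must be re-indexed to preserve the 0..n-1
    // indexing guarantee.
    public void RemoveNodeAt(int index)
    {
        _nodes.RemoveAt(index);
        for (int i = index; i < _nodes.Count; i++)
            _nodes[i].Index = i;
    }
}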

So back to the call for a New Graph Library for C#. The driving idea (and need) for implementing this is to provide a general purpose graph structure that is quick to pick up, intuitive to use, and straightforward for representing data. By “general purpose” we are not aiming to create a graph implementation optimized for narrow use cases (e.g. speed, memory, algorithms, etc.), but rather one that can be used in a wide latitude of cases needing a graph structure (and one which will handle particular issues such as Self-Loops and Parallel Edges). Additionally, it needs to be small, uncluttered, and it needs to “just work”. This implies quality documentation and sample code: not just more documentation, but the right documentation. Finally, and perhaps most importantly, this will be built for .Net Core so it can run on Mac, Linux, and Windows.

The name will inevitably come out of the initial discussions (though for some reason “dream-graph”, with visions of flowers off of the Mystery Machine, has been jokingly put forth so far). More to come, and I invite you all to beat on it once the code starts flowing.

Snowball Extraction of a Sub-Network

What if you need a sub-network from a larger one? How would you extract it? I’ll walk you through the steps of a Snowball extraction (or extraction by steps). I have no doubt this goes by many names, but Snowball is the name I’ve been familiar with for years.

Assume you are given a network, and by network I mean a graph structure with data attributes on the nodes and/or edges such that the graph represents data connected with respect to a particular context. For example, take a simple graph:

Add some names and now it is a simple social network:

Graph + Data = Network

A Snowball Extraction

Let’s start with the following example network and perform a snowball extraction 3 steps/hops out.

Step 1: given a set of seed nodes: (2, 16), find all nodes 1 step or hop out from the seed nodes. The nodes we are interested in are the neighbors of the seed nodes along outgoing edges, (in this example we will be considering the directionality of the edges). Seed nodes are in green:

Step 2: Now gather the new nodes found in Step 1: (1, 3, 9, 10, 36) and repeat the process:

Step 3: Get the new nodes from step 2: (4, 6, 8, 1, 12, 17, 34) and repeat:

Step 4: With the new nodes from step 3: (5, 7, 13, 15, 18, 19, 33), combine all found nodes and edges for the resulting sub-network:

Note that with different seed nodes you may end up with a drastically different sub-network; much depends on the structure and properties of the main network. Additionally, how you handle duplicate edges, self-loops, and parallel edges is completely determined by your problem at hand. I’ve attempted to formalize the steps as pseudo-code below, but again the complexities of the problem at hand will affect how this should be implemented.

pseudo-code
foundNodes {}   // all nodes found in the snowball so far
currentNodes {} // the current nodes from which we are stepping out
currentEdges {} // the edges we find outgoing from the current nodes
maxStep = 3
currentStep = 1

nextNodes = {SEEDS}  // initialized to the seed nodes, e.g. 2, 16

while (currentStep <= maxStep)
  currentNodes = nextNodes
  foundNodes.Add(currentNodes)

  foreach edge
    If edge.SourceId is in currentNodes
      currentEdges.Add(edge)

  nextNodes = {}
  foreach edge in currentEdges
    If edge.TargetId is NOT in foundNodes
      nextNodes.Add(edge.TargetId)

  currentStep = currentStep + 1

foundNodes.Add(nextNodes)  // include the nodes found on the final step
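
Here is a minimal C# sketch of the pseudo-code above. The Edge type, the allEdges collection, and the SnowballResult container are assumptions for illustration; adapt them to your own graph types.

snowball extraction sketch
using System.Collections.Generic;
using System.Linq;

public class Edge
{
    public int SourceId { get; set; }
    public int TargetId { get; set; }
}

public class SnowballResult
{
    public HashSet<int> Nodes = new HashSet<int>();
    public HashSet<Edge> Edges = new HashSet<Edge>();
}

public static class Snowball
{
    public static SnowballResult Extract(IList<Edge> allEdges, IEnumerable<int> seeds, int maxStep)
    {
        var result = new SnowballResult();
        var nextNodes = new HashSet<int>(seeds);

        for (var step = 1; step <= maxStep; step++)
        {
            var currentNodes = nextNodes;
            result.Nodes.UnionWith(currentNodes);

            // edges outgoing from the current frontier
            var currentEdges = allEdges.Where(e => currentNodes.Contains(e.SourceId)).ToList();
            result.Edges.UnionWith(currentEdges);

            // targets we have not seen yet become the next frontier
            nextNodes = new HashSet<int>(
                currentEdges.Select(e => e.TargetId).Where(t => !result.Nodes.Contains(t)));
        }

        result.Nodes.UnionWith(nextNodes); // nodes reached on the final step
        return result;
    }
}

For the example above, calling Snowball.Extract(allEdges, new[] { 2, 16 }, 3) would yield the combined nodes and edges of the resulting sub-network.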

How I Learned to Love YAML

This post describes how I came to use YAML to handle complex data scenarios for integration test cases.

For a proponent of TDD, generating test data is a necessity. When the units under test increase in complexity, so does the test data. Test data can range from simple input strings used in NUnit TestCaseAttribute entries to custom test scenario objects which include raw data as well as expectations of varying outcomes.

Generating complex test scenarios was especially necessary when I was working on an application that did graph manipulation and required converting raw data to graph structures. Because of the variations in graph structures that had to be handled when converting from input data, the test scenarios had to provide the raw data to be converted to a graph, the expected graph structure, and other expectations. These custom scenario objects required code to read them in and to access the data and expectations, as well as a way of managing changes without requiring brain surgery on the codebase.

As an example of the range of test data needed for converting data to graphs, here are 4 trivial examples of input data and the expected graph:

The combinations of different sub-cases resulted in multiple scenarios that needed to be built and managed. To compound things, each test case’s data was hand-jammed based on a known input and expected output.

In the past, I would end up with some fairly easy to read text file format and multiple parsers to ingest the files and create the scenario objects to feed the tests. While the parsers themselves were simple and required limited error handling, the requirements of the scenarios would inevitably change over the course of the project. For example, let’s say based on some new functionality, in order to implement the tests we needed to know the expected number of self-loops (an edge that has the same source and target node) for an input graph. This would require adding a new property to the scenario object, updating the parsers, and verifying the parsers worked. While not difficult or risky, these updates still sponge up valuable time and resources.

How come I didn’t format the data as XML or JSON? Because the integration tests requiring the complex test data grew out of need rather than full up-front awareness and planning (these being developer level integration tests arising from TDD). The data usually started out for use in simple unit tests, and over time it was aggregated to form a scenario. The code to support the scenarios grew and morphed as the tests and the project code did.

Here is where YAML enters the picture. Very briefly, YAML (YAML Ain’t Markup Language) is a human-friendly, parseable data format for which parsers exist in multiple languages. In other words, perfect for these test data scenarios. There are many, many awesome tutorials on it, so I will simply focus on how I used YAML to handle the scenario data. For the examples below, I am using YamlDotNet by Antoine Aubry. Though YAML.org identifies other .Net parsers, they have not been updated in quite some time. On top of that, YamlDotNet is very nice.

Here is the specification of a test scenario in YAML for example #4:

scenario 4 as yaml
---
    id: 4
    node-content-format: Name
    edge-content-format: Empty
    description: scenario to test parallel edges and self-loops including an orphan
    raw-nodes: 3
    raw-edges: 4
    expected-nodes: 3
    expected-edges: 4
    expected-self-loops: 2

    nodes:
        - id:    0
          content: (LEELA)

        - id:    1
          content: (FRY)

        - id:    2
          content: (BENDER)

    edges:
        - id:         0
          source-id:  0
          target-id:  1
          content:    ()

        - id:         1
          source-id:  1
          target-id:  0
          content:    ()

        - id:         2
          source-id:  1
          target-id:  1
          content:    ()

        - id:         3
          source-id:  2
          target-id:  2
          content:    ()

    expected-structure: [0,1,0]
                        [1,1,0]
                        [0,0,1]

    expected-edge-structure: []  [1] [0]
                             [1] []  []
                             [1] []  []

This test data was parsed into a Scenario class instance that was used to feed integration tests:

Scenario class structure
  public class Scenario
  {
    public Scenario()
    {
      Nodes = new List<NodeIndicator>();
      Edges = new List<EdgeIndicator>();
    }

    public int Id { get; set; }
    public string Description { get; set; }
    [YamlAlias("node-content-format")]
    public ScenarioNodeContentFormat NodeContentFormat { get; set; }
    [YamlAlias("edge-content-format")]
    public ScenarioEdgeContentFormat EdgeContentFormat { get; set; }
    public List<NodeIndicator> Nodes { get; set; }
    public List<EdgeIndicator> Edges { get; set; }
    [YamlAlias("raw-nodes")]
    public int NodeCountRaw { get; set; }
    [YamlAlias("raw-edges")]
    public int EdgeCountRaw { get; set; }
    [YamlAlias("expected-nodes")]
    public int NodeCountExpected { get; set; }
    [YamlAlias("expected-edges")]
    public int EdgeCountExpected { get; set; }
    [YamlAlias("expected-self-loops")]
    public int ExpectedSelfLoops { get; set; }
    [YamlAlias("expected-structure")]
    public string ExpectedStructure { get; set; }
    [YamlAlias("expected-edge-structure")]
    public string ExpectedEdgeStructure { get; set; }
  }

  //---------- 
  public class EdgeIndicator
  {
    [YamlAlias("source-id")]
    public int SourceId { get; set; }
    [YamlAlias("target-id")]
    public int TargetId { get; set; }
    public string Content { get; set; }
  }

  //---------- 
  public class NodeIndicator
  {
    public NodeIndicator()
    {
      Id = -1;
    }
    public int Id { get; set; }
    public string Content { get; set; }
  }

  //---------- 
  public enum ScenarioNodeContentFormat
  {
    Empty,
    Name,
  };

  //---------- 
  public enum ScenarioEdgeContentFormat
  {
    Empty,
    Name,
  };


Some explanations about the Scenario class:

  • NodeIndicator and EdgeIndicator are objects that ‘indicate’ how a node or edge are to be constructed or connected, not the nodes or edges themselves. It is up to the calling code that builds the graph to interpret this.
  • The Content property for each indicator is a string carrying whatever data is associated with a node or edge. For example, each node has a content value of a name surrounded by parentheses:
        - id:    0
          content: (LEELA)
  • The format of the node and edge Content field is specified by the NodeContentFormat and EdgeContentFormat properties (respectively)
  • The ExpectedStructure is a string representation of the expected structure of the graph once it is constructed. The string is a row-major representation of the graph as an adjacency matrix; for scenario 4 above, [0,1,0] [1,1,0] [0,0,1] encodes the edge 0 –> 1, the edge 1 –> 0 plus the self-loop on node 1, and the self-loop on node 2.
  • The ExpectedEdgeStructure is a string representation of the expected connectivity of the edges in the resulting graph. The format is as follows
    • Each row represents the edges incident to a node, starting at node index 0
    • The contents of each row are
      • First ‘[]’ is an array of edge ids for self-loops (self-loops are not included in the inbound or outbound arrays).
      • Second ‘[]’ is an array of edge ids for inbound edges
      • Third ‘[]’ is an array of edge ids for outbound edges
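
With the Scenario class in place, scenarios can be fed straight into parameterized tests. Here is a rough sketch; GraphBuilder, its Build() method, the graph’s NodeCount and EdgeCount properties, and the file paths are all hypothetical stand-ins for whatever is under test, and YamlScenarioReader is the reader shown in the NUnit Theory post above.

scenario-driven test (sketch)
[TestFixture]
public class GraphBuilderFixture
{
    // Placeholder scenario file paths.
    static string[] ScenarioFiles =
    {
        @"F:\GraphProject\DataFiles\Scenarios\scenario.001.yaml",
        @"F:\GraphProject\DataFiles\Scenarios\scenario.004.yaml",
    };

    [Test, TestCaseSource("ScenarioFiles")]
    public void Build_Matches_Expected_Counts(string fileName)
    {
        //Arrange
        var scenario = new YamlScenarioReader().ReadFile(fileName);

        //Act
        var graph = GraphBuilder.Build(scenario);

        //Assert
        Assert.AreEqual(scenario.NodeCountExpected, graph.NodeCount);
        Assert.AreEqual(scenario.EdgeCountExpected, graph.EdgeCount);
    }
}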

Now, if I need to add new data fields to support new tests, the only tricky thing is getting the YAML syntax correct in the data files and, of course, adding any new properties to the Scenario class. The parsing is handled by YamlDotNet; no brain surgery is required. If only I’d found YAML earlier.

Deploy Crawler to EC2 With Scrapyd

This post will walk through using scrapyd to deploy and manage a scrapy spider from a local linux instance to a remote EC2 linux instance. The steps below were run on a local Ubuntu 12.04 instance and one of the free-tier AWS Ubuntu 12.04 instances. This is another step in the development of the scraping project described in the first post, for which the intended goal is to have multiple instances of scrapy spiders crawling for extended periods of time and saving the items to a common database for further processing.

This post assumes you have scrapy and scrapyd installed on a local (linux) system, a working scrapy crawler, an AWS account, and have some knowledge of basic web security practices. Note: If you have not yet installed scrapy on your local system, when installing pip for scrapy, install the ‘setup tools’ and not the ‘distro tools’, as the setup tools will be needed by scrapyd for ‘egg-i-fying’ crawlers for deployment.

If you don’t have an AWS account then get one set up first and spend some time reading about IAM best practices. In particular, don’t use your main account credentials for development. Create a new user in your AWS account, assign that user admin rights, and do all development with that user. This separates your development credentials from your master account credentials (which have your credit card).

Setting up scrapy on a EC2 Ubuntu instance

These are the broad steps I took to get an Ubuntu Server 12.04 LTS (free tier) instance up and running with scrapy and scrapyd installed. They do not cover all the details. If this is your first time with AWS, there are plenty of docs and quality videos available – invest some time in them.

1) Log into your AWS account

2) Go to your EC2 Dashboard

3) Create a Security Group with the following inbound and outbound rules:

Inbound (for inbound ssh and scrapyd communication): image of ec2 inbound rules

Outbound (for outbound scrapyd and crawler http gets): image of ec2 outbound rules

4) Create a key pair for the ‘dev’ user and download it so you have access to it on your local system

5) Launch a linux instance and associate the security group and key pair with this instance, (we will use my-ec2.amazonaws.com as the public dns of the instance)

6) When the instance is running, connect to the EC2 instance over ssh using the key pair

7) Install scrapy and scrapyd on the EC2 instance

Prepare the crawler for deployment

Scrapyd deployment targets are specified in a crawler’s scrapy.cfg file. The scrapyd commands to deploy a crawler need to be run in the root directory of your crawler project (or use fully qualified paths).

1) On your local system, go to the root folder of your crawler project

2) Edit scrapy.cfg by replacing any deploy sections so that the file looks like the following:

scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/en/latest/topics/scrapyd.html

[settings]
default = projectX.settings

[deploy:local-target]
url = http://localhost:6800/
project = projectX

[deploy:aws-target]
url = http://my-ec2.amazonaws.com:6800/
project = projectX

Here we have added two scrapyd deployment targets, one to localhost and one to the EC2 instance.

3) Verify the targets with the command:

scrapyd-deploy -l

you should see

local-target           http://localhost:6800/
aws-target             http://my-ec2.amazonaws.com:6800/

Deploy the crawler to EC2

Make sure scrapyd is running on the target instance. For our EC2 instance we can check this by using one of the scrapyd web service commands, or by navigating to the scrapyd web page in a browser (format: my-ec2.amazonaws.com:6800). Note: after I had scrapyd installed on EC2, if I stopped and restarted the instance, scrapyd would run on startup. When I tried to run scrapyd I got an error “port 6800 in use”. If you get a similar error, check whether scrapyd is already running.

1) To deploy the spider to the EC2 target use the scrapyd command:

scrapyd-deploy aws-target -p projectX

you should see:

Packing version 1396800856    
    Deploying to project "projectX" in http://my-ec2.amazonaws.com:6800/addversion.json
    Server response (200):
    {"status": "ok", "project": "projectX", "version": "1396800856", "spiders":1}

2) Verify the deployment using the scrapyd web service:

scrapyd command
$> curl http://my-ec2.amazonaws.com:6800/listprojects.json

you should see:

{"status": ok, "projects": ["projectX"]}

or you can refresh the scrapyd web page on your remote instance and check Available projects

Schedule an instance of the crawler

1) Use the scrapyd web service to schedule the spider. Note: for the spiders I needed to run, each had multiple constructor parameters. To schedule a spider with constructor parameters, each parameter must be preceded by -d, and they must be in the same order as they appear in the spider constructor. For this example, the constructor parameters are session_id, seed_id, and seed_url. The particular spider to be used is spider2b:

scrapyd command
curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d session_id=33 -d seed_id=54 -d seed_url=http://www.blah.com/

you should see a response similar to:

{"status": "ok", "jobid": "c77c633ac28011e3ac0a0a32be087ede"}

2) Verify the job:

scrapyd command
curl http://my-ec2.amazonaws.com:6800/listjobs.json?project=projectX

will show:

{"status": "ok", "running": [{"id": "c77c633ac28011e3ac0a0a32be087ede", "spider": "spider2b"}], "finished": [], "pending": []}

or you can refresh the scrapyd webpage on your remote instance and check Jobs

3) To stop a job, use the following

scrapyd command
curl http://my-ec2.amazonaws.com:6800/cancel.json -d project=projectX -d job=c77c633ac28011e3ac0a0a32be087ede

you should see:

{"status": "ok", "prevstate": "running"}

Note that the schedule and cancel commands of the scrapyd web service are not immediate, so give them a bit to take effect. When getting started, the scrapyd web page on the EC2 instance is the easiest way of viewing the logs and output items.

Scrapy to MongoDB

Now that we can crawl some sites (based on the previous posts), we need to persist the data somewhere it can be retrieved and processed for analysis. Recall that the goal of the scraper was to harvest connections between domains in order to generate a model of a specific industry’s on-line presence. In Scrapy, Item Pipelines are the prescribed mechanism by which scraped items can be persisted to a data store. In this post, we will be using a custom pipeline extension for MongoDB, which we will further customize to check for duplicates. The code examples in this post build upon those in the previous examples. It is also assumed that you have some familiarity with MongoDB and have a working (basic) crawl spider. This post does not cover installing Scrapy, MongoDB, or any dependencies (there is plenty of good documentation on these subjects). Before continuing, you will need to install the following (in addition to having Scrapy working):

MongoDB
Scrapy-MongoDB, an item pipeline extension written by Sebastian Dahlgren

Once installed, the first step will be to get Scrapy-MongoDB working and saving to a collection. Once you’ve got MongoDB installed, create a database named scrapy and within it a collection named items. You can use the crawl spider from the previous posts and update the settings.py file to use Scrapy-MongoDB. A quick note about MongoDB ids: mongo will automatically add a unique id object to each item if an _id field is not specified. We will use the default id and benefit from it, because the default MongoDB id object contains an embedded date-time stamp. Once you have Scrapy-MongoDB working and are saving to a collection, we need to extend the MongoDBPipeline class to include behavior to check for duplicates. A duplicate of the example item is one in which the following fields match:

– session_id
– current_url
– referring_url

The session_id is used to group items from different harvests. We want to persist a single connection between two given urls for each harvest session. The current_url and referring_url will be used to represent the connection between two domains and will be used to generate a directed graph for the model, (a connection between two valid business backed domains can be used to infer a relationship which in turn can be used in social network analysis, more on this later …).

The new class will be called CustomMongoDBPipeline and should be placed within the pipelines.py file in the scrapy project folder. You can keep the default pipeline initialized with the project in the same file and switch back by changing the settings file.

pipelines.py
from datetime import datetime
from scrapy.exceptions import DropItem
from scrapy_mongodb import MongoDBPipeline

class CustomMongoDBPipeline(MongoDBPipeline):

    def process_item(self, item, spider):
        # the following lines are a duplication of MongoDBPipeline.process_item()
        if self.config['buffer']:
            self.current_item += 1
            item = dict(item)

            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = { 'ts': datetime.utcnow() }

            self.item_buffer.append(item)

            if self.current_item == self.config['buffer']:
                self.current_item = 0
                return self.insert_item(self.item_buffer, spider)
            else:
                return item

        # if the connection exists, don't save it
        matching_item = self.collection.find_one(
            {'session_id': item['session_id'],
             'referring_url': item['referring_url'],
             'current_url': item['current_url']}
        )
        if matching_item is not None:
            raise DropItem(
                "Duplicate found for %s, %s" %
                (item['referring_url'], item['current_url'])
            )
        else:
            return self.insert_item(item, spider)

The buffer-handling block at the top of process_item() is a duplication of the parent class MongoDBPipeline.process_item() method, kept to maintain compatibility with the Scrapy-MongoDB configuration options. The call to self.collection.find_one() is where we retrieve an entry that matches the specified fields. If an entry is found, the current item is dropped and we return to crawling. The settings are:

settings.py
BOT_NAME = 'farm2'

SPIDER_MODULES = ['farm2.spiders']
NEWSPIDER_MODULE = 'farm2.spiders'

#set to 0 for no depth limit
DEPTH_LIMIT = 3
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 4

ITEM_PIPELINES = {
    'farm2.pipelines.CustomMongoDBPipeline': 100
}

# 'scrapy_mongodb.MongoDBPipeline' connection
MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DATABASE = 'scrapy'
MONGODB_COLLECTION = 'items'

Recall you can set the session_id from the command line

scrapy crawl farm2 -a session_id=337

Scrapy After the Tutorials Part 2

As described in the previous post, the scraper will be crawling over extended periods of time. Filtering out links which do not need to be crawled is one way of improving performance: fewer links to crawl means fewer items to process. In this example, I hope to demonstrate how to filter the links to be crawled. So how does the scraper determine which links to follow, and how can we modify this behavior? Using step #7 in the Data Flow as reference:

The Spider processes the Response and returns scraped Items and new Requests (to follow) to the Engine.

So when a response is sent to the spider for processing, the spider will first extract all links in the response and send them to the engine to be scheduled for crawling. The spider rules control which links will be passed to the engine for crawling. Each rule uses a link extractor with which you can specify the links to follow or drop using the available parameters. Many of the basic examples of crawl spiders demonstrate the ‘allow’ and ‘deny’ parameters for the rules. But for this example, we need a bit more functionality.

By default, for each response processed by the spider, the set of urls sent to the engine for crawling will include every href tag in the response. To limit the number of links that will be followed, we can remove the following types of links:

– any urls to the same domain as the response.url
– any relative urls

To remove urls to the same domain, we will implement a filter_links() method for the crawl Rule. To remove relative urls, we will add a regex to the link extractor to limit all urls returned to fully qualified urls (e.g. starting with http: or https:). The Item has not changed from the last post.

Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from urlparse import urlparse
from farm1.items import Farm1Item

class Harvester2(CrawlSpider):
    name = 'Harvester2'
    session_id = -1
    response_url = ""
    start_urls = ["http://www.mmorpg.com"]
    rules = (
        Rule (
            SgmlLinkExtractor(
                allow=("((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)", ),),
            callback="parse_items",
            process_links="filter_links",
            follow= True),
    )

    def __init__(self, session_id=-1, *args, **kwargs):
        super(Harvester2, self).__init__(*args, **kwargs)
        self.session_id = session_id

    def parse_start_url(self, response):
        self.response_url = response.url
        return self.parse_items(response)

    def parse_items(self, response):
        self.response_url = response.url
        sel = Selector(response)
        items = []
        item = Farm1Item()
        item["session_id"] = self.session_id
        item["depth"] = response.meta["depth"]
        item["title"] = sel.xpath('//title/text()').extract()
        item["current_url"] = response.url
        referring_url = response.request.headers.get('Referer', None)
        item["referring_url"] = referring_url
        items.append(item)
        return items

    def filter_links(self, links):
        baseDomain = self.get_base_domain( self.response_url)
        filteredLinks = []
        for link in links:
            if link.url.find(baseDomain) < 0:
                filteredLinks.append(link)
        return filteredLinks

    def get_base_domain(self, url):
        base = urlparse(url).netloc
        if base.upper().startswith("WWW."):
            base = base[4:]
        elif base.upper().startswith("FTP."):
            base = base[4:]
        # drop any ports
        base = base.split(':')[0]
        return base

Note: you can add this spider to the previous project by dropping it in the same folder as the other spider. As long as the file name and spider name are different, it will run with the same item and settings.

The response_url class variable will maintain the current response.url and will be needed for filtering urls.

The parse_start_url() method overrides the base definition, is called only for the defined start_urls, and processes each start url as an item. Most importantly, it initializes the response_url variable before any crawling.

The parse_items() method has not changed from the last example.

The get_base_domain() method returns the base domain for a url. For example, if ‘http://www.b.com:334/x/y.htm?z’ is passed in, ‘b.com’ will be returned.

The filter_links() method filters the links passed in and only returns links to external domains. The base domain of the response_url is retrieved by the get_base_domain() method, and any links in the input list which contain this base domain are filtered out; only links to different (external) domains are returned. This method is assigned to the process_links parameter in the Rule.

The allow parameter for the link extractor is assigned the regular expression “((mailto:|(news|(ht|f)tp(s?))://){1}\S+)”, which precludes any relative urls by only allowing urls that start with a URI scheme (e.g. ftp, http, etc.).

When a response is sent to a spider for processing, the filter_links() method is called before the parse_items() callback. In this example, if the response_url variable is not set, the filtering will fail. This is the reason we needed the parse_start_url() method, which is called for each defined start_url and allows us to set the response_url variable before any links are processed by filter_links(). The result is that only fully qualified links to external domains are sent to the engine for scheduled crawling.

Scrapy After the Tutorials Part 1

I was given the task of building a scalable web scraper to harvest connections between domains of a specific industry in order to generate a network model of that industry’s online presence. The scraper will need to run for a period of time, (say a week), and the resulting harvest would be the raw data from which the model would be generated. This harvesting process will need to be repeatable to create ‘snapshots’ of the network for future longitudinal analysis. The implementation will use scrapy with a MongoDB back end on a linux platform to be run (ultimately) in AWS.

This post (and subsequent ones) will provide some code samples and documentation of issues faced during the process of getting the scraper operational. The intent is to help bridge the gap between the initial scrapy tutorials and real-world code. The examples assume you have scrapy installed and running, and have at least worked through the basic tutorials. I did find many of the tutorials on the wiki very helpful and worked through several of them, (multiple times). Get a basic crawl spider up and running and then pick up here.

In this example, I hope to demonstrate the following scrapy features:

-Adding a spider parameter and using it from the command line
-Getting current crawl depth and the referring url
-Setting crawl depth limits

The code samples below are from a scrapy project named farm1.

The item:

item
from scrapy.item import Item, Field

class Farm1Item(Item):
    session_id = Field()
    depth = Field()
    current_url = Field()
    referring_url = Field()
    title = Field()

session_id: a unique session id for each scrapy run or harvest
depth: the depth of the current page with respect to the start url
current_url: the url of the current page being processed
referring_url: the url of the site which was linked to the current page

The crawl spider:

spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from farm1.items import Farm1Item

class Harvester1(CrawlSpider):
    name = 'Harvester1'
    session_id = -1
    start_urls = ["http://www.example.com"]
    rules = ( Rule (SgmlLinkExtractor(allow=("", ),),
                callback="parse_items",  follow= True),
    )

    def __init__(self, session_id=-1, *args, **kwargs):
        super(Harvester1, self).__init__(*args, **kwargs)
        self.session_id = session_id

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        item = Farm1Item()
        item["session_id"] = self.session_id
        item["depth"] = response.meta["depth"]
        item["current_url"] = response.url
        referring_url = response.request.headers.get('Referer', None)
        item["referring_url"] = referring_url
        item["title"] = sel.xpath('//title/text()').extract()
        items.append(item)
        return items

The spider uses the SgmlLinkExtractor and follows every link, (a later post will cover filtering which links to follow).

Adding a spider parameter and using it from the command line

The spider’s constructor has a session_id parameter with a default assignment. The session id is assigned to the spider and persisted with each item processed, and will be used to identify items from different harvest sessions. Defining the parameter in the constructor allows using it from the command line:

x:~$ scrapy crawl Harvester1 -a session_id=337

Getting current crawl depth and the referring url

The current depth of the crawl is retrieved from response.meta["depth"]. The ‘depth’ keyword is one of several predefined keys in the meta dictionary of the request and response objects. The depth value will be used in the analysis phase.

The referring url is retrieved from the request headers. The goal of this project was harvesting connections between domains, not just data from individual pages. For each item processed, the connection between two links, referring_url –> current_url, is stored. Implied with each connection is the directionality of the link, which will help build a directed graph for analysis. This is enabled by default in the default middleware settings (and note that ‘Referer’ is the correct spelling for the header).

Setting crawl depth limits

Rather than using ctrl-c to kill the crawling spider, you can set a depth limit beyond which the spider will not go. I found this helpful during testing. DEPTH_LIMIT is a predefined setting that can be assigned in your settings file. This was the only additional setting used in this example other than the defaults created with the project.

DEPTH_LIMIT = 3

The depth limit can also be set in the command line, (as can all pre-defined settings):

x:~$ scrapy crawl Harvester1 -s DEPTH_LIMIT=2

The command line assignment will take priority over the settings file. Note that the depth limit will have little effect if you are running a broad crawl. Ultimately, when this scraper is released into the wild, it will probably be set to run a broad crawl. But it will still need to troll deep to really pull out much of a domain’s public connections. As to how this will be handled (a mix of broad + deep crawls), I am not sure, but it will be documented here once it is figured out.