Parsing large XML files

Today I had to parse large XML files in several ways: reading the value of an attribute, counting elements, and extracting all elements together with their contents.

There are various ways of parsing XML files in .NET: DataSet, XmlSerializer, XPathDocument, XmlDocument, XDocument and XmlTextReader. Because my XML files could be large, I did not want to load them completely into memory before parsing them.

That makes XmlTextReader the natural choice: it is a forward-only reader that lets you advance through nodes and attributes without having to load the whole file.
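
To make the forward-only idea concrete, here is a minimal sketch that is not part of my project code. It pulls nodes one at a time and only remembers the element names. It uses XmlReader.Create, which Microsoft has recommended over constructing XmlTextReader directly since .NET 2.0. The class name StreamingDemo, the method ElementNames and the inline sample XML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StreamingDemo
{
    // Returns the names of all start elements in document order.
    // Nodes are pulled one at a time; no document tree is built in memory.
    public static List<string> ElementNames(TextReader input)
    {
        var names = new List<string>();
        using (var reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element) names.Add(reader.Name);
            }
        }
        return names;
    }

    public static void Main()
    {
        // Hypothetical sample input, kept tiny on purpose.
        const string xml = "<personImport batchType=\"PmH\"><persons><person/></persons></personImport>";
        foreach (var name in ElementNames(new StringReader(xml))) Console.WriteLine(name);
    }
}
```

For a real file you would pass a file path or stream to XmlReader.Create instead of a StringReader; the loop stays the same.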

For simplicity, assume that my XML file looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<personImport batchType="PmH">
    <persons>
        <person>
            <senderData>
                <name>Lincoln</name>
                <fileCreationDate>25/09/2012</fileCreationDate>
            </senderData>
            <personData>
                <lastName>Steward</lastName>
                <firstName>Michael</firstName>
            </personData>
        </person>
        <person>
            <senderData>
                <name>Mercator</name>
                <fileCreationDate>25/09/2012</fileCreationDate>
            </senderData>
            <personData>
                <lastName>Miles</lastName>
                <firstName>David</firstName>
            </personData>
        </person>
    </persons>
</personImport>

The first thing I needed to do was to retrieve the batchType value from the root element. So I created an ImportFileTypeResolver to do the work:

public class ImportFileTypeResolver : IImportFileTypeResolver
{
    public ImportFileType Resolve(string filePath)
    {
        using (var reader = new XmlTextReader(filePath))
        {
            // Advance to the root element; only the start of the file is read.
            reader.ReadToFollowing("personImport");
            switch (reader.GetAttribute("batchType"))
            {
                case "PmH": return ImportFileType.Person;
                default: return ImportFileType.Unknown;
            }
        }
    }
}
 
As you can see, there are a number of convenience methods, such as ReadToFollowing and GetAttribute, that make it very easy to advance to the next element and read its attributes.
 
The next thing was to return the number of person elements, so I created a PersonImportFileJobCountResolver:
 
public class PersonImportFileJobCountResolver : IImportFileJobCountResolver
{
    public int GetNumberOfJobs(string filePath)
    {
        using (var reader = new XmlTextReader(filePath))
        {
            var nodeCount = 0;
            // ReadToFollowing returns false once the end of the file is reached.
            while (reader.ReadToFollowing("person")) nodeCount++;
            return nodeCount;
        }
    }
}
 
Again, all I needed to do was call ReadToFollowing until the end of the file is reached, incrementing the counter on each iteration. ReadToFollowing matches the element name exactly, so the surrounding persons element is not counted.
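
If you want to check the counting loop without a file on disk, a small sketch along these lines should do. The class name CountDemo, the method CountElements and the inline XML are made up for illustration, and I use XmlReader.Create over a StringReader instead of XmlTextReader over a file path.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CountDemo
{
    // Counts start elements whose qualified name equals elementName.
    public static int CountElements(TextReader input, string elementName)
    {
        using (var reader = XmlReader.Create(input))
        {
            var count = 0;
            while (reader.ReadToFollowing(elementName)) count++;
            return count;
        }
    }

    public static void Main()
    {
        // Hypothetical input without any whitespace between elements.
        const string xml =
            "<personImport><persons>" +
            "<person><name>A</name></person><person><name>B</name></person>" +
            "</persons></personImport>";
        Console.WriteLine(CountElements(new StringReader(xml), "person")); // prints 2
    }
}
```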
 
The third thing I wanted to do was to retrieve all person elements as Person objects. To do that I created a PersonImportBatchFileExtractor as follows:
 
public interface IImportBatchFileExtractor<TEntity> : IDisposable
{
    TEntity ExtractNext();
}

 
// Needs: using System; using System.Globalization; using System.IO;
//        using System.Xml; using System.Xml.Linq; using System.Xml.XPath;
public class PersonImportBatchFileExtractor : IImportBatchFileExtractor<PersonImportBatchFileExtractor.Person>
{
    private XmlReader _xmlReader;

    public Person ExtractFirst(string personImportBatchFilePath)
    {
        _xmlReader = new XmlTextReader(personImportBatchFilePath);
        return ExtractNext();
    }

    public Person ExtractFirst(XDocument data)
    {
        var stream = new MemoryStream();
        data.Save(stream);
        stream.Position = 0;
        _xmlReader = XmlReader.Create(stream);
        return ExtractNext();
    }

    public Person ExtractNext()
    {
        if (_xmlReader == null) throw new ApplicationException("Call ExtractFirst before calling ExtractNext");

        // ReadOuterXml (below) leaves the reader on the node after </person>. If that node is
        // already the next <person> start tag (no whitespace in between), calling
        // ReadToFollowing would step over it, so only advance when we are not on one.
        var onPerson = _xmlReader.NodeType == XmlNodeType.Element && _xmlReader.Name == "person";
        if (!onPerson && !_xmlReader.ReadToFollowing("person")) return null;

        var xDocument = XDocument.Parse(_xmlReader.ReadOuterXml());
        if (xDocument.Root == null) throw new ApplicationException("Something went wrong during parsing a person");

        var name = xDocument.Root.XPathSelectElement("senderData/name").Value;
        var fileCreationDate = xDocument.Root.XPathSelectElement("senderData/fileCreationDate").Value;
        var lastName = xDocument.Root.XPathSelectElement("personData/lastName").Value;
        var firstName = xDocument.Root.XPathSelectElement("personData/firstName").Value;

        // The sample file uses dd/MM/yyyy; parse it explicitly so the result
        // does not depend on the culture of the machine running the import.
        var creationDate = DateTime.ParseExact(fileCreationDate, "dd/MM/yyyy", CultureInfo.InvariantCulture);

        // PersonData takes (firstName, lastName), in that order.
        return new Person(xDocument, new SenderData(name, creationDate), new PersonData(firstName, lastName));
    }

    public void Dispose()
    {
        if (_xmlReader != null) _xmlReader.Close();
    }

    public class Person
    {
        public Person(XDocument xmlRepresentation, SenderData senderData, PersonData personData)
        {
            XmlRepresentation = xmlRepresentation;
            SenderData = senderData;
            PersonData = personData;
        }
        public XDocument XmlRepresentation { get; set; }
        public SenderData SenderData { get; set; }
        public PersonData PersonData { get; set; }
    }

    public class SenderData
    {
        public SenderData(string name, DateTime fileCreationDate)
        {
            Name = name;
            FileCreationDate = fileCreationDate;
        }
        public string Name { get; set; }
        public DateTime FileCreationDate { get; set; }
    }

    public class PersonData
    {
        public PersonData(string firstName, string lastName)
        {
            FirstName = firstName;
            LastName = lastName;
        }
        public string FirstName { get; set; }
        public string LastName { get; set; }
    }
}

 
This simply looks for the next person element, creates an XDocument from it, and then uses XPath to find the elements within that person in order to build and return a Person object.
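
Combining ReadToFollowing with ReadOuterXml needs some care. ReadOuterXml leaves the reader on the node after the closing tag, and ReadToFollowing always reads at least one node before it tests for a match. In a file with no whitespace between person elements, the next start tag can therefore be stepped over. A commonly used alternative is to test the current node first and let XNode.ReadFrom consume the element. The sketch below shows that pattern under my own assumptions: the class name PersonStream, the method Elements and the inline XML are made up for illustration and are not part of the extractor above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class PersonStream
{
    // Lazily yields every element named elementName as an XElement.
    // Only one such element is materialized at a time.
    public static IEnumerable<XElement> Elements(TextReader input, string elementName)
    {
        using (var reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the reader on the
                    // node after it, so we must not call Read() again in this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Hypothetical input with adjacent person elements and no whitespace.
        const string xml =
            "<persons>" +
            "<person><lastName>Steward</lastName></person>" +
            "<person><lastName>Miles</lastName></person>" +
            "</persons>";
        foreach (var person in Elements(new StringReader(xml), "person"))
            Console.WriteLine(person.Element("lastName").Value);
    }
}
```

Because the method is an iterator, the reader is only disposed once the caller has finished enumerating, so consume the sequence in a foreach rather than keeping it around.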
 
Usage looks like this:

using (var personImportBatchFileExtractor = new PersonImportBatchFileExtractor())
{
    var person = personImportBatchFileExtractor.ExtractFirst(@"C:\temp\batchimport\pmh.xml");
    while (person != null)
    {
        // Process the current person here, then move on to the next one.
        person = personImportBatchFileExtractor.ExtractNext();
    }
}

 
