Sunday, July 16, 2006

Nothing Is Ever Easy

Ok, so I've been working on a little C# Windows application that creates SQL scripts to insert (or update) data in a database, based on an existing database and its contents.

To be honest, my boss wrote a similar utility in VB6 years ago, which we still use today. The problem is it doesn't understand some of the new SQL Server 2005 data types (like XML or varchar(max)).

I was simply going to update the existing application, but this proved problematic. Firstly, a colleague has had the source code checked out for months, and since he works in another city it's quite easy for him to ignore my email requests to check it back in (we don't allow multiple check-outs). Secondly, the code is a bit of a mess. It was never written that well, and it's been hacked about by several people (including me) since it was first written. Finally, it's always been a poor application anyway, with no user input validation or error handling, and it outputs its script to the 'current directory', which in Windows is god-knows-where-at-any-given-time-except-that-it-won't-be-where-you-think-it-is-oh-no-that-would-be-too-easy.

My other problems with it are that it can only output one script per table, and for some installs I'm writing (using InstallAware) I'd really like one script that inserts data for multiple tables. You also can't save your settings, which means you have to script every table individually, and set up the correct options (turn on IDENTITY_INSERT, truncate the table first, etc.) for each table every time.

After I looked at the list of problems and the state of the current code, I decided a re-write was in order, and I might as well do it in .NET. Besides, this gave me a chance to try out some development techniques and .NET classes that I haven't had much use for (or time to investigate) previously, such as asynchronous UI, StreamWriter etc.

I figured the application was pretty simple (the VB6 version certainly was) and this would be a relatively small job. Oh, how wrong I was.

Firstly, this was my first experience using the TableLayout control since Beta 1. It's a little bit different now, but mostly I had problems simply because I hadn't used the control much at all, and while it's great at what it does, getting it to do exactly what I wanted wasn't always intuitive. I did get my desired outcome with just a little effort though.
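
For what it's worth, the basic pattern I ended up with looks something like this (a minimal sketch from memory, with made-up control names - in the final .NET 2.0 bits the control is called TableLayoutPanel):

    using System.Windows.Forms;

    public class ScripterForm : Form
    {
        public ScripterForm()
        {
            // Two columns: the label column sizes to its content, the
            // input column absorbs any extra width when the dialog resizes.
            TableLayoutPanel layout = new TableLayoutPanel();
            layout.Dock = DockStyle.Fill;
            layout.ColumnCount = 2;
            layout.ColumnStyles.Add(new ColumnStyle(SizeType.AutoSize));
            layout.ColumnStyles.Add(new ColumnStyle(SizeType.Percent, 100F));

            Label serverLabel = new Label();
            serverLabel.Text = "Server:";
            serverLabel.AutoSize = true;

            TextBox serverName = new TextBox();
            serverName.Anchor = AnchorStyles.Left | AnchorStyles.Right;

            layout.Controls.Add(serverLabel, 0, 0); // column 0, row 0
            layout.Controls.Add(serverName, 1, 0);  // column 1, row 0

            this.Controls.Add(layout);
        }
    }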

Next, I had issues with serialisation. I've looked at .NET serialisation once before, but had several problems with it, and in the end my boss and I decided to implement our own form of serialisation in that project. Given this new utility wasn't exactly mission critical or urgent (although I did need it), I figured I could take the time to explore serialisation and get it right. Well, it turns out I didn't have much trouble making my classes (in particular a custom dictionary class using generics) compatible with binary serialisation. In fact, it pretty much just worked. Very cool.
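
To give you an idea of how little was involved, here's a cut-down sketch of the sort of class I mean (the names are illustrative, not my actual code):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.Serialization;
    using System.Runtime.Serialization.Formatters.Binary;

    // A custom generic dictionary. [Serializable] plus the serialisation
    // constructor is about all binary serialisation requires.
    [Serializable]
    public class TableSettings : Dictionary<string, string>
    {
        public TableSettings() { }

        // Dictionary<,> implements ISerializable, so a derived class must
        // supply this constructor for deserialisation to work.
        protected TableSettings(SerializationInfo info, StreamingContext context)
            : base(info, context) { }
    }

    public class Demo
    {
        public static void Main()
        {
            TableSettings settings = new TableSettings();
            settings["Orders"] = "IDENTITY_INSERT=ON";

            BinaryFormatter formatter = new BinaryFormatter();
            using (MemoryStream stream = new MemoryStream())
            {
                formatter.Serialize(stream, settings);
                stream.Position = 0;
                TableSettings copy = (TableSettings)formatter.Deserialize(stream);
                Console.WriteLine(copy["Orders"]); // IDENTITY_INSERT=ON
            }
        }
    }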

What sucked was the XML serialisation. For reasons I just couldn't explain, the generic KeyValuePair<> object kept writing out a KeyValuePair<KeyType,ValueType> node with no attributes or sub-nodes, which meant my data was never actually serialised properly. I couldn't get it to work no matter what I tried, even after adding specialised constructors and so on. I also had to add an ugly Add method overload, because I got compiler warnings saying a method with that signature was required at ALL levels of the inheritance hierarchy to support serialisation. I was not happy. In the end I managed to make it work by implementing System.Xml.Serialization.IXmlSerializable and manually serialising the required data. This worked, but by the time I got to this point I wasn't exactly delirious with success.
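
For anyone fighting the same battle, the shape of the work-around was roughly this (a sketch, not my exact code):

    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Schema;
    using System.Xml.Serialization;

    // XmlSerializer can't round-trip KeyValuePair<,>, so this dictionary
    // reads and writes its own entries by hand.
    public class XmlTableSettings : Dictionary<string, string>, IXmlSerializable
    {
        public XmlSchema GetSchema() { return null; }

        public void WriteXml(XmlWriter writer)
        {
            foreach (KeyValuePair<string, string> entry in this)
            {
                writer.WriteStartElement("Entry");
                writer.WriteAttributeString("Key", entry.Key);
                writer.WriteAttributeString("Value", entry.Value);
                writer.WriteEndElement();
            }
        }

        public void ReadXml(XmlReader reader)
        {
            bool isEmpty = reader.IsEmptyElement;
            reader.ReadStartElement(); // consume the wrapper element
            if (isEmpty) return;

            while (reader.MoveToContent() == XmlNodeType.Element)
            {
                this[reader.GetAttribute("Key")] = reader.GetAttribute("Value");
                reader.Read(); // move past each empty <Entry /> element
            }
            reader.ReadEndElement();
        }
    }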

Then I had a few problems (and still do) with memory usage. My new application can create scripts for multiple tables at once. All tables are scripted asynchronously and can be output to a separate script file per table or a single script file for all tables.

If each table is sent to a separate script file, then the data is written directly to a file stream as the script is built, and there are no problems. However, if a single script is being created, then the system scripts each table to a memory stream. When all the tables have been scripted, they are sorted based on foreign-key references (so referenced tables have their data inserted first), and then each script is written to the file in the order determined.
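
The ordering pass itself is nothing fancy - conceptually something like this (simplified; my real code carries more state per table):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // One table's buffered script plus the tables it references.
    public class TableScript
    {
        public string TableName;
        public List<string> ReferencedTables = new List<string>();
        public MemoryStream Buffer = new MemoryStream();
    }

    public static class ScriptOrderer
    {
        // Dependency-first ordering: keep emitting tables whose referenced
        // tables have already been emitted. (Self-references are ignored;
        // genuinely circular references would need special handling.)
        public static List<TableScript> Order(List<TableScript> scripts)
        {
            List<TableScript> ordered = new List<TableScript>();
            Dictionary<string, bool> emitted = new Dictionary<string, bool>();

            while (ordered.Count < scripts.Count)
            {
                bool progress = false;
                foreach (TableScript script in scripts)
                {
                    if (emitted.ContainsKey(script.TableName)) continue;

                    bool ready = true;
                    foreach (string parent in script.ReferencedTables)
                        if (parent != script.TableName && !emitted.ContainsKey(parent))
                            ready = false;

                    if (ready)
                    {
                        ordered.Add(script);
                        emitted[script.TableName] = true;
                        progress = true;
                    }
                }
                if (!progress)
                    throw new InvalidOperationException(
                        "Circular foreign-key references between tables.");
            }
            return ordered;
        }
    }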

The problem here is that tables with many or extremely large rows can generate very large scripts. On my development PC (an Acer P4 laptop running Windows XP Professional with 1GB of RAM) I get an OutOfMemoryException at around 160MB of data written to the stream. A similar problem seems to occur with very large strings, or when placing very large strings into the Text property of TextBox controls. The interesting thing is that I still have plenty of physical RAM left, not to mention swap-file space, so I must be hitting some internal limit. On top of this, I can quite easily catch and handle the OutOfMemoryException, which is something I've heard implied is quite hard to do properly. I can only presume this is because the MemoryStream object itself has some internal limit, or that object specifically is being denied more memory, rather than my application (which makes plenty of allocations after the exception is thrown/caught). In the end this isn't really a problem for us, since we won't be using this application to create scripts that big, but it is still annoying. I guess I could change the code so it uses temporary files and combines them all at the end, but that doesn't seem very cool. So far, I haven't found anything that mentions what memory limit I might be hitting or how to increase it. In fact, a number of documents I've found say the size of the stream should be limited only by the amount of memory available.
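
One theory that would fit the symptoms: MemoryStream keeps its data in a single contiguous byte array that doubles in size as it grows, so a 160MB stream can briefly demand a contiguous block of around 320MB of address space - something a 32-bit process can fail to find even with free RAM and swap to spare. Either way, if I do go the temporary-file route, the combining step would look something like this (untested sketch):

    using System.IO;

    // Script each table to its own temp file, then concatenate them into
    // the final script in foreign-key order, so no table's script ever
    // has to fit in memory.
    public static class ScriptCombiner
    {
        public static void Combine(string[] tempFilesInOrder, string outputPath)
        {
            byte[] buffer = new byte[64 * 1024];
            using (FileStream output = File.Create(outputPath))
            {
                foreach (string tempFile in tempFilesInOrder)
                {
                    using (FileStream input = File.OpenRead(tempFile))
                    {
                        int read;
                        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                            output.Write(buffer, 0, read);
                    }
                    File.Delete(tempFile);
                }
            }
        }
    }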

The most disappointing problem I had though was the difficulty in obtaining schema information from a database. By the time I got this far through the code, I actually needed to finish it quickly so I could use it to generate my install scripts. As a result I didn't spend a whole lot of time researching this area, so it's quite possible there's something simple I missed.

That being said, it wasn't as easy as it was in VB6. Our old code simply opened an ADODB.RecordSet containing the data to be scripted. From the RecordSet object we could access not only the data, but also some meta-data about the columns, such as names, maximum lengths, database data-types and so on.

At first I tried the same thing with the SqlDataReader in .NET. This seemed like a good choice since I only needed forward-only access to the data and the SqlDataReader is quite efficient. Unfortunately, I found I could pretty much only obtain the column names and values from the data reader: no maximum lengths or SQL data types, nor anything about foreign or primary keys.

I then discovered the GetSchemaTable method, but this didn't help either. It returns a DataTable with a row per column in the SqlDataReader, which isn't really a helpful format in my case. I'd really like the metadata available to me as I loop through the columns in the SqlDataReader; to do that with the DataTable, I have to 'find' the row for the column I'm working on each time I change columns. That seems inefficient and is more code than we needed in VB6. Besides which, many of the columns seemed to return nonsense data. Looking in the help, I found a comment stating that several of the (useful) columns in the DataTable always return zero (or some other useless value).
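
For completeness, reading the schema table looks roughly like this - note CommandBehavior.KeyInfo, which as far as I can tell you need before the key-related columns are populated at all (a sketch, with a hypothetical helper class):

    using System.Data;
    using System.Data.SqlClient;

    public static class ColumnSchema
    {
        // Pull per-column metadata and index it by ordinal, so the scripting
        // loop doesn't have to 'find' the right row for every column.
        public static DataRow[] Load(SqlConnection connection, string tableName)
        {
            using (SqlCommand command = new SqlCommand(
                "SELECT * FROM [" + tableName + "]", connection))
            using (SqlDataReader reader = command.ExecuteReader(
                CommandBehavior.SchemaOnly | CommandBehavior.KeyInfo))
            {
                DataTable schema = reader.GetSchemaTable();
                DataRow[] byOrdinal = new DataRow[schema.Rows.Count];
                foreach (DataRow row in schema.Rows)
                    byOrdinal[(int)row["ColumnOrdinal"]] = row;

                // Useful columns include ColumnName, ColumnSize, DataTypeName
                // and IsKey - though several others returned nonsense for me.
                return byOrdinal;
            }
        }
    }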

I should note I have since seen the GetProviderSpecificFieldType method, which might provide the database data type of a column, although I haven't tried it yet. I also don't know why I didn't notice it the first time round - domestic blindness, I guess.

I also needed foreign and primary key information, as well as the maximum length of each field (if it was a char, varchar, nvarchar, nchar, binary, varbinary, etc.). I couldn't find any good way of retrieving all this data using the .NET classes, so I ended up executing SQL statements against the system tables in SQL Server and caching the results into collections before I began the scripting process.
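
As a rough illustration of the caching approach (not my exact queries - the INFORMATION_SCHEMA views shown here are just one option that exists on both 2000 and 2005):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class SchemaCache
    {
        // Cache each character/binary column's maximum length up front.
        // Note: on SQL Server 2005, varchar(max) and friends report -1.
        public static Dictionary<string, int> LoadMaxLengths(
            SqlConnection connection, string tableName)
        {
            const string sql =
                "SELECT COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH " +
                "FROM INFORMATION_SCHEMA.COLUMNS " +
                "WHERE TABLE_NAME = @table " +
                "AND CHARACTER_MAXIMUM_LENGTH IS NOT NULL";

            Dictionary<string, int> lengths = new Dictionary<string, int>();
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@table", tableName);
                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                        lengths[reader.GetString(0)] = reader.GetInt32(1);
                }
            }
            return lengths;
        }
    }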

This works, and is quite fast, but isn't very cool. It's not cool partly because I don't like resorting to queries against SQL Server system objects (ever); it's a philosophical issue I have. Mostly, though, it's not cool because the application now only works against SQL Server (although hopefully I've got it 2000/2005 compatible - I haven't tested it against 2000 yet). The old VB6 application would work with just about any database you could access through ADODB, at least for the input.

.NET hadn't finished thwarting me yet, either. My code uses a StreamWriter object to output the script data to either an underlying FileStream or MemoryStream. In the case of a MemoryStream, I wanted to leave the stream open until all of the scripts were completed (so I could read the contents back and output them to a file stream), but I had no further use for the StreamWriter object. You'd think I could just keep a pointer to the MemoryStream and manually dispose/close the StreamWriter, or even let the StreamWriter go out of scope. Not so. The StreamWriter class has an unwanted behaviour in that it closes/disposes its underlying stream when it is disposed, and since it correctly implements the dispose pattern, this also occurs when it is garbage collected. The end result was that I had to keep a pointer to the StreamWriter object and leave both it and its underlying stream in memory until I was done. This sucks. What's more, I think this is incorrect behaviour in terms of the dispose pattern, since one of the rules is that you should never dispose an object you don't own - a StreamWriter (in my opinion) does not own the stream it is asked to write to.
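
For what it's worth, a pass-through stream that swallows Close would also side-step the problem - something like this untested sketch:

    using System.IO;

    // Wraps a stream and forwards everything except Close/Dispose, so a
    // StreamWriter over it can be disposed without killing the real stream.
    public class NonClosingStream : Stream
    {
        private Stream _inner;
        public NonClosingStream(Stream inner) { _inner = inner; }

        public override bool CanRead { get { return _inner.CanRead; } }
        public override bool CanSeek { get { return _inner.CanSeek; } }
        public override bool CanWrite { get { return _inner.CanWrite; } }
        public override long Length { get { return _inner.Length; } }
        public override long Position
        {
            get { return _inner.Position; }
            set { _inner.Position = value; }
        }

        public override void Flush() { _inner.Flush(); }
        public override int Read(byte[] buffer, int offset, int count)
        { return _inner.Read(buffer, offset, count); }
        public override long Seek(long offset, SeekOrigin origin)
        { return _inner.Seek(offset, origin); }
        public override void SetLength(long value) { _inner.SetLength(value); }
        public override void Write(byte[] buffer, int offset, int count)
        { _inner.Write(buffer, offset, count); }

        // Deliberately does NOT close the inner stream.
        public override void Close() { _inner.Flush(); }
    }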

The bottom line is the whole thing isn't cool because it was just easier in VB6.

Now, to be fair, C# does make some things easier. The TableLayout control made the dialog resizing easier, even if I did have some problems with it. The threading objects in .NET made the asynchronous UI and script creation much easier than in VB6. The StreamWriter and various stream objects, combined with inheritance and polymorphism, made outputting scripts in various encodings and to various places easier than in VB6 as well. Of course, there are hundreds of other things that I, and everyone else, use daily (or less often) in C# that are so much better or easier than they were in VB6. In fact, I love .NET (and even C#, despite semi-colons and case-sensitivity).

It's just a shame things that used to be simple, aren't always as easy now.

Anyway, version 1 of the application is finished. When I finally get some web space and a domain sorted for Yortsoft I'll post the application as freeware. It isn't perfect, but it's good enough for most things. In the meantime I'll keep plugging away at the issues that remain. Here's looking forward to version 2 :)

PS: The Table Data Scripter v1.0 is now available for download here.
