CSV Reader

The CsvReader library is an extended version of Sébastien Lorion's fast CSV Reader project and provides fast parsing and reading of CSV files.

It is a straight drop-in replacement for the existing NuGet package LumenWorks.Framework.IO, but with additional capabilities. The other rationale for the project is that the original code is not available elsewhere in a public source repository, making it difficult to extend or contribute to.

Contributions from anyone are welcome.

You can see the version history here.

Build the project

  • Windows: Run build.cmd

I have my tools in C:\Tools, so I use:

    build.cmd Default tools=C:\Tools encoding=UTF-8

Library License

The library is available under the MIT License; for more information, see the License file in the GitHub repository.

Getting Started

A good starting point is to look at Sébastien's article on CodeProject.

A basic use of the reader looks something like this...

    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    void ReadCsv()
    {
        // open the file "data.csv" which is a CSV file with headers
        using (var csv = new CachedCsvReader(new StreamReader("data.csv"), true))
        {
            // Field headers will automatically be used as column names
            myDataGrid.DataSource = csv;
        }
    }
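
If you just want to stream the records without data binding, the plain CsvReader can be read record by record. A minimal sketch along the lines of Sébastien's article, assuming the standard CsvReader members (FieldCount, GetFieldHeaders, ReadNextRecord and the field indexer):

    using System;
    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    void DumpCsv()
    {
        // open the file "data.csv" which is a CSV file with headers
        using (var csv = new CsvReader(new StreamReader("data.csv"), true))
        {
            int fieldCount = csv.FieldCount;
            string[] headers = csv.GetFieldHeaders();

            // read each record and print every field as "header = value;"
            while (csv.ReadNextRecord())
            {
                for (int i = 0; i < fieldCount; i++)
                    Console.Write("{0} = {1};", headers[i], csv[i]);
                Console.WriteLine();
            }
        }
    }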

Having said that, there are some extensions built into this version of the library that are worth mentioning.

Additional Features

Columns

One addition is a Columns collection which holds the names and types of the data in the CSV file. If there are no headers present, the column names default to Column1, Column2, etc.; this can be overridden by setting the DefaultColumnHeader property, e.g.

    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    void ReadCsv()
    {
        // open the file "data.csv" which is a CSV file with headers
        using (var csv = new CachedCsvReader(new StreamReader("data.csv"), false))
        {
            csv.DefaultColumnHeader = "Fred"

            // Field headers will now be Fred1, Fred2, etc
            myDataGrid.DataSource = csv;
        }
    }

You can specify the columns yourself if there are none, and also specify the expected type of each; this is especially important when used with SqlBulkCopy, which we will come back to later.

    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    void ReadCsv()
    {
        // open the file "data.csv" which is a CSV file without headers
        using (var csv = new CachedCsvReader(new StreamReader("data.csv"), false))
        {
            csv.Columns.Add(new Column { Name = "PriceDate", Type = typeof(DateTime) });
            csv.Columns.Add(new Column { Name = "OpenPrice", Type = typeof(decimal) });
            csv.Columns.Add(new Column { Name = "HighPrice", Type = typeof(decimal) });
            csv.Columns.Add(new Column { Name = "LowPrice", Type = typeof(decimal) });
            csv.Columns.Add(new Column { Name = "ClosePrice", Type = typeof(decimal) });
            csv.Columns.Add(new Column { Name = "Volume", Type = typeof(int) });

            // Field headers will now be picked from the Columns collection
            myDataGrid.DataSource = csv;
        }
    }
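
Because the readers implement IDataReader, the declared column names and types also flow through to other ADO.NET consumers. Here is a rough sketch, assuming the column list matches the file exactly (DataTable.Load is standard ADO.NET, not part of this library):

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    DataTable LoadCsv(string fileName)
    {
        // open a headerless CSV file and describe its columns up front
        using (var csv = new CsvReader(new StreamReader(fileName), false))
        {
            csv.Columns = new List<Column>
            {
                new Column { Name = "PriceDate", Type = typeof(DateTime) },
                new Column { Name = "OpenPrice", Type = typeof(decimal) },
                new Column { Name = "HighPrice", Type = typeof(decimal) },
                new Column { Name = "LowPrice", Type = typeof(decimal) },
                new Column { Name = "ClosePrice", Type = typeof(decimal) },
                new Column { Name = "Volume", Type = typeof(int) },
            };

            // DataTable.Load accepts any IDataReader, so the names and types
            // declared above become the table's schema
            var table = new DataTable();
            table.Load(csv);
            return table;
        }
    }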

SQL Bulk Copy

One use of CSV Reader is to have a nice .NET way of using SQL Bulk Copy (SBC) rather than bcp for bulk loading of data into SQL Server.

A couple of issues arise when using SBC:

  1. SBC wants the data presented as the correct type rather than as a string.
  2. You need to map between the destination table columns and the CSV columns if the order does not match exactly.

Below is an example using the Columns collection to set up the correct metadata for SBC:

	using System;
	using System.Collections.Generic;
	using System.Data.SqlClient;
	using System.IO;
	using LumenWorks.Framework.IO.Csv;

	public void Import(string fileName, string connectionString)
	{
		using (var reader = new CsvReader(new StreamReader(fileName), false))
		{
			reader.Columns = new List<LumenWorks.Framework.IO.Csv.Column>
			{
				new LumenWorks.Framework.IO.Csv.Column { Name = "PriceDate", Type = typeof(DateTime) },
				new LumenWorks.Framework.IO.Csv.Column { Name = "OpenPrice", Type = typeof(decimal) },
				new LumenWorks.Framework.IO.Csv.Column { Name = "HighPrice", Type = typeof(decimal) },
				new LumenWorks.Framework.IO.Csv.Column { Name = "LowPrice", Type = typeof(decimal) },
				new LumenWorks.Framework.IO.Csv.Column { Name = "ClosePrice", Type = typeof(decimal) },
				new LumenWorks.Framework.IO.Csv.Column { Name = "Volume", Type = typeof(int) },
			};

			// Now use SQL Bulk Copy to move the data
			using (var sbc = new SqlBulkCopy(connectionString))
			{
				sbc.DestinationTableName = "dbo.DailyPrice";
				sbc.BatchSize = 1000;

				sbc.AddColumnMapping("PriceDate", "PriceDate");
				sbc.AddColumnMapping("OpenPrice", "OpenPrice");
				sbc.AddColumnMapping("HighPrice", "HighPrice");
				sbc.AddColumnMapping("LowPrice", "LowPrice");
				sbc.AddColumnMapping("ClosePrice", "ClosePrice");
				sbc.AddColumnMapping("Volume", "Volume");

				sbc.WriteToServer(reader);
			}
		}
	}

One other issue arose recently on a project where we wanted to use SBC, but some of the data was not in the file itself; it was metadata that needed to be included on every row. The solution was to amend the CSV reader and Columns collection to allow default values to be supplied for columns that are not present in the data itself.

The additional columns should be added at the end of the Columns collection to avoid interfering with the parsing; see the amended example below...

	public void Import(string fileName, string connectionString)
	{
		using (var reader = new CsvReader(new StreamReader(fileName), false))
		{
			reader.Columns = new List<LumenWorks.Framework.IO.Csv.Column>
			{
			    ...
				new LumenWorks.Framework.IO.Csv.Column { Name = "Volume", Type = typeof(int) },
				// NB Fake column so bulk import works
				new LumenWorks.Framework.IO.Csv.Column { Name = "Ticker", Type = typeof(string) },
			};

			// Fix up the column defaults with the values we need
			reader.UseColumnDefaults = true;
			reader.Columns[reader.GetFieldIndex("Ticker")].DefaultValue = Path.GetFileNameWithoutExtension(fileName);

			// Now use SQL Bulk Copy to move the data
			using (var sbc = new SqlBulkCopy(connectionString))
			{
				...
				sbc.AddColumnMapping("Ticker", "Ticker");

				sbc.WriteToServer(reader);
			}
		}
	}

To give an idea of performance, this took the naive sample app using an ORM from 2m 27s down to 1.37s using SBC, and the full import of 9.8m records took just over 11m.

Performance

One of the main reasons for using this library is its excellent performance when reading/parsing raw data; here's a recent run of the benchmark (which is included in the source):

Test pass #1 - All fields

CsvReader - No cache      : 3495429 ticks, 1.7940 sec., 24.5265 MB/sec.
CachedCsvReader - Run 1   : 6641089 ticks, 3.4084 sec., 12.9091 MB/sec.
CachedCsvReader - Run 2   : 4393 ticks, 0.0023 sec., 19515.3071 MB/sec.
TextFieldParser           : 36877894 ticks, 18.9270 sec., 2.3247 MB/sec.
Regex                     : 15011358 ticks, 7.7044 sec., 5.7111 MB/sec.

Test pass #1 - Field #72 (middle)

CsvReader - No cache      : 2085871 ticks, 1.0705 sec., 41.1007 MB/sec.
CachedCsvReader - Run 1   : 6205399 ticks, 3.1848 sec., 13.8155 MB/sec.
CachedCsvReader - Run 2   : 214 ticks, 0.0001 sec., 400610.9533 MB/sec.
TextFieldParser           : 36458115 ticks, 18.7116 sec., 2.3515 MB/sec.
Regex                     : 6976827 ticks, 3.5808 sec., 12.2879 MB/sec.


Test pass #2 - All fields

CsvReader - No cache      : 3431492 ticks, 1.7612 sec., 24.9835 MB/sec.
CachedCsvReader - Run 1   : 6110812 ticks, 3.1363 sec., 14.0294 MB/sec.
CachedCsvReader - Run 2   : 173 ticks, 0.0001 sec., 495553.4335 MB/sec.
TextFieldParser           : 36671647 ticks, 18.8212 sec., 2.3378 MB/sec.
Regex                     : 15064341 ticks, 7.7315 sec., 5.6910 MB/sec.

Test pass #2 - Field #72 (middle)

CsvReader - No cache      : 2162568 ticks, 1.1099 sec., 39.6430 MB/sec.
CachedCsvReader - Run 1   : 5135074 ticks, 2.6355 sec., 16.6951 MB/sec.
CachedCsvReader - Run 2   : 220 ticks, 0.0001 sec., 389685.2000 MB/sec.
TextFieldParser           : 36913575 ticks, 18.9453 sec., 2.3225 MB/sec.
Regex                     : 7107108 ticks, 3.6476 sec., 12.0627 MB/sec.


Test pass #3 - All fields

CsvReader - No cache      : 3552781 ticks, 1.8234 sec., 24.1306 MB/sec.
CachedCsvReader - Run 1   : 5668694 ticks, 2.9094 sec., 15.1235 MB/sec.
CachedCsvReader - Run 2   : 186 ticks, 0.0001 sec., 460917.9785 MB/sec.
TextFieldParser           : 36650220 ticks, 18.8102 sec., 2.3392 MB/sec.
Regex                     : 15108079 ticks, 7.7540 sec., 5.6745 MB/sec.

Test pass #3 - Field #72 (middle)

CsvReader - No cache      : 2212999 ticks, 1.1358 sec., 38.7396 MB/sec.
CachedCsvReader - Run 1   : 5246701 ticks, 2.6928 sec., 16.3399 MB/sec.
CachedCsvReader - Run 2   : 214 ticks, 0.0001 sec., 400610.9533 MB/sec.
TextFieldParser           : 36718316 ticks, 18.8451 sec., 2.3348 MB/sec.
Regex                     : 7049832 ticks, 3.6182 sec., 12.1607 MB/sec.


Done

This was run on a high-spec machine (Xeon E5-2620, 16 GB RAM and a 512 GB SSD; you have to have some toys!), so the overall throughput is good, but note that CsvReader performs at 10x the speed of TextFieldParser and 5x faster than Regex.
