GZip EmbeddedResource in MSBuild with RoslynCodeTaskFactory
TLDR: See title and github / git
Versioned package and dependency management is key in reproducible software and
machine learning development. There are many ways to handle digital assets for
this including very big and complicated cloud infrastructure. I prefer to keep
things simple and as much as possible keep these assets as simple text files in
a git repository and distribute them via package management. For example, for
machine learning, keep annotation data in simple csv-files.
For .NET, a simple way to do this is to embed these assets into an assembly and distribute it via NuGet or similar package feeds like Azure Artifacts.
Approach
The basic approach is to have a single git repository for one or more digital assets (e.g. text files with an image file name and a corresponding set of labels for machine learning). A simple .NET class library then embeds these assets and perhaps exposes a few properties on a type in the assembly, giving easy access to the assets and information about them.
This approach has the following benefits:
- Assets are kept safe and versioned in git.
- Normal pull request procedures can be used to review and detail changes to the assets, e.g. when ground truth data has been re-annotated to remove incorrectly annotated data.
- Assets are readily available in other projects as versioned NuGet packages, facilitating simple git clone and dotnet run usage, for example for reproducibly building software and running machine learning training.
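The "properties on a type" idea from the approach above could look something like the following minimal sketch. The class name Assets and its members are hypothetical and not taken from the example repository; the sketch only relies on the standard reflection calls GetManifestResourceNames and GetManifestResourceStream.

```csharp
using System;
using System.IO;
using System.Reflection;

// Hypothetical helper type that would live in the assets class library.
public static class Assets
{
    static readonly Assembly s_assembly = typeof(Assets).Assembly;

    // Names of all resources embedded in this assembly.
    public static string[] Names
    {
        get { return s_assembly.GetManifestResourceNames(); }
    }

    // Opens the first embedded resource whose name ends with the given file name.
    public static Stream Open(string fileName)
    {
        foreach (string name in s_assembly.GetManifestResourceNames())
        {
            if (name.EndsWith(fileName, StringComparison.Ordinal))
            {
                return s_assembly.GetManifestResourceStream(name);
            }
        }
        throw new FileNotFoundException("No embedded resource ends with: " + fileName);
    }
}
```

Matching on the end of the resource name avoids hard-coding the namespace prefix that the build normally adds to embedded resource names.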
Problem
There is, however, one problem with this. Like many machine learning
practitioners, you may rely on “manually”-annotated data that is updated
many times (hundreds or thousands of times), constantly expanding the ground
truth data or re-annotating it to make the data as clean and as complete as
possible. Since the data stored in text format can get quite large (e.g. 1 GB)
and you have lots of versions, the local nuget cache, e.g.
C:\Users\<USERNAME>\.nuget\packages, can quickly consume 10s or 100s of GB.
Not great if you are on a space-limited laptop or similar. Saving disk space
would be nice.
The nuget package itself is just a zip-file, so its size is not the problem
as such. The cached unpacked package is. An example from a “model”/build
server is shown below, where the cache is about 75 GB.
Solution
Fortunately, this can be done in a simple way by using MSBuild to GZip each
individual EmbeddedResource file.
Now MSBuild already has a ZipDirectory task, but as the apt name implies this
will zip an entire directory into a single file. This is less than ideal since
you won’t be able to easily inspect an assembly with ildasm, for example, to
see which assets are embedded. It also makes it a bit harder to get the
individual files.
GZip’ing EmbeddedResource files can be done quite easily, as shown in the
csproj file GZipEmbeddedResource.csproj below, taken from the github example
repository GZipEmbeddedResource.
This example uses the new condensed csproj format, but the approach works in
the old-style csproj format too. It works in both .NET Framework and .NET Core.
The example uses a rather large example text file, enwik8, from
http://mattmahoney.net/dc/textdata.html, consisting of ~100 MB of XML text.
As can be seen, the files that you want to embed are defined as you normally
would via EmbeddedResource. Following that, a new GZip inline task is defined
using the RoslynCodeTaskFactory.
This defines two parameters, Files and Result, which are both arrays of
ITaskItem, which in this context are simply files or rather file paths. Files
is the set of files we want to zip, and Result will contain the paths of
these, where .gz has been suffixed to the path. Zipped files are written to
disk but ignored by the .gitignore file; this allows inspecting the zipped
files if needed.
This task is then used in BeforeBuild to zip all EmbeddedResource files and
replace this set of file paths with the zipped file paths, so only the zipped
files are embedded. Every time you build the project and something has
changed the files will be zipped. This can take a while for large files, but
since this is a project only for these assets it is rarely a problem.
<Project>
  <Import Project="Sdk.props" Sdk="Microsoft.NET.Sdk" />

  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <!-- Test file from: http://mattmahoney.net/dc/textdata.html -->
    <EmbeddedResource Include="enwik8" />
  </ItemGroup>

  <Import Project="Sdk.targets" Sdk="Microsoft.NET.Sdk" />

  <!-- Inline task that GZips each input file to "<file>.gz" next to the source -->
  <UsingTask TaskName="GZip"
             TaskFactory="RoslynCodeTaskFactory"
             AssemblyFile="$(MSBuildToolsPath)\Microsoft.Build.Tasks.Core.dll">
    <ParameterGroup>
      <Files ParameterType="Microsoft.Build.Framework.ITaskItem[]"
             Required="true" />
      <Result ParameterType="Microsoft.Build.Framework.ITaskItem[]"
              Output="true" />
    </ParameterGroup>
    <Task>
      <Using Namespace="System.IO" />
      <Using Namespace="System.IO.Compression" />
      <Code Type="Fragment" Language="cs">
<![CDATA[
if (Files.Length > 0)
{
    Result = new TaskItem[Files.Length];
    for (int i = 0; i < Files.Length; i++)
    {
        ITaskItem item = Files[i];
        string sourcePath = item.GetMetadata("FullPath");
        string sourceItemSpec = item.ItemSpec;
        string destinationSuffix = ".gz";
        string destinationPath = sourcePath + destinationSuffix;
        string destinationItemSpec = sourceItemSpec + destinationSuffix;

        Log.LogMessage(MessageImportance.Normal,
            "EmbeddedResource Src : " + sourceItemSpec);

        // File.Create truncates an existing .gz file. File.OpenWrite would
        // leave stale trailing bytes if the new output is shorter.
        using (var sourceStream = File.OpenRead(sourcePath))
        using (var destinationStream = File.Create(destinationPath))
        using (var destinationGZip = new GZipStream(destinationStream,
            CompressionLevel.Optimal))
        {
            sourceStream.CopyTo(destinationGZip);
        }

        var destinationItem = new TaskItem(destinationItemSpec);

        Log.LogMessage(MessageImportance.Normal,
            "EmbeddedResource GZip: " + destinationItem.ItemSpec);

        Result[i] = destinationItem;
    }
}
]]>
      </Code>
    </Task>
  </UsingTask>

  <!-- Defined after the Sdk.targets import so it replaces the default empty BeforeBuild target -->
  <Target Name="BeforeBuild">
    <GZip Files="@(EmbeddedResource)">
      <Output ItemName="GZipEmbeddedResource" TaskParameter="Result" />
    </GZip>
    <Message Text="Source EmbeddedResources: @(EmbeddedResource)" Importance="High" />
    <Message Text="GZipped EmbeddedResources: @(GZipEmbeddedResource)" Importance="High" />
    <!-- Swap the original items for the zipped ones so only the .gz files are embedded -->
    <ItemGroup>
      <EmbeddedResource Remove="@(EmbeddedResource)" />
      <EmbeddedResource Include="@(GZipEmbeddedResource)" />
    </ItemGroup>
  </Target>
</Project>
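On the consuming side, the embedded file now has to be decompressed when it is read. The following is a minimal sketch of how that might be done. The class name GZipResourceReader is hypothetical, and it assumes the asset is UTF-8 text and that the resource name ends with the file name plus .gz (with default settings the name would typically also carry the project's root namespace as a prefix, which is why the sketch matches on the suffix only).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Reflection;
using System.Text;

// Hypothetical helper for reading GZipped embedded resources as text.
public static class GZipResourceReader
{
    // Decompresses a GZip stream and decodes the content as UTF-8 text.
    public static string ReadAllText(Stream gzipped)
    {
        using (var gzip = new GZipStream(gzipped, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Finds the embedded resource ending with fileName + ".gz" and reads it as text.
    public static string ReadAllText(Assembly assembly, string fileName)
    {
        string suffix = fileName + ".gz";
        foreach (string name in assembly.GetManifestResourceNames())
        {
            if (name.EndsWith(suffix, StringComparison.Ordinal))
            {
                using (Stream stream = assembly.GetManifestResourceStream(name))
                {
                    return ReadAllText(stream);
                }
            }
        }
        throw new FileNotFoundException("No embedded resource ends with: " + suffix);
    }
}
```

For very large assets it would probably be better to return the GZipStream itself and process it line by line, rather than materializing the whole text as one string.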
Example Build
When building with e.g. dotnet build, this will output the following, where
the list of files that have been GZipped is logged.
Microsoft (R) Build Engine version 16.7.0+7fb82e5b2 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.
Determining projects to restore...
Restored C:\git\oss\GZipEmbeddedResource\src\GZipEmbeddedResource\GZipEmbeddedResource.csproj (in 199 ms).
Restored C:\git\oss\GZipEmbeddedResource\src\GZipEmbeddedResource.Test\GZipEmbeddedResource.Test.csproj (in 237 ms).
Source EmbeddedResources: enwik8
GZipped EmbeddedResources: enwik8.gz
GZipEmbeddedResource -> C:\git\oss\GZipEmbeddedResource\build\Libs_AnyCPU_Debug\GZipEmbeddedResource.dll
GZipEmbeddedResource.Test -> C:\git\oss\GZipEmbeddedResource\build\Tests\GZipEmbeddedResource.Test_AnyCPU_Debug\GZipEmbeddedResource.Test.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Doing this for enwik8 cuts the file size from ~100 MB to ~37 MB and hence
saves about 63%. However, for machine learning csv-files, which contain mainly
ASCII text consisting of alphanumeric characters, it is very common to see
90-97% savings.
Length Name
------ ----
101128023 enwik8
37040314 enwik8.gz
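The savings figure is simply one minus the ratio of compressed to original size. As a rough sketch for checking your own assets before committing to this setup, the hypothetical helper below (not part of the example repository) computes that fraction and measures the GZipped size of a byte array in memory.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for estimating how much GZip saves on a given asset.
public static class CompressionSavings
{
    // Fraction of bytes saved: 0.25 means the compressed file is 25% smaller.
    public static double Fraction(long originalBytes, long compressedBytes)
    {
        if (originalBytes <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(originalBytes));
        }
        return 1.0 - (double)compressedBytes / originalBytes;
    }

    // GZips the data in memory and returns the compressed size in bytes.
    public static long GZippedLength(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer can be measured after the GZip stream is flushed.
            using (var gzip = new GZipStream(buffer, CompressionLevel.Optimal, true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return buffer.Length;
        }
    }
}
```

With the sizes listed above, CompressionSavings.Fraction(101128023, 37040314) comes out at roughly 0.63, matching the 63% quoted.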
The dll in question will contain this file and hence have about the same size.
Length Name
------ ----
37046272 GZipEmbeddedResource.dll
Viewing this in ildasm, you can easily see the embedded file and its size.
One thing to keep in mind here is that the embedded text files will have the line endings that existed at the time of embedding, so this might change based on the git checkout policy and the OS the project is built on.
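If consumers need identical text regardless of where the package was built, one option is to normalize line endings when reading the asset. The helper below is a hypothetical sketch of that idea, not something the example repository does.

```csharp
// Hypothetical helper that makes embedded text independent of checkout line endings.
public static class LineEndings
{
    // Converts Windows (\r\n) and lone carriage return (\r) line endings to \n.
    public static string Normalize(string text)
    {
        return text.Replace("\r\n", "\n").Replace("\r", "\n");
    }
}
```

Whether this is appropriate depends on the asset; for data where the exact bytes matter it is likely better to fix the line endings at checkout time instead.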
That’s it. While you could define this task as a nuget package, keeping it as simple text in the project file allows easy customization like filtering or similar.