Format-agnostic data ingestion with Loaders and EventStores
How to Load Data
This tutorial introduces the Loader pattern and EventStores - the foundation for bringing music data into TimeToAlign!
Learning Objectives:
- Use Loaders to ingest music data from various formats
- Navigate EventStores and access event data
- Understand the harmonized schema that unifies different data sources

Prerequisites:
- Basic Python and pandas knowledge
- TimeToAlign! installed (pip install -e . from repository root)
Why Loaders Matter
Music data comes in many formats: MusicXML, MIDI, MEI, Humdrum, proprietary TSV exports, and more. Each format has its own structure, terminology, and quirks.
The problem: Without a unified approach, you’d need format-specific code for every data source, making cross-format analysis difficult and error-prone.
The TimeToAlign! solution: Loaders normalize heterogeneous formats into a consistent EventStore, enabling downstream processing without format-specific code.
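The normalization idea can be sketched with a toy example. The two classes below are hypothetical stand-ins, not TimeToAlign! loaders: they read inline text instead of a file path, and their `store` is a plain list of dicts rather than an EventStore. The point is only that two differently shaped inputs end up as identical records.

```python
import csv
import io
import json


class ToyCSVLoader:
    """Hypothetical loader: reads comma-separated rows with name,duration columns."""

    def __init__(self):
        self.store = []

    def load(self, text):
        for row in csv.DictReader(io.StringIO(text)):
            self.store.append(
                {"event_type": "Note", "name": row["name"], "duration": float(row["duration"])}
            )
        return self


class ToyJSONLoader:
    """Hypothetical loader: reads a JSON list whose keys are named differently."""

    def __init__(self):
        self.store = []

    def load(self, text):
        for item in json.loads(text):
            self.store.append(
                {"event_type": "Note", "name": item["pitch"], "duration": float(item["len"])}
            )
        return self


csv_text = "name,duration\nB3,0.5\nE2,0.25\n"
json_text = '[{"pitch": "B3", "len": 0.5}, {"pitch": "E2", "len": 0.25}]'

csv_store = ToyCSVLoader().load(csv_text).store
json_store = ToyJSONLoader().load(json_text).store

# Downstream code sees the same records regardless of the input format
print(csv_store == json_store)
```

Any analysis written against the shared record layout works for both inputs without knowing which format they came from.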
All TimeToAlign! loaders follow the same three-step pattern:
1. Create a loader instance
2. Load a file using .load(path)
3. Access the store containing EventStores
Let’s see this in action with three different loaders, all loading the same Chopin piece:
```python
# Load from three different sources
tsv_loader = TSVLoader()
tsv_loader.load(CHOPIN_TSV)

partitura_loader = PartituraLoader()
partitura_loader.load(CHOPIN_XML)

music21_loader = Music21Loader()
music21_loader.load(CHOPIN_XML)

# All produce ScoreStores
{
    "TSV": type(tsv_loader.store).__name__,
    "Partitura": type(partitura_loader.store).__name__,
    "Music21": type(music21_loader.store).__name__,
}
```
One of the key benefits of TimeToAlign! is that different loaders produce comparable output. Let’s verify that all three loaders found the same number of notes:
```python
import pandas as pd

# Convert to DataFrames
tsv_df = tsv_loader.store.notes.to_dataframe()
partitura_df = partitura_loader.store.notes.to_dataframe()
music21_df = music21_loader.store.notes.to_dataframe()

# Count only Note events (not rests or other event types)
counts = {
    "TSV": len(tsv_df[tsv_df["event_type"] == "Note"]),
    "Partitura": len(partitura_df[partitura_df["event_type"] == "Note"]),
    "Music21": len(music21_df[music21_df["event_type"] == "Note"]),
}

# Validate against gold standard
assert all(c == 498 for c in counts.values()), f"Note count mismatch: {counts}"
pd.Series(counts, name="note_count")
```
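The same check can be wrapped in a small reusable helper. This is a minimal sketch: `count_notes` and `check_note_counts` are hypothetical names, and the toy DataFrames stand in for the output of `.to_dataframe()`, keeping only the `event_type` column.

```python
import pandas as pd


def count_notes(df: pd.DataFrame) -> int:
    """Count the rows whose event_type is "Note"."""
    return int((df["event_type"] == "Note").sum())


def check_note_counts(frames: dict, expected: int) -> pd.Series:
    """Compare the note count of every labelled DataFrame against an expected value."""
    counts = {label: count_notes(df) for label, df in frames.items()}
    mismatched = {label: n for label, n in counts.items() if n != expected}
    if mismatched:
        raise ValueError(f"Note count mismatch (expected {expected}): {mismatched}")
    return pd.Series(counts, name="note_count")


# Toy stand-ins for the DataFrames produced by different loaders
toy_a = pd.DataFrame({"event_type": ["Note", "Note", "Rest"]})
toy_b = pd.DataFrame({"event_type": ["Note", "Note"]})

note_counts = check_note_counts({"A": toy_a, "B": toy_b}, expected=2)
print(note_counts)
```

Raising an exception that names the offending sources makes a failed comparison easier to diagnose than a bare assertion.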
Each ScoreStore contains EventStores - efficient, PyArrow-backed tables that hold musical events.
Key characteristics:
- High Performance: Built on Apache Arrow for fast columnar operations
- Type Safety: Schema metadata preserves units and types
- Pandas Interop: Easy conversion with .to_dataframe()
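```python
# Assumption: notes_store is the notes EventStore of one of the loaded scores
notes_store = tsv_loader.store.notes

# Examine the schema with metadata
schema_info = []
for field in notes_store.table.schema:
    meta = field.metadata or {}
    meta_str = (
        ", ".join(f"{k.decode()}={v.decode()}" for k, v in meta.items()) if meta else ""
    )
    # Type strings are truncated to 30 characters for display
    schema_info.append(
        {"name": field.name, "type": str(field.type)[:30], "metadata": meta_str}
    )
pd.DataFrame(schema_info)
```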
|    | name | type | metadata |
|----|------|------|----------|
| 0 | id | string | |
| 1 | name | string | |
| 2 | temporal_type | string | |
| 3 | event_type | string | |
| 4 | start | struct<value: double, numerato | unit=quarters |
| 5 | end | struct<value: double, numerato | unit=quarters |
| 6 | duration | struct<value: double, numerato | unit=quarters |
| 7 | duration_float | double | |
| 8 | mc | int64 | number_type=int64 |
| 9 | mn | string | |
| 10 | mc_onset | struct<num: int64 not null, de | number_type=fraction |
| 11 | mn_onset | struct<num: int64 not null, de | number_type=fraction |
| 12 | midi_pitch | struct<ep: int64, epc: int64> | |
| 13 | specific_pitch | struct<gpc_int: int64, gpc_str | |
| 14 | tpc | int64 | number_type=int64, unit=fifths |
| 15 | octave | int64 | number_type=int64 |
| 16 | velocity | int64 | number_type=int64 |
| 17 | tied | int64 | |
| 18 | gracenote | string | |
| 19 | chord_id | int64 | |
| 20 | voice | int64 | number_type=int64 |
| 21 | staff | int64 | number_type=int64 |
| 22 | part_id | string | |
The Harmonized Schema
TimeToAlign! uses a harmonized schema to represent events consistently across formats:
| Column | Description |
|--------|-------------|
| id | Unique identifier for the event |
| temporal_type | “instant” or “interval” |
| event_type | Type of event (Note, Rest, etc.) |
| start, end, duration | Temporal coordinates (as structs) |
| duration_float | Duration as a float for quick queries |
| mc, mn | Measure count and measure number |
| midi_pitch | MIDI pitch number (0-127) |
| specific_pitch | Pitch spelling information |
```python
# Show selected columns for the first few notes
display_cols = [
    "id",
    "name",
    "temporal_type",
    "event_type",
    "duration_float",
    "mc",
    "mn",
    "octave",
]
tsv_df[display_cols].head(10)
```
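|    | id | name | temporal_type | event_type | duration_float | mc | mn | octave |
|----|----|------|---------------|------------|----------------|----|----|--------|
| 0 | note:000001 | B3 | interval | Note | 0.50 | 1 | 1 | 3 |
| 1 | note:000002 | E2 | interval | Note | 0.25 | 2 | 2 | 2 |
| 2 | note:000003 | E2 | interval | Note | 1.00 | 2 | 2 | 2 |
| 3 | note:000004 | G#3 | interval | Note | 0.25 | 2 | 2 | 3 |
| 4 | note:000005 | E4 | interval | Note | 0.50 | 2 | 2 | 4 |
| 5 | note:000006 | B2 | interval | Note | 0.50 | 2 | 2 | 2 |
| 6 | note:000007 | B3 | interval | Note | 0.25 | 2 | 2 | 3 |
| 7 | note:000008 | G#3 | interval | Note | 0.25 | 2 | 2 | 3 |
| 8 | note:000009 | D#4 | interval | Note | 0.25 | 2 | 2 | 4 |
| 9 | note:000010 | B2 | interval | Note | 0.25 | 2 | 2 | 2 |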
Pitch Information
The specific_pitch column contains rich pitch information as a struct. This preserves the enharmonic spelling (e.g., G# vs Ab) which is lost when using only MIDI pitch numbers.
```python
# Extract spelled pitch information for the first note
first_note = tsv_df.iloc[0]
{
    "name": first_note["name"],
    "midi_pitch": first_note["midi_pitch"],
    "octave": first_note["octave"],
    "specific_pitch": first_note["specific_pitch"],
}
```
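The loss of spelling can be illustrated without TimeToAlign! at all. The sketch below parses note names like those in the `name` column and derives two numbers: a MIDI pitch, assuming the common convention that C4 is 60, and a position on the line of fifths, assuming C is 0. Both conventions and the helper `parse_spelled_pitch` are assumptions made for illustration; the tutorial does not spell out how `midi_pitch`, `tpc`, or `specific_pitch` are computed.

```python
import re

# Semitones above C and line-of-fifths position for each natural note name
STEP_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
STEP_TO_FIFTHS = {"F": -1, "C": 0, "G": 1, "D": 2, "A": 3, "E": 4, "B": 5}


def parse_spelled_pitch(name: str) -> dict:
    """Parse names such as "G#3" or "Ab3" into MIDI and line-of-fifths numbers."""
    match = re.fullmatch(r"([A-G])(#*|b*)(-?\d+)", name)
    if match is None:
        raise ValueError(f"Cannot parse pitch name: {name!r}")
    step, accidentals, octave = match.group(1), match.group(2), int(match.group(3))
    alter = accidentals.count("#") - accidentals.count("b")
    return {
        # Assumed convention: C4 == 60
        "midi": 12 * (octave + 1) + STEP_TO_SEMITONE[step] + alter,
        # Each sharp moves seven steps up the line of fifths, each flat seven down
        "fifths": STEP_TO_FIFTHS[step] + 7 * alter,
    }


g_sharp = parse_spelled_pitch("G#3")
a_flat = parse_spelled_pitch("Ab3")

print(g_sharp)  # {'midi': 56, 'fifths': 8}
print(a_flat)   # {'midi': 56, 'fifths': -4}
```

Both spellings collapse to the same MIDI number, while the line-of-fifths value still tells them apart.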
```
count    498.000000
mean       0.369980
std        0.290333
min        0.000000
25%        0.250000
50%        0.250000
75%        0.250000
max        2.000000
Name: duration_float, dtype: float64
```
Navigating by Measure
The mc (measure count) and mn (measure number) columns allow easy navigation through the score. Note that mn is stored as a string (to support labels like “1a”, “1b”), so we convert to int for proper sorting:
```python
# Notes per measure, sorted numerically
notes_per_measure = tsv_df.groupby("mn").size()
# Convert index to int for proper sorting (works for simple numeric measure numbers)
notes_per_measure.index = notes_per_measure.index.astype(int)
notes_per_measure = notes_per_measure.sort_index()
notes_per_measure.to_frame("notes")
```
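| mn | notes |
|----|-------|
| 1 | 1 |
| 2 | 21 |
| 3 | 24 |
| 4 | 22 |
| 5 | 25 |
| 6 | 25 |
| 7 | 24 |
| 8 | 21 |
| 9 | 22 |
| 10 | 22 |
| 11 | 24 |
| 12 | 22 |
| 13 | 25 |
| 14 | 25 |
| 15 | 28 |
| 16 | 27 |
| 17 | 56 |
| 18 | 18 |
| 19 | 21 |
| 20 | 21 |
| 21 | 17 |
| 22 | 7 |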
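Converting the index to int works for this piece, but it would fail on labels such as “1a”. One possible alternative, sketched here with a hypothetical helper `measure_sort_key`, is to sort on a key that splits each label into its leading number and any trailing suffix. The label format (digits followed by an optional suffix) is an assumption.

```python
import re


def measure_sort_key(label: str) -> tuple:
    """Split a measure label like "12" or "1a" into (number, suffix) for sorting."""
    match = re.fullmatch(r"(\d+)(.*)", label)
    if match is None:
        raise ValueError(f"Unexpected measure label: {label!r}")
    return (int(match.group(1)), match.group(2))


labels = ["10", "2", "1b", "1a", "1"]
sorted_labels = sorted(labels, key=measure_sort_key)

print(sorted_labels)  # ['1', '1a', '1b', '2', '10']
```

With pandas, the sorted labels could then be passed to `reindex`, for example `notes_per_measure.reindex(sorted(notes_per_measure.index, key=measure_sort_key))`, leaving the string index intact.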
```python
# Get all notes in a specific measure
measure_5 = tsv_df[tsv_df["mn"] == "5"]
measure_5[["name", "duration_float", "voice", "staff"]]
```
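|    | name | duration_float | voice | staff |
|----|------|----------------|-------|-------|
| 68 | E2 | 0.25 | 1 | 2 |
| 69 | E2 | 1.00 | 3 | 2 |
| 70 | B3 | 0.25 | 3 | 1 |
| 71 | A4 | 0.25 | 1 | 1 |
| 72 | B2 | 0.50 | 1 | 2 |
| 73 | E4 | 0.25 | 3 | 1 |
| 74 | G#4 | 0.25 | 1 | 1 |
| 75 | G#3 | 0.25 | 3 | 1 |
| 76 | D#4 | 0.25 | 1 | 1 |
| 77 | B2 | 0.25 | 1 | 2 |
| 78 | B3 | 0.25 | 3 | 1 |
| 79 | E4 | 0.25 | 1 | 1 |
| 80 | B1 | 0.25 | 1 | 2 |
| 81 | B1 | 1.00 | 3 | 2 |
| 82 | A3 | 0.25 | 3 | 1 |
| 83 | C#4 | 0.25 | 2 | 1 |
| 84 | F#4 | 1.00 | 1 | 1 |
| 85 | B2 | 0.50 | 1 | 2 |
| 86 | B3 | 0.25 | 3 | 1 |
| 87 | D#4 | 0.25 | 2 | 1 |
| 88 | A3 | 0.25 | 3 | 1 |
| 89 | C#4 | 0.25 | 2 | 1 |
| 90 | B2 | 0.25 | 1 | 2 |
| 91 | B3 | 0.25 | 3 | 1 |
| 92 | D#4 | 0.25 | 2 | 1 |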
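Because mn is a string column, the filter above has to compare against "5" rather than the integer 5, which would typically match nothing. A small wrapper can hide that detail. The helper name `notes_in_measure` and the toy DataFrame are hypothetical and only mimic the relevant columns.

```python
import pandas as pd


def notes_in_measure(df: pd.DataFrame, mn) -> pd.DataFrame:
    """Select the rows of one measure number, accepting either 5 or "5"."""
    return df[df["mn"] == str(mn)]


# Toy stand-in with the same column names as the notes DataFrame
toy_df = pd.DataFrame(
    {
        "name": ["E2", "B3", "A4", "B2"],
        "mn": ["4", "5", "5", "6"],
        "staff": [2, 1, 1, 2],
    }
)

selected = notes_in_measure(toy_df, 5)
print(selected[["name", "staff"]])
```

Coercing the argument with `str()` keeps call sites readable and also works unchanged for labels such as "1a".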
Comparing Loader Outputs
While all loaders produce the same number of notes, there can be subtle differences in how they interpret the score. Let’s compare the first few notes:
```python
# Compare ID schemes across loaders
pd.DataFrame(
    {
        "TSV_id": tsv_df["id"].head(5).values,
        "TSV_name": tsv_df["name"].head(5).values,
        "Partitura_id": partitura_df["id"].head(5).values,
        "Music21_id": music21_df["id"].head(5).values,
    }
)
```
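|    | TSV_id | TSV_name | Partitura_id | Music21_id |
|----|--------|----------|--------------|------------|
| 0 | note:000001 | B3 | note:000001 | note:000001 |
| 1 | note:000002 | E2 | note:000002 | note:000002 |
| 2 | note:000003 | E2 | note:000003 | note:000003 |
| 3 | note:000004 | G#3 | note:000004 | note:000004 |
| 4 | note:000005 | E4 | note:000005 | note:000005 |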
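Beyond eyeballing the first rows, differences between two loaders could be located with a pandas merge. The sketch below uses toy DataFrames and assumes that equal ids refer to the same note in both sources, which the tutorial does not guarantee; if the ids are merely sequential, joining on onset and pitch may be more appropriate.

```python
import pandas as pd

# Toy stand-ins for two loaders' note DataFrames; the third spelling differs
toy_tsv = pd.DataFrame(
    {"id": ["note:000001", "note:000002", "note:000003"], "name": ["B3", "E2", "E2"]}
)
toy_other = pd.DataFrame(
    {"id": ["note:000001", "note:000002", "note:000003"], "name": ["B3", "E2", "Fb2"]}
)

# An outer merge keeps unmatched rows; indicator=True adds a _merge column
merged = toy_tsv.merge(
    toy_other, on="id", how="outer", suffixes=("_tsv", "_other"), indicator=True
)

# Rows present in only one source
only_one_side = merged[merged["_merge"] != "both"]

# Rows present in both sources but with different note names
name_differs = merged[
    (merged["_merge"] == "both") & (merged["name_tsv"] != merged["name_other"])
]

print(name_differs[["id", "name_tsv", "name_other"]])
```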
Unit Metadata
TimeToAlign! stores unit information in the PyArrow schema metadata. This ensures coordinates are always interpreted correctly:
```python
# Extract unit metadata for temporal columns
temporal_cols = ["start", "end", "duration"]
{
    field.name: field.metadata.get(b"unit", b"(unknown)").decode()
    for field in notes_store.table.schema
    if field.name in temporal_cols and field.metadata
}
```