How to Load Data

Format-agnostic data ingestion with Loaders and EventStores

This tutorial introduces the Loader pattern and EventStores - the foundation for bringing music data into TimeToAlign!

Learning Objectives:

- Use Loaders to ingest music data from various formats
- Navigate EventStores and access event data
- Understand the harmonized schema that unifies different data sources

Prerequisites:

- Basic Python and pandas knowledge
- TimeToAlign! installed (pip install -e . from the repository root)

Why Loaders Matter

Music data comes in many formats: MusicXML, MIDI, MEI, Humdrum, proprietary TSV exports, and more. Each format has its own structure, terminology, and quirks.

The problem: Without a unified approach, you’d need format-specific code for every data source, making cross-format analysis difficult and error-prone.

The TimeToAlign! solution: Loaders normalize heterogeneous formats into a consistent EventStore, enabling downstream processing without format-specific code.

MusicXML ─┐
MIDI ─────┼──> Loader ──> EventStore ──> DataFrame
TSV ──────┘
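
To make the idea concrete before touching the real loaders, here is a minimal, standard-library-only sketch of the pattern. Everything in it (NoteEvent, ToyTSVLoader, ToyJSONLoader, and the two toy input formats) is hypothetical and exists only for illustration; it is not the TimeToAlign! implementation. It mimics the load-then-read-store usage and shows two differently shaped inputs ending up as identical records.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteEvent:
    """One row of a toy harmonized schema (hypothetical, for illustration)."""

    name: str
    start: float  # onset in quarter notes
    duration: float  # length in quarter notes


class ToyTSVLoader:
    """Hypothetical loader for a tab-separated export."""

    def __init__(self):
        self.store = []

    def load(self, text):
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        self.store = [
            NoteEvent(row["pitch"], float(row["onset"]), float(row["length"]))
            for row in reader
        ]


class ToyJSONLoader:
    """Hypothetical loader for a JSON export that uses other field names."""

    def __init__(self):
        self.store = []

    def load(self, text):
        # This format stores start and end times, so the duration is derived.
        self.store = [
            NoteEvent(item["note"], item["t0"], item["t1"] - item["t0"])
            for item in json.loads(text)
        ]


tsv_text = "pitch\tonset\tlength\nB3\t0.0\t0.5\nE2\t0.5\t0.25\n"
json_text = (
    '[{"note": "B3", "t0": 0.0, "t1": 0.5},'
    ' {"note": "E2", "t0": 0.5, "t1": 0.75}]'
)

toy_tsv = ToyTSVLoader()
toy_tsv.load(tsv_text)
toy_json = ToyJSONLoader()
toy_json.load(json_text)

# Downstream code sees identical records regardless of the source format.
same_events = toy_tsv.store == toy_json.store
```

All format-specific knowledge (column names, start/end versus duration) stays inside the loaders, so anything that consumes store never needs to know where the data came from.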

Setup

import pandas as pd

from timetoalign.loader.score.music21 import Music21Loader
from timetoalign.loader.score.partitura import PartituraLoader
from timetoalign.loader.score.tsv import TSVLoader
from timetoalign.testdata import ensure_data

DATA_DIR = ensure_data("vienna_1x22")

# Our test piece: Chopin Etude Op.10 No.3
CHOPIN_XML = DATA_DIR / "Chopin_op10_no3.musicxml"
CHOPIN_TSV = DATA_DIR / "ms3" / "chopin_op10_no3.notes.tsv"

CHOPIN_XML.name, CHOPIN_TSV.name

('Chopin_op10_no3.musicxml', 'chopin_op10_no3.notes.tsv')

The Loader Pattern

All TimeToAlign! loaders follow the same three-step pattern:

Create a loader instance
Load a file using .load(path)
Access the store containing EventStores

Let’s see this in action with three different loaders, all loading the same Chopin piece:

# Load from three different sources
tsv_loader = TSVLoader()
tsv_loader.load(CHOPIN_TSV)

partitura_loader = PartituraLoader()
partitura_loader.load(CHOPIN_XML)

music21_loader = Music21Loader()
music21_loader.load(CHOPIN_XML)

# All produce ScoreStores
{
    "TSV": type(tsv_loader.store).__name__,
    "Partitura": type(partitura_loader.store).__name__,
    "Music21": type(music21_loader.store).__name__,
}

{'TSV': 'ScoreStore', 'Partitura': 'ScoreStore', 'Music21': 'ScoreStore'}

Cross-Loader Validation

One of the key benefits of TimeToAlign! is that different loaders produce comparable output. Let’s verify that all three loaders found the same number of notes:

# Convert to DataFrames
tsv_df = tsv_loader.store.notes.to_dataframe()
partitura_df = partitura_loader.store.notes.to_dataframe()
music21_df = music21_loader.store.notes.to_dataframe()

# Count only Note events (not rests or other event types)
counts = {
    "TSV": len(tsv_df[tsv_df["event_type"] == "Note"]),
    "Partitura": len(partitura_df[partitura_df["event_type"] == "Note"]),
    "Music21": len(music21_df[music21_df["event_type"] == "Note"]),
}

# Validate against gold standard
assert all(c == 498 for c in counts.values()), f"Note count mismatch: {counts}"

pd.Series(counts, name="note_count")

TSV          498
Partitura    498
Music21      498
Name: note_count, dtype: int64

The EventStore

Each ScoreStore contains EventStores - efficient, PyArrow-backed tables that hold musical events.

Key characteristics:

- High Performance: Built on Apache Arrow for fast columnar operations
- Type Safety: Schema metadata preserves units and types
- Pandas Interop: Easy conversion with .to_dataframe()

notes_store = tsv_loader.store.notes

{
    "type": type(notes_store).__name__,
    "n_events": len(notes_store),
    "storage": type(notes_store.table).__name__,
}

{'type': 'NoteEventData', 'n_events': 498, 'storage': 'Table'}

# Examine the schema with metadata
schema_info = []
for field in notes_store.table.schema:
    meta = field.metadata or {}
    meta_str = (
        ", ".join(f"{k.decode()}={v.decode()}" for k, v in meta.items()) if meta else ""
    )
    schema_info.append(
        {"name": field.name, "type": str(field.type)[:30], "metadata": meta_str}
    )

pd.DataFrame(schema_info)

	name	type	metadata
0	id	string
1	name	string
2	temporal_type	string
3	event_type	string
4	start	struct<value: double, numerato	unit=quarters
5	end	struct<value: double, numerato	unit=quarters
6	duration	struct<value: double, numerato	unit=quarters
7	duration_float	double
8	mc	int64	number_type=int64
9	mn	string
10	mc_onset	struct<num: int64 not null, de	number_type=fraction
11	mn_onset	struct<num: int64 not null, de	number_type=fraction
12	midi_pitch	struct<ep: int64, epc: int64>
13	specific_pitch	struct<gpc_int: int64, gpc_str
14	tpc	int64	number_type=int64, unit=fifths
15	octave	int64	number_type=int64
16	velocity	int64	number_type=int64
17	tied	int64
18	gracenote	string
19	chord_id	int64
20	voice	int64	number_type=int64
21	staff	int64	number_type=int64
22	part_id	string

The Harmonized Schema

TimeToAlign! uses a harmonized schema to represent events consistently across formats:

Column	Description
`id`	Unique identifier for the event
`temporal_type`	“instant” or “interval”
`event_type`	Type of event (Note, Rest, etc.)
`start`, `end`, `duration`	Temporal coordinates (as structs)
`duration_float`	Duration as a float for quick queries
`mc`, `mn`	Measure count and measure number
`midi_pitch`	MIDI pitch as a struct: `ep` (pitch number, 0-127) and `epc` (pitch class, 0-11)
`specific_pitch`	Pitch spelling information

# Show selected columns for the first few notes
display_cols = [
    "id",
    "name",
    "temporal_type",
    "event_type",
    "duration_float",
    "mc",
    "mn",
    "octave",
]
tsv_df[display_cols].head(10)

	id	name	temporal_type	event_type	duration_float	mc	mn	octave
0	note:000001	B3	interval	Note	0.50	1	1	3
1	note:000002	E2	interval	Note	0.25	2	2	2
2	note:000003	E2	interval	Note	1.00	2	2	2
3	note:000004	G#3	interval	Note	0.25	2	2	3
4	note:000005	E4	interval	Note	0.50	2	2	4
5	note:000006	B2	interval	Note	0.50	2	2	2
6	note:000007	B3	interval	Note	0.25	2	2	3
7	note:000008	G#3	interval	Note	0.25	2	2	3
8	note:000009	D#4	interval	Note	0.25	2	2	4
9	note:000010	B2	interval	Note	0.25	2	2	2

Pitch Information

The specific_pitch column contains rich pitch information as a struct. This preserves the enharmonic spelling (e.g., G# vs Ab) which is lost when using only MIDI pitch numbers.

# Extract spelled pitch information for the first note
first_note = tsv_df.iloc[0]

{
    "name": first_note["name"],
    "midi_pitch": first_note["midi_pitch"],
    "octave": first_note["octave"],
    "specific_pitch": first_note["specific_pitch"],
}

{'name': 'B3',
 'midi_pitch': {'ep': 59, 'epc': 11},
 'octave': 3,
 'specific_pitch': {'gpc_int': 6,
  'gpc_str': 'B',
  'acc': 0,
  'spc_int': 5,
  'spc_str': 'B',
  'sp': 'B3',
  'cents': 0.0}}
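
The relationship between a spelled pitch and its MIDI number can be sketched in a few lines. The helper below is hypothetical (spelled_to_midi is not part of TimeToAlign!) and assumes the common convention that C4 corresponds to MIDI pitch 60, with accidentals written as # or b. Under that assumption it reproduces the ep and epc values shown above for B3, and it shows that G#3 and Ab3 collapse to the same MIDI number.

```python
import re

# Semitone offsets of the natural note names within an octave.
NATURAL_PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def spelled_to_midi(name):
    """Convert a spelled pitch such as 'G#3' or 'Ab3' to (midi_pitch, pitch_class).

    Assumes the convention C4 = 60 and accidentals written as '#' or 'b'.
    """
    match = re.fullmatch(r"([A-G])([#b]*)(-?\d+)", name)
    if match is None:
        raise ValueError(f"Cannot parse pitch name: {name!r}")
    letter, accidentals, octave = match.groups()
    shift = accidentals.count("#") - accidentals.count("b")
    midi = 12 * (int(octave) + 1) + NATURAL_PITCH_CLASS[letter] + shift
    return midi, midi % 12


b3 = spelled_to_midi("B3")  # (59, 11)
g_sharp = spelled_to_midi("G#3")  # (56, 8)
a_flat = spelled_to_midi("Ab3")  # (56, 8): the spelling is no longer recoverable
```

The mapping only works in one direction: from (56, 8) alone there is no way to tell whether the score said G# or Ab, which is why the spelled representation is kept alongside the MIDI number.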

Duration Analysis

TimeToAlign! stores durations in quarter notes. Let’s analyze the rhythmic content of our piece:

# Duration distribution
tsv_df["duration_float"].value_counts().sort_index().to_frame("count")

	count
duration_float
0.00	4
0.25	391
0.50	42
0.75	2
1.00	54
2.00	5

# Summary statistics
tsv_df["duration_float"].describe()

count    498.000000
mean       0.369980
std        0.290333
min        0.000000
25%        0.250000
50%        0.250000
75%        0.250000
max        2.000000
Name: duration_float, dtype: float64
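
As a sanity check, the two outputs above can be reconciled by hand. The sketch below copies the counts from the distribution table, recomputes the total and the mean with exact fractions, and attaches conventional note-value names to each duration (with the quarter note as the unit, 0.25 is a sixteenth note). The label used for the zero-length entries is a placeholder; the tutorial does not say what those events are.

```python
from fractions import Fraction

# Duration distribution copied from the value_counts() output above:
# duration in quarter notes -> number of notes.
duration_counts = {
    Fraction(0): 4,
    Fraction(1, 4): 391,
    Fraction(1, 2): 42,
    Fraction(3, 4): 2,
    Fraction(1): 54,
    Fraction(2): 5,
}

# Conventional names for these values when the quarter note is the unit.
NOTE_VALUE_NAMES = {
    Fraction(1, 4): "sixteenth",
    Fraction(1, 2): "eighth",
    Fraction(3, 4): "dotted eighth",
    Fraction(1): "quarter",
    Fraction(2): "half",
}

total_notes = sum(duration_counts.values())
total_quarters = sum(d * n for d, n in duration_counts.items())
mean_duration = float(total_quarters / total_notes)

# "zero-length" is a placeholder label for durations without a note-value name.
named_counts = {
    NOTE_VALUE_NAMES.get(d, "zero-length"): n for d, n in duration_counts.items()
}
```

The recomputed total is 498 notes spanning 184.25 quarter notes, and the mean rounds to 0.369980, matching the describe() output. Sixteenth notes make up 391 of the 498 events, which is why the 25%, 50%, and 75% quantiles are all 0.25.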

Navigating by Measure

The mc (measure count) and mn (measure number) columns allow easy navigation through the score. Note that mn is stored as a string (to support labels like “1a”, “1b”), so we convert to int for proper sorting:

# Notes per measure, sorted numerically
notes_per_measure = tsv_df.groupby("mn").size()

# Convert index to int for proper sorting (works for simple numeric measure numbers)
notes_per_measure.index = notes_per_measure.index.astype(int)
notes_per_measure = notes_per_measure.sort_index()

notes_per_measure.to_frame("notes")

	notes
mn
1	1
2	21
3	24
4	22
5	25
6	25
7	24
8	21
9	22
10	22
11	24
12	22
13	25
14	25
15	28
16	27
17	56
18	18
19	21
20	21
21	17
22	7
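
The integer conversion above fails as soon as a label such as "1a" appears. One possible workaround, sketched here with a hypothetical helper (measure_sort_key is not part of TimeToAlign!), is to sort on a (number, suffix) pair. It assumes every label is a run of digits followed by an optional suffix.

```python
import re


def measure_sort_key(label):
    """Split a measure label such as '12' or '1a' into (number, suffix).

    Assumes labels consist of digits followed by an optional suffix.
    """
    match = re.fullmatch(r"(\d+)(.*)", label)
    if match is None:
        raise ValueError(f"Unexpected measure label: {label!r}")
    number, suffix = match.groups()
    return int(number), suffix


labels = ["10", "2", "1b", "1a", "1", "22"]

# Plain string sorting puts "10" before "2".
lexicographic = sorted(labels)  # ['1', '10', '1a', '1b', '2', '22']

# Sorting on (number, suffix) restores score order.
natural = sorted(labels, key=measure_sort_key)  # ['1', '1a', '1b', '2', '10', '22']
```

With a pandas Series indexed by mn, the same key could be applied by reindexing, for example notes_per_measure.reindex(sorted(notes_per_measure.index, key=measure_sort_key)).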

# Get all notes in a specific measure
measure_5 = tsv_df[tsv_df["mn"] == "5"]
measure_5[["name", "duration_float", "voice", "staff"]]

	name	duration_float	voice	staff
68	E2	0.25	1	2
69	E2	1.00	3	2
70	B3	0.25	3	1
71	A4	0.25	1	1
72	B2	0.50	1	2
73	E4	0.25	3	1
74	G#4	0.25	1	1
75	G#3	0.25	3	1
76	D#4	0.25	1	1
77	B2	0.25	1	2
78	B3	0.25	3	1
79	E4	0.25	1	1
80	B1	0.25	1	2
81	B1	1.00	3	2
82	A3	0.25	3	1
83	C#4	0.25	2	1
84	F#4	1.00	1	1
85	B2	0.50	1	2
86	B3	0.25	3	1
87	D#4	0.25	2	1
88	A3	0.25	3	1
89	C#4	0.25	2	1
90	B2	0.25	1	2
91	B3	0.25	3	1
92	D#4	0.25	2	1

Comparing Loader Outputs

While all loaders produce the same number of notes, they may still differ in how they interpret the score. As a first check, let’s compare the IDs each loader assigns to the first five notes:

# Compare ID schemes across loaders
pd.DataFrame(
    {
        "TSV_id": tsv_df["id"].head(5).values,
        "TSV_name": tsv_df["name"].head(5).values,
        "Partitura_id": partitura_df["id"].head(5).values,
        "Music21_id": music21_df["id"].head(5).values,
    }
)

	TSV_id	TSV_name	Partitura_id	Music21_id
0	note:000001	B3	note:000001	note:000001
1	note:000002	E2	note:000002	note:000002
2	note:000003	E2	note:000003	note:000003
3	note:000004	G#3	note:000004	note:000004
4	note:000005	E4	note:000005	note:000005
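
Matching IDs and counts do not prove that the note content is identical. A stricter comparison can be sketched with a pandas outer merge, shown here on two toy DataFrames rather than the real loader output (the values are made up so that one row differs in spelling). The choice of name and duration_float as comparison keys is an assumption for illustration.

```python
import pandas as pd

# Toy stand-ins for two loader outputs (values are illustrative, not real data).
left = pd.DataFrame(
    {"name": ["B3", "E2", "G#3"], "duration_float": [0.5, 0.25, 0.25]}
)
right = pd.DataFrame(
    {"name": ["B3", "E2", "Ab3"], "duration_float": [0.5, 0.25, 0.25]}
)

# An outer merge with indicator=True labels each row as found in
# "both" frames, "left_only", or "right_only".
comparison = left.merge(
    right, on=["name", "duration_float"], how="outer", indicator=True
)
mismatches = comparison[comparison["_merge"] != "both"]
```

Here mismatches holds the G#3 row (left_only) and the Ab3 row (right_only). On real data the key columns usually need to identify each note uniquely, for example by adding an onset column, because repeated key combinations are matched many-to-many by merge.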

Unit Metadata

TimeToAlign! stores unit information in the PyArrow schema metadata. This ensures coordinates are always interpreted correctly:

# Extract unit metadata for temporal columns
temporal_cols = ["start", "end", "duration"]
{
    field.name: field.metadata.get(b"unit", b"(unknown)").decode()
    for field in notes_store.table.schema
    if field.name in temporal_cols and field.metadata
}

{'start': 'quarters', 'end': 'quarters', 'duration': 'quarters'}

Voice and Staff Information

In scores with multiple staves or multiple voices per staff, notes are distributed accordingly:

# Notes by staff and voice
tsv_df.groupby(["staff", "voice"]).size().unstack(fill_value=0)

voice	1	2	3
staff
1	112	50	153
2	142	2	39

Summary

In this tutorial, we learned:

The Loader Pattern: Create -> Load -> Access Store
Three Score Loaders: TSVLoader, PartituraLoader, Music21Loader
EventStore: PyArrow-backed, high-performance event storage
Harmonized Schema: Consistent columns across all loaders
Cross-Validation: Same piece from different sources yields same note count

Key Takeaway: Loaders normalize heterogeneous formats into a consistent EventStore, enabling downstream processing without format-specific code.

Next Steps

03_conversion_maps.ipynb: Learn how to convert between coordinate systems
04_building_timelines.ipynb: Create Timeline objects from EventStores

Exercise: Load Another Score

Task: Load the Beethoven String Quartet from beethoven_op18.mid and analyze its structure.

Hints:

1. Use PartituraLoader for MIDI files
2. Check how many parts are in the score
3. Count notes per part

Solution

# Load the Beethoven quartet
loader = PartituraLoader()
loader.load(DATA_DIR / "beethoven_op18.mid")

df = loader.store.notes.to_dataframe()
{"total_notes": len(df), "notes_per_part": df.groupby("part_id").size().to_dict()}

# Your solution here