How to Load Data

Format-agnostic data ingestion with Loaders and EventStores

This tutorial introduces the Loader pattern and EventStores - the foundation for bringing music data into TimeToAlign!

Learning Objectives:

  • Use Loaders to ingest music data from various formats
  • Navigate EventStores and access event data
  • Understand the harmonized schema that unifies different data sources

Prerequisites:

  • Basic Python and pandas knowledge
  • TimeToAlign! installed (pip install -e . from the repository root)

Why Loaders Matter

Music data comes in many formats: MusicXML, MIDI, MEI, Humdrum, proprietary TSV exports, and more. Each format has its own structure, terminology, and quirks.

The problem: Without a unified approach, you’d need format-specific code for every data source, making cross-format analysis difficult and error-prone.

The TimeToAlign! solution: Loaders normalize heterogeneous formats into a consistent EventStore, enabling downstream processing without format-specific code.

MusicXML ─┐
MIDI ─────┼──> Loader ──> EventStore ──> DataFrame
TSV ──────┘
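
To make the idea concrete, here is a minimal sketch of the normalization step in plain Python. The two toy loaders and their input formats are hypothetical and are not part of TimeToAlign!; they only illustrate how two differently shaped sources can end up in one shared record shape.

```python
import csv
import io
import json


class ToyCsvLoader:
    """Hypothetical loader for a CSV export with 'pitch,dur' columns."""

    def __init__(self):
        self.store = []

    def load(self, text):
        for row in csv.DictReader(io.StringIO(text)):
            self.store.append({"name": row["pitch"], "duration": float(row["dur"])})


class ToyJsonLoader:
    """Hypothetical loader for a JSON list of {'note': ..., 'quarters': ...}."""

    def __init__(self):
        self.store = []

    def load(self, text):
        for item in json.loads(text):
            self.store.append(
                {"name": item["note"], "duration": float(item["quarters"])}
            )


csv_loader = ToyCsvLoader()
csv_loader.load("pitch,dur\nB3,0.5\nE2,0.25\n")

json_loader = ToyJsonLoader()
json_loader.load('[{"note": "B3", "quarters": 0.5}, {"note": "E2", "quarters": 0.25}]')

# Both sources now share one record shape, so downstream code is format-agnostic.
same_records = csv_loader.store == json_loader.store
```

Code that consumes `store` never needs to know which format the records came from, which is the property the real loaders aim to provide.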

Setup

import pandas as pd

from timetoalign.loader.score.music21 import Music21Loader
from timetoalign.loader.score.partitura import PartituraLoader
from timetoalign.loader.score.tsv import TSVLoader
from timetoalign.testdata import ensure_data

DATA_DIR = ensure_data("vienna_1x22")

# Our test piece: Chopin Etude Op.10 No.3
CHOPIN_XML = DATA_DIR / "Chopin_op10_no3.musicxml"
CHOPIN_TSV = DATA_DIR / "ms3" / "chopin_op10_no3.notes.tsv"

CHOPIN_XML.name, CHOPIN_TSV.name
('Chopin_op10_no3.musicxml', 'chopin_op10_no3.notes.tsv')

The Loader Pattern

All TimeToAlign! loaders follow the same three-step pattern:

  1. Create a loader instance
  2. Load a file using .load(path)
  3. Access the store containing EventStores

Let’s see this in action with three different loaders, all loading the same Chopin piece:

# Load from three different sources
tsv_loader = TSVLoader()
tsv_loader.load(CHOPIN_TSV)

partitura_loader = PartituraLoader()
partitura_loader.load(CHOPIN_XML)

music21_loader = Music21Loader()
music21_loader.load(CHOPIN_XML)

# All produce ScoreStores
{
    "TSV": type(tsv_loader.store).__name__,
    "Partitura": type(partitura_loader.store).__name__,
    "Music21": type(music21_loader.store).__name__,
}
{'TSV': 'ScoreStore', 'Partitura': 'ScoreStore', 'Music21': 'ScoreStore'}
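
Because the three steps are identical for every loader, they can be wrapped in a small helper. The sketch below is a hypothetical convenience function, not part of TimeToAlign!; it assumes only what the cells above show, namely that a loader can be constructed without arguments and exposes .load(path) and .store.notes.to_dataframe().

```python
def load_notes(loader_cls, path):
    """Run create -> load -> access for any loader class and return the notes."""
    loader = loader_cls()  # 1. create a loader instance
    loader.load(path)  # 2. load the file
    return loader.store.notes.to_dataframe()  # 3. access the store


# Usage, assuming the Setup cell above has been run:
# frames = {
#     "TSV": load_notes(TSVLoader, CHOPIN_TSV),
#     "Partitura": load_notes(PartituraLoader, CHOPIN_XML),
#     "Music21": load_notes(Music21Loader, CHOPIN_XML),
# }
```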

Cross-Loader Validation

One of the key benefits of TimeToAlign! is that different loaders produce comparable output. Let’s verify that all three loaders found the same number of notes:

# Convert to DataFrames
tsv_df = tsv_loader.store.notes.to_dataframe()
partitura_df = partitura_loader.store.notes.to_dataframe()
music21_df = music21_loader.store.notes.to_dataframe()

# Count only Note events (not rests or other event types)
counts = {
    "TSV": len(tsv_df[tsv_df["event_type"] == "Note"]),
    "Partitura": len(partitura_df[partitura_df["event_type"] == "Note"]),
    "Music21": len(music21_df[music21_df["event_type"] == "Note"]),
}

# Validate against gold standard
assert all(c == 498 for c in counts.values()), f"Note count mismatch: {counts}"

pd.Series(counts, name="note_count")
TSV          498
Partitura    498
Music21      498
Name: note_count, dtype: int64

The EventStore

Each ScoreStore contains EventStores - efficient, PyArrow-backed tables that hold musical events.

Key characteristics:

  • High Performance: Built on Apache Arrow for fast columnar operations
  • Type Safety: Schema metadata preserves units and types
  • Pandas Interop: Easy conversion with .to_dataframe()

notes_store = tsv_loader.store.notes

{
    "type": type(notes_store).__name__,
    "n_events": len(notes_store),
    "storage": type(notes_store.table).__name__,
}
{'type': 'NoteEventData', 'n_events': 498, 'storage': 'Table'}
# Examine the schema with metadata
schema_info = []
for field in notes_store.table.schema:
    meta = field.metadata or {}
    meta_str = (
        ", ".join(f"{k.decode()}={v.decode()}" for k, v in meta.items()) if meta else ""
    )
    schema_info.append(
        {"name": field.name, "type": str(field.type)[:30], "metadata": meta_str}
    )

pd.DataFrame(schema_info)
name type metadata
0 id string
1 name string
2 temporal_type string
3 event_type string
4 start struct<value: double, numerato unit=quarters
5 end struct<value: double, numerato unit=quarters
6 duration struct<value: double, numerato unit=quarters
7 duration_float double
8 mc int64 number_type=int64
9 mn string
10 mc_onset struct<num: int64 not null, de number_type=fraction
11 mn_onset struct<num: int64 not null, de number_type=fraction
12 midi_pitch struct<ep: int64, epc: int64>
13 specific_pitch struct<gpc_int: int64, gpc_str
14 tpc int64 number_type=int64, unit=fifths
15 octave int64 number_type=int64
16 velocity int64 number_type=int64
17 tied int64
18 gracenote string
19 chord_id int64
20 voice int64 number_type=int64
21 staff int64 number_type=int64
22 part_id string

The Harmonized Schema

TimeToAlign! uses a harmonized schema to represent events consistently across formats:

Column                Description
id                    Unique identifier for the event
temporal_type         “instant” or “interval”
event_type            Type of event (Note, Rest, etc.)
start, end, duration  Temporal coordinates (as structs)
duration_float        Duration as a float for quick queries
mc, mn                Measure count and measure number
midi_pitch            MIDI pitch as a struct: ep (pitch number, 0-127) and epc (pitch class, 0-11)
specific_pitch        Pitch spelling information
# Show selected columns for the first few notes
display_cols = [
    "id",
    "name",
    "temporal_type",
    "event_type",
    "duration_float",
    "mc",
    "mn",
    "octave",
]
tsv_df[display_cols].head(10)
id name temporal_type event_type duration_float mc mn octave
0 note:000001 B3 interval Note 0.50 1 1 3
1 note:000002 E2 interval Note 0.25 2 2 2
2 note:000003 E2 interval Note 1.00 2 2 2
3 note:000004 G#3 interval Note 0.25 2 2 3
4 note:000005 E4 interval Note 0.50 2 2 4
5 note:000006 B2 interval Note 0.50 2 2 2
6 note:000007 B3 interval Note 0.25 2 2 3
7 note:000008 G#3 interval Note 0.25 2 2 3
8 note:000009 D#4 interval Note 0.25 2 2 4
9 note:000010 B2 interval Note 0.25 2 2 2
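
As an illustration of the "quick queries" that duration_float is meant for, the sketch below filters a toy DataFrame by measure and duration. The rows are the first six rows of the table above, re-typed by hand so the example runs on its own; with the real data you would apply the same mask to tsv_df.

```python
import pandas as pd

# First six rows of the table above, re-typed by hand for illustration.
toy_df = pd.DataFrame(
    {
        "name": ["B3", "E2", "E2", "G#3", "E4", "B2"],
        "duration_float": [0.50, 0.25, 1.00, 0.25, 0.50, 0.50],
        "mc": [1, 2, 2, 2, 2, 2],
    }
)

# Notes in measure count 2 lasting at least an eighth note (0.5 quarters).
mask = (toy_df["mc"] == 2) & (toy_df["duration_float"] >= 0.5)
long_names = toy_df.loc[mask, "name"].tolist()
```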

Pitch Information

The specific_pitch column contains rich pitch information as a struct. This preserves the enharmonic spelling (e.g., G# vs Ab) which is lost when using only MIDI pitch numbers.

# Extract spelled pitch information for the first note
first_note = tsv_df.iloc[0]

{
    "name": first_note["name"],
    "midi_pitch": first_note["midi_pitch"],
    "octave": first_note["octave"],
    "specific_pitch": first_note["specific_pitch"],
}
{'name': 'B3',
 'midi_pitch': {'ep': 59, 'epc': 11},
 'octave': 3,
 'specific_pitch': {'gpc_int': 6,
  'gpc_str': 'B',
  'acc': 0,
  'spc_int': 5,
  'spc_str': 'B',
  'sp': 'B3',
  'cents': 0.0}}
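
The loss of spelling is easy to demonstrate. The following sketch is a hypothetical helper, not a TimeToAlign! function; it converts a spelled pitch name to a MIDI number, assuming the common convention C4 = 60 and "#" and "b" as accidental signs. Under that convention B3 maps to 59, which agrees with the ep value in the output above.

```python
# Semitone offsets of the natural note names within an octave.
NATURAL_SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def spelled_to_midi(name):
    """Convert a spelled pitch such as 'G#3' or 'Ab3' to a MIDI number.

    Assumes the convention C4 = 60 and '#'/'b' as accidental signs.
    """
    letter = name[0]
    # Everything between the letter and the octave digits is the accidental part.
    accidentals = name[1:].rstrip("-0123456789")
    octave = int(name[1 + len(accidentals):])
    shift = accidentals.count("#") - accidentals.count("b")
    return 12 * (octave + 1) + NATURAL_SEMITONES[letter] + shift


g_sharp = spelled_to_midi("G#3")
a_flat = spelled_to_midi("Ab3")
# Two different spellings collapse to the same MIDI number.
spelling_lost = g_sharp == a_flat
```

Going in this direction is lossless, but the reverse is not: a MIDI number alone cannot say whether the score showed G# or Ab, which is why the spelled information is kept in its own column.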

Duration Analysis

TimeToAlign! stores durations in quarter notes. Let’s analyze the rhythmic content of our piece:

# Duration distribution
tsv_df["duration_float"].value_counts().sort_index().to_frame("count")
count
duration_float
0.00 4
0.25 391
0.50 42
0.75 2
1.00 54
2.00 5
# Summary statistics
tsv_df["duration_float"].describe()
count    498.000000
mean       0.369980
std        0.290333
min        0.000000
25%        0.250000
50%        0.250000
75%        0.250000
max        2.000000
Name: duration_float, dtype: float64
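
Durations in quarter notes are easy to translate into conventional note values. The sketch below uses a hypothetical helper (not part of TimeToAlign!) to express each observed duration as a fraction of a whole note, so that 0.25 quarters reads as a sixteenth note. The source does not explain the four zero-length entries; the gracenote column in the schema suggests they may be grace notes, but that is an assumption.

```python
from fractions import Fraction


def quarters_to_whole_fraction(quarters):
    """Express a duration given in quarter notes as a fraction of a whole note."""
    # limit_denominator guards against float noise in values such as 1/3.
    return Fraction(quarters).limit_denominator(64) / 4


# The distinct duration_float values from the distribution above.
observed = [0.0, 0.25, 0.5, 0.75, 1.0, 2.0]
note_values = {q: quarters_to_whole_fraction(q) for q in observed}
```

Here 0.25 maps to 1/16, 0.5 to 1/8, 0.75 to 3/16 (a dotted eighth), 1.0 to 1/4 and 2.0 to 1/2.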

Comparing Loader Outputs

While all loaders produce the same number of notes, there can be subtle differences in how they interpret the score. Let’s compare the IDs each loader assigns to the first five notes:

# Compare ID schemes across loaders
pd.DataFrame(
    {
        "TSV_id": tsv_df["id"].head(5).values,
        "TSV_name": tsv_df["name"].head(5).values,
        "Partitura_id": partitura_df["id"].head(5).values,
        "Music21_id": music21_df["id"].head(5).values,
    }
)
TSV_id TSV_name Partitura_id Music21_id
0 note:000001 B3 note:000001 note:000001
1 note:000002 E2 note:000002 note:000002
2 note:000003 E2 note:000003 note:000003
3 note:000004 G#3 note:000004 note:000004
4 note:000005 E4 note:000005 note:000005
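
Matching IDs and counts do not prove that the note content is identical. One way to look for differences is to compare the multiset of (name, duration_float) pairs from two loaders. The helpers below are a hypothetical sketch, not part of TimeToAlign!; they accept anything indexable by column name, so the same code works on the DataFrames above or on the toy dictionaries used here. The toy values are made up and say nothing about how the real loaders differ.

```python
from collections import Counter


def note_multiset(columns):
    """Count (name, duration_float) pairs; works on a DataFrame or a dict of lists."""
    return Counter(zip(columns["name"], columns["duration_float"]))


def multiset_diff(left, right):
    """Return the pairs that occur more often on one side than on the other."""
    a, b = note_multiset(left), note_multiset(right)
    return {"only_left": dict(a - b), "only_right": dict(b - a)}


# Toy data standing in for two loaders' output (values are made up).
toy_a = {"name": ["B3", "E2", "E2"], "duration_float": [0.5, 0.25, 1.0]}
toy_b = {"name": ["B3", "E2", "E2"], "duration_float": [0.5, 0.25, 0.75]}
diff = multiset_diff(toy_a, toy_b)
```

With the real data, a call such as multiset_diff(tsv_df, partitura_df) would return two empty dictionaries if both loaders agree on every pitch name and duration.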

Unit Metadata

TimeToAlign! stores unit information in the PyArrow schema metadata, so each temporal column carries its unit with it and downstream code does not have to guess:

# Extract unit metadata for temporal columns
temporal_cols = ["start", "end", "duration"]
{
    field.name: field.metadata.get(b"unit", b"(unknown)").decode()
    for field in notes_store.table.schema
    if field.name in temporal_cols and field.metadata
}
{'start': 'quarters', 'end': 'quarters', 'duration': 'quarters'}

Voice and Staff Information

In scores with multiple staves or multiple voices per staff, notes are distributed accordingly:

# Notes by staff and voice
tsv_df.groupby(["staff", "voice"]).size().unstack(fill_value=0)
voice 1 2 3
staff
1 112 50 153
2 142 2 39

Summary

In this tutorial, we learned:

  1. The Loader Pattern: Create -> Load -> Access Store
  2. Three Score Loaders: TSVLoader, PartituraLoader, Music21Loader
  3. EventStore: PyArrow-backed, high-performance event storage
  4. Harmonized Schema: Consistent columns across all loaders
  5. Cross-Validation: Same piece from different sources yields same note count

Key Takeaway: Loaders normalize heterogeneous formats into a consistent EventStore, enabling downstream processing without format-specific code.

Next Steps

  • 03_conversion_maps.ipynb: Learn how to convert between coordinate systems
  • 04_building_timelines.ipynb: Create Timeline objects from EventStores

Exercise: Load Another Score

Task: Load the Beethoven String Quartet from beethoven_op18.mid and analyze its structure.

Hints:

  1. Use PartituraLoader for MIDI files
  2. Check how many parts are in the score
  3. Count notes per part

Solution
# Load the Beethoven quartet
loader = PartituraLoader()
loader.load(DATA_DIR / "beethoven_op18.mid")

df = loader.store.notes.to_dataframe()
{"total_notes": len(df), "notes_per_part": df.groupby("part_id").size().to_dict()}
# Your solution here