# Data Management
The Leaderboard system uses a hybrid data storage approach for flexibility and maintainability.
## Data Repository Structure

```text
data-repo/
├── config.yaml        # Configuration
├── theme.css          # Custom theme (optional)
├── .leaderboard.db    # LibSQL database
├── contributors/      # Contributor profiles (Markdown)
│   ├── alice.md
│   ├── bob.md
│   └── charlie.md
└── activities/        # Activity records (JSONL)
    ├── alice.jsonl
    ├── bob.jsonl
    └── charlie.jsonl
```
## Storage Strategies
| Data Type | Format | Import | Export | Rationale |
|---|---|---|---|---|
| Contributors | Markdown + YAML | ✅ | ✅ | Human-editable profiles with rich bio content |
| Activity Definitions | Database only | ❌ | ❌ | Managed by plugins, no manual editing needed |
| Activities | Sharded JSONL | ✅ | ✅ | Efficient for large datasets, easy per-user updates |
## Contributor Profiles

### Format

Markdown files with YAML frontmatter:
```markdown
---
username: alice
name: Alice Smith
role: core
title: Senior Engineer
avatar_url: https://github.com/alice.png
social_profiles:
  github: https://github.com/alice
  linkedin: https://linkedin.com/in/alice
joining_date: 2020-03-15
meta:
  team: backend
  timezone: PST
---

Alice is a senior engineer specializing in backend systems and distributed computing.
She has been contributing to the project since its inception and leads the infrastructure team.

## Expertise

- Kubernetes
- PostgreSQL
- High-performance system design
```
### Schema

Frontmatter fields:

- `username` (string, required) - Unique identifier
- `name` (string) - Full name
- `role` (string) - Role identifier (must match `config.yaml`)
- `title` (string) - Job title or designation
- `avatar_url` (string) - Profile picture URL
- `social_profiles` (object) - Social media links
- `joining_date` (string, YYYY-MM-DD) - Join date
- `meta` (object) - Custom metadata
Content: Bio in Markdown format
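The import step splits the frontmatter from the bio before inserting the profile. A minimal sketch of that split (`parseProfile` is a hypothetical helper, not part of the published API, and only handles flat `key: value` pairs; the real importer would use a proper YAML parser for nested fields such as `social_profiles`):

```typescript
// Hypothetical helper: split a contributor .md file into frontmatter fields
// and the Markdown bio. Only flat `key: value` pairs are parsed here;
// nested keys (indented lines) are skipped and need a real YAML parser.
function parseProfile(source: string): { fields: Record<string, string>; bio: string } {
  const match = source.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) throw new Error("missing frontmatter");
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    // A flat `key: value` pair starts at column 0; indented lines are nested.
    if (idx > 0 && !line.startsWith(" ")) {
      fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }
  return { fields, bio: match[2].trim() };
}

const profile = parseProfile(
  "---\nusername: alice\nrole: core\n---\nAlice is a senior engineer.",
);
// profile.fields.username === "alice"
// profile.bio === "Alice is a senior engineer."
```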
### Manual Editing

Contributors can be added or edited manually:

- Create or edit the file `contributors/<username>.md`
- Write frontmatter and bio
- Commit and push changes
- The next build will include the updates
## Activities

### Format

JSON Lines (JSONL), one file per contributor:

```jsonl
{"slug":"alice-pr-123","contributor":"alice","activity_definition":"pr_merged","title":"Fix auth bug","occured_at":"2024-01-15T10:30:00Z","link":"https://github.com/org/repo/pull/123","points":10}
{"slug":"alice-issue-45","contributor":"alice","activity_definition":"issue_opened","title":"Add rate limiting","occured_at":"2024-01-16T14:20:00Z","link":"https://github.com/org/repo/issues/45","points":5}
```
### Schema

Each line is a JSON object with:

- `slug` (string, required) - Unique identifier
- `contributor` (string, required) - Username
- `activity_definition` (string, required) - Activity type slug
- `title` (string) - Activity title
- `occured_at` (string, required) - ISO 8601 timestamp
- `link` (string) - URL to activity
- `text` (string) - Additional text/description
- `points` (number) - Points awarded
- `meta` (object) - Custom metadata
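Since every line must carry the required fields, malformed records can be rejected before they reach the database. A sketch of such a guard (`parseActivityLine` is a hypothetical helper, not part of the published API; note the field name is spelled `occured_at` in this schema):

```typescript
// Required string fields for an activity record, per the schema above.
const REQUIRED = ["slug", "contributor", "activity_definition", "occured_at"] as const;

// Parse one JSONL line and verify the required fields are non-empty strings.
// Throws with a descriptive message on failure; returns the record on success.
function parseActivityLine(line: string): Record<string, unknown> {
  const record = JSON.parse(line) as Record<string, unknown>;
  for (const field of REQUIRED) {
    if (typeof record[field] !== "string" || record[field] === "") {
      throw new Error(`missing required field: ${field}`);
    }
  }
  return record;
}
```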
### Sharding

Activities are sharded by contributor for:

- **Easier updates**: Only the affected user's file is touched
- **Better performance**: Parallel import/export
- **Smaller diffs**: Git history is cleaner
- **Scalability**: Works with thousands of contributors
## Activity Definitions

### Format

SQLite table in `.leaderboard.db`:

```sql
CREATE TABLE activity_definition (
  slug VARCHAR PRIMARY KEY,
  name VARCHAR NOT NULL,
  description TEXT NOT NULL,
  points SMALLINT,
  icon VARCHAR
);
```
### Management

Activity definitions are:

- Created by plugins during the `setup()` phase
- Persisted in the database file
- Not exported to separate files
- Recreated if the database is deleted
Example:

```typescript
await ctx.db.execute(`
  INSERT OR IGNORE INTO activity_definition
    (slug, name, description, points, icon)
  VALUES
    ('pr_merged', 'PR Merged', 'Pull request was merged', 10, 'git-merge')
`);
```
## Database

### LibSQL

**Technology**: SQLite-compatible database with modern features

**Location**: `${DATA_DIR}/.leaderboard.db`

**Purpose**:

- Primary data store during the build process
- Efficient querying for SSG
- Persistent storage for activity definitions
### Access

The database is accessed by:

- **Plugin Runner**: Writes data during scraping
- **Next.js Build**: Reads data for SSG
### Connection

```typescript
import { createDatabase } from '@ohcnetwork/leaderboard-api';

const db = createDatabase('file:./data/.leaderboard.db');
```
## Data Lifecycle

### Import Phase

```mermaid
graph LR
  MD[contributors/*.md] --> DB[(Database)]
  JSONL[activities/*.jsonl] --> DB
```
Process:

- Read all Markdown files from `contributors/`
- Parse YAML frontmatter and content
- Insert into the `contributor` table
- Read all JSONL files from `activities/`
- Parse each line as JSON
- Insert into the `activity` table
### Plugin Phase

```mermaid
graph LR
  Plugin[Plugin] --> Setup[setup]
  Setup --> AD[activity_definition]
  Plugin --> Scrape[scrape]
  Scrape --> API[External API]
  API --> Activity[activity table]
```
Process:

- Execute `setup()` for all plugins
- Plugins populate the `activity_definition` table
- Execute `scrape()` for all plugins
- Plugins fetch data from external APIs
- Plugins insert into the `activity` table
### Export Phase

```mermaid
graph LR
  DB[(Database)] --> MD[contributors/*.md]
  DB --> JSONL[activities/*.jsonl]
```
Process:

- Read all contributors from the database
- Serialize to Markdown with frontmatter
- Write to `contributors/<username>.md`
- Read all activities from the database
- Group by contributor
- Serialize to JSONL
- Write to `activities/<username>.jsonl`
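The group-and-serialize steps amount to bucketing flat activity rows by contributor and joining each bucket into a JSONL string. A sketch (`shardActivities` is a hypothetical name; the field names follow the schema in this document):

```typescript
// Minimal shape of an activity row for this sketch.
interface Activity {
  slug: string;
  contributor: string;
  [key: string]: unknown;
}

// Group flat activity rows by contributor and serialize each group to a
// JSONL string, ready to write to activities/<username>.jsonl.
function shardActivities(rows: Activity[]): Map<string, string> {
  const shards = new Map<string, string[]>();
  for (const row of rows) {
    const lines = shards.get(row.contributor) ?? [];
    lines.push(JSON.stringify(row)); // one JSON object per line
    shards.set(row.contributor, lines);
  }
  // Join each group into JSONL with a trailing newline.
  return new Map([...shards].map(([user, lines]) => [user, lines.join("\n") + "\n"]));
}
```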
### Build Phase

```mermaid
graph LR
  DB[(Database)] --> Next[Next.js SSG]
  Next --> Static[Static Site]
```
Process:

- Next.js reads from the LibSQL database
- Static pages are generated with the data
- HTML/CSS/JS files are output
## Backup & Recovery

### Full Backup

```bash
# Backup the entire data repository
tar -czf data-backup-$(date +%Y%m%d).tar.gz data-repo/
```
### Selective Backup

```bash
# Backup only contributors
tar -czf contributors-$(date +%Y%m%d).tar.gz data-repo/contributors/

# Backup only activities
tar -czf activities-$(date +%Y%m%d).tar.gz data-repo/activities/
```
### Recovery

```bash
# Restore from backup
tar -xzf data-backup-20240101.tar.gz
```
### Git History

Since the data lives in Git:

```bash
# Revert a file to its previous state
git checkout HEAD~1 contributors/alice.md

# View history
git log --follow contributors/alice.md
```
## Data Migration

### From Legacy Format

If migrating from a different system:

```javascript
// migration.js
import { createDatabase, contributorQueries } from '@ohcnetwork/leaderboard-api';
import { exportContributors } from '@leaderboard/plugin-runner';

const db = createDatabase(':memory:');

// Import legacy data (legacyData and logger are loaded/defined elsewhere)
for (const legacyUser of legacyData) {
  await contributorQueries.upsert(db, {
    username: legacyUser.login,
    name: legacyUser.name,
    // ... map fields
  });
}

// Export to the new format
await exportContributors(db, './data-repo', logger);
```
### Between Formats

To change the activity format:

- Import activities into the database
- Transform in the database
- Export to the new format
## Data Validation

### Automatic Validation

The plugin runner validates:

- Markdown frontmatter structure
- JSON validity in JSONL files
- Foreign key references
- Required fields
### Manual Validation

Check contributor profiles:

```bash
# Check for files missing a username
grep -L "^username:" contributors/*.md

# Validate YAML frontmatter (extract it so the Markdown body is ignored)
for f in contributors/*.md; do
  yq --front-matter=extract e '.' "$f" > /dev/null || echo "Invalid: $f"
done
```

Check activity files:

```bash
# Validate JSON lines
for f in activities/*.jsonl; do
  jq empty "$f" 2>/dev/null || echo "Invalid JSON in: $f"
done
```
## Performance Considerations

### Import Performance

- **Parallel processing**: Contributors and activities are imported concurrently
- **Streaming**: Large JSONL files are processed line by line
- **Batch inserts**: Multiple activities are inserted per transaction
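Because shards are independent, parallelism is easy to add on top: process files in bounded-concurrency waves and sum the results. A sketch with hypothetical names (`importAllShards` and an injected per-shard importer):

```typescript
// Import all activity shards with bounded concurrency.
// `importShard` is injected: it imports one file and resolves to its row count.
// Hypothetical sketch, not the project's actual API.
async function importAllShards(
  files: string[],
  importShard: (path: string) => Promise<number>,
  concurrency = 8,
): Promise<number> {
  let total = 0;
  // Process in waves of `concurrency` to bound open file handles.
  for (let i = 0; i < files.length; i += concurrency) {
    const wave = files.slice(i, i + concurrency);
    const counts = await Promise.all(wave.map(importShard));
    total += counts.reduce((a, b) => a + b, 0);
  }
  return total;
}
```

A fixed wave size is a simple way to cap concurrency; a work-queue pool would keep all slots busy but adds complexity.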
### Export Performance

- **Sharded output**: One file per contributor enables parallel export
- **Incremental updates**: Only modified files are written
- **Efficient queries**: Database indexes optimize reads
### Database Size

Typical sizes:

- Contributors: ~1 KB per contributor
- Activities: ~200 bytes per activity
- Database overhead: ~100 KB

Example: 100 contributors × 1,000 activities each = 100,000 activities ≈ 20 MB database
## Best Practices

- **Commit frequently**: Small commits are easier to review
- **Use branches**: Test changes before merging to main
- **Validate before committing**: Check data integrity
- **Back up regularly**: Automate backups in CI/CD
- **Monitor database size**: Archive old activities if needed
- **Document custom fields**: Explain `meta` field usage