tfrere HF Staff committed on
Commit
2b16052
·
1 Parent(s): aaaea48

feat: add txt/docx export scripts and fix MDX angle bracket parsing

- Add export-txt.mjs and export-docx.mjs for alternative export formats
- Fix MDX parser error by escaping angle brackets before numbers (e.g., <30B → <30B)
- Update article content from Notion
- Minor improvements to export-pdf.mjs and screenshot-elements.mjs

.gitignore CHANGED
@@ -43,3 +43,4 @@ app/public/data/**/*
43
  .temp-*/
44
  .backup-*/
45
 
 
 
43
  .temp-*/
44
  .backup-*/
45
 
46
+ *.docx
app/package-lock.json CHANGED
Binary files a/app/package-lock.json and b/app/package-lock.json differ
 
app/package.json CHANGED
Binary files a/app/package.json and b/app/package.json differ
 
app/scripts/README-TXT-EXPORT.md ADDED
@@ -0,0 +1,129 @@
1
+ # TXT Export for Book Publishing
2
+
3
+ This script exports the article to a simple text format suitable for book publishing software, with custom tags for special elements.
4
+
5
+ ## Usage
6
+
7
+ ```bash
8
+ npm run export:txt
9
+ ```
10
+
11
+ Or with custom filename:
12
+
13
+ ```bash
14
+ node scripts/export-txt.mjs --filename=my-article
15
+ ```
16
+
17
+ ## Output
18
+
19
+ The script generates a `.txt` file in the `dist/` folder with the following format:
20
+
21
+ ### Text Tags
22
+
23
+ #### Figures/Images
24
+ ```
25
+ <f> NAME | ANCHOR | DESCRIPTION </f>
26
+ ```
27
+ - **NAME**: Figure name (e.g., "Figure 1")
28
+ - **ANCHOR**: HTML anchor/ID for cross-references
29
+ - **DESCRIPTION**: Figure caption/description
30
+
31
+ Example:
32
+ ```
33
+ <f>Figure 1 | placeholder-image | A placeholder image description</f>
34
+ ```
35
+
36
+ #### Tables
37
+ ```
38
+ <t> NAME | ANCHOR | DESCRIPTION </t>
39
+ ```
40
+ - **NAME**: Table name (e.g., "Table 1")
41
+ - **ANCHOR** / **DESCRIPTION**: Optional anchor ID and table caption/description; each is omitted when empty
42
+
43
+ Example:
44
+ ```
45
+ <t>Table 1 | Comparison of model architectures</t>
46
+ ```
47
+
48
+ #### Code Blocks
49
+ ```
50
+ <c> CODE | DESCRIPTION </c>
51
+ ```
52
+ - **CODE**: The actual code content
53
+ - **DESCRIPTION**: Optional description or caption
54
+
55
+ Example:
56
+ ```
57
+ <c>function hello() {
58
+ console.log("Hello world");
59
+ } | JavaScript example function</c>
60
+ ```
61
+
62
+ #### Inline Code
63
+ ```
64
+ <ic> CODE </ic>
65
+ ```
66
+ Example:
67
+ ```
68
+ Use the <ic>npm install</ic> command to install dependencies.
69
+ ```
70
+
71
+ #### LaTeX Formulas
72
+ ```
73
+ <l> katex-number </l>
74
+ ```
75
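+ References to exported KaTeX formula PNGs. The number is the formula's global index among all visual elements (embeds, tables, images, figures, display formulas) in DOM order, so formula numbers are not necessarily consecutive.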
76
+
77
+ Example:
78
+ ```
79
+ The equation <l>katex-1</l> shows the relationship...
80
+ ```
81
+
82
+ The corresponding PNG files should be exported separately (e.g., `katex-1.png`, `katex-2.png`, etc.)
83
+
84
+ ## Standard Markdown Elements
85
+
86
+ The script also preserves standard markdown formatting:
87
+
88
+ - **Headings**: `# ## ###` etc.
89
+ - **Paragraphs**: Plain text with line breaks
90
+ - **Lists**: Bulleted (`-`) and numbered (`1. 2. 3.`)
91
+ - **Blockquotes**: `> Text`
92
+
93
+ ## How It Works
94
+
95
+ 1. **Build**: Builds the Astro site (if not already built)
96
+ 2. **Launch**: Starts a preview server
97
+ 3. **Extract**: Uses Playwright to load the page and extract content from the DOM
98
+ 4. **Convert**: Transforms HTML elements into the custom tag format
99
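+ 5. **Export**: Writes the result to `dist/<name>.txt`, where `<name>` is the `--filename` value or, by default, a slug of the article title, and copies it to `public/`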
100
+
101
+ ## Example Output
102
+
103
+ ```
104
+ # Introduction
105
+
106
+ This is a paragraph with <ic>inline code</ic> and a reference to <l>katex-1</l>.
107
+
108
+ <f>Figure 1 | training-loss | Training loss over time for SmolLM3</f>
109
+
110
+ ## Methods
111
+
112
+ We used the following approach:
113
+
114
+ - First step
115
+ - Second step
116
+ - Third step
117
+
118
+ <c>def train_model():
119
+ return model | Python training function</c>
120
+
121
+ <t>Table 1 | Hyperparameters used in training</t>
122
+ ```
123
+
124
+ ## Notes
125
+
126
+ - The script reuses the same infrastructure as PDF export (`export-pdf.mjs`)
127
+ - It's designed to work with the existing Astro build pipeline
128
+ - All custom components (Image, HtmlEmbed, Note, etc.) are properly handled
129
+ - KaTeX formulas are numbered sequentially for easy reference to exported PNGs
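Downstream tooling that consumes the exported TXT may want to split the block tags back into fields. The following is a minimal sketch and not part of this commit: `parseBlockTag` is a hypothetical helper, and it assumes fields are joined with ` | `, which is what `export-txt.mjs` does with `parts.join(' | ')`.

```javascript
// Hypothetical helper (not in the commit): split a <f>...</f> or <t>...</t>
// line into its fields. Assumes the " | " separator used by export-txt.mjs.
function parseBlockTag(line) {
  const match = line.trim().match(/^<(f|t)>([\s\S]*)<\/\1>$/);
  if (!match) return null; // not a figure or table tag
  const fields = match[2].split(' | ').map((field) => field.trim());
  return { tag: match[1], fields };
}

console.log(parseBlockTag('<t>Table 1 | Hyperparameters used in training</t>'));
console.log(parseBlockTag('A regular paragraph'));
```

Because the anchor and the description are both optional, a two-field tag can mean name plus anchor or name plus description; a consumer has to decide how to tell them apart. A description that itself contains ` | ` would also be split by this sketch.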
app/scripts/export-docx.mjs ADDED
@@ -0,0 +1,303 @@
1
+ #!/usr/bin/env node
2
+
3
+ /**
4
+ * Export TXT to DOCX format for book publishing
5
+ *
6
+ * This script converts the exported TXT file to a simple DOCX document:
7
+ * - Preserves headings, paragraphs, lists
8
+ * - Keeps custom tags (<f>, <t>, <l>, <ic>, <il>, <n>) as-is for manual processing
9
+ * - Formats code blocks
10
+ * - Creates a clean document ready for further editing
11
+ *
12
+ * Usage:
13
+ * node scripts/export-docx.mjs [--input=path/to/file.txt]
14
+ * npm run export:docx
15
+ */
16
+
17
+ import { Document, Packer, Paragraph, TextRun, HeadingLevel, AlignmentType } from 'docx';
18
+ import { promises as fs } from 'node:fs';
19
+ import { resolve } from 'node:path';
20
+ import process from 'node:process';
21
+
22
+ function parseArgs(argv) {
23
+ const out = {};
24
+ for (const arg of argv.slice(2)) {
25
+ if (!arg.startsWith('--')) continue;
26
+ const [k, v] = arg.replace(/^--/, '').split('=');
27
+ out[k] = v === undefined ? true : v;
28
+ }
29
+ return out;
30
+ }
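One detail of `parseArgs` worth noting: `split('=')` followed by `[k, v]` destructuring keeps only the text between the first and second `=`, so `--input=a=b.txt` would yield `a`. A possible variant, shown here as a sketch under the assumption that values may contain `=` (the name `parseArgsFirstEquals` is hypothetical), splits on the first `=` only.

```javascript
// Hypothetical variant of parseArgs: split each --key=value on the first "=" only,
// so values that contain "=" are preserved.
function parseArgsFirstEquals(argv) {
  const out = {};
  for (const arg of argv.slice(2)) {
    if (!arg.startsWith('--')) continue;
    const body = arg.slice(2);
    const eq = body.indexOf('=');
    if (eq === -1) {
      out[body] = true; // bare flag such as --verbose
    } else {
      out[body.slice(0, eq)] = body.slice(eq + 1);
    }
  }
  return out;
}

console.log(parseArgsFirstEquals(['node', 'export-docx.mjs', '--input=a=b.txt', '--dry']));
```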
31
+
32
+ function detectHeadingLevel(line) {
33
+ const match = line.match(/^(#{1,6})\s+(.+)$/);
34
+ if (!match) return null;
35
+ const level = match[1].length;
36
+ const text = match[2].trim();
37
+ return { level, text };
38
+ }
39
+
40
+ function parseInlineFormatting(text) {
41
+ const runs = [];
42
+ let currentPos = 0;
43
+
44
+ // Parse inline tags: <ic>, <il>, <n> (keep as-is with special formatting)
45
+ const tagRegex = /<(ic|il|n)>([^<]*)<\/\1>/g;
46
+ let match;
47
+
48
+ while ((match = tagRegex.exec(text)) !== null) {
49
+ // Add text before the tag
50
+ if (match.index > currentPos) {
51
+ const beforeText = text.substring(currentPos, match.index);
52
+ if (beforeText) {
53
+ runs.push(new TextRun(beforeText));
54
+ }
55
+ }
56
+
57
+ // Add the tagged content with special formatting
58
+ const tagType = match[1];
59
+ const content = match[2];
60
+
61
+ if (tagType === 'ic') {
62
+ // Inline code: monospace, gray background
63
+ runs.push(new TextRun({
64
+ text: content,
65
+ font: 'Courier New',
66
+ color: '333333',
67
+ shading: { fill: 'E8E8E8', type: 'clear' }
68
+ }));
69
+ } else if (tagType === 'il') {
70
+ // Inline LaTeX: italic, keep as-is
71
+ runs.push(new TextRun({
72
+ text: content,
73
+ italics: true,
74
+ color: '0066CC'
75
+ }));
76
+ } else if (tagType === 'n') {
77
+ // Note: keep tag for manual processing
78
+ runs.push(new TextRun({
79
+ text: `<n>${content}</n>`,
80
+ color: 'FF6B00',
81
+ italics: true
82
+ }));
83
+ }
84
+
85
+ currentPos = match.index + match[0].length;
86
+ }
87
+
88
+ // Add remaining text
89
+ if (currentPos < text.length) {
90
+ runs.push(new TextRun(text.substring(currentPos)));
91
+ }
92
+
93
+ return runs.length > 0 ? runs : [new TextRun(text)];
94
+ }
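To see how the tag regex in `parseInlineFormatting` segments a line, here is a dependency-free sketch that uses the same regex but returns plain objects instead of `docx` `TextRun` instances. `tokenizeInline` is a hypothetical name used only for illustration.

```javascript
// Hypothetical illustration: same regex and scanning loop as parseInlineFormatting,
// but producing plain { type, value } tokens so the segmentation is easy to inspect.
function tokenizeInline(text) {
  const tagRegex = /<(ic|il|n)>([^<]*)<\/\1>/g;
  const tokens = [];
  let pos = 0;
  let match;
  while ((match = tagRegex.exec(text)) !== null) {
    if (match.index > pos) {
      tokens.push({ type: 'text', value: text.slice(pos, match.index) });
    }
    tokens.push({ type: match[1], value: match[2] });
    pos = match.index + match[0].length;
  }
  if (pos < text.length) {
    tokens.push({ type: 'text', value: text.slice(pos) });
  }
  return tokens;
}

console.log(tokenizeInline('Use the <ic>npm install</ic> command.'));
```

Since the content group is `[^<]*`, a tag whose content contains a literal `<` (for example `<ic>a < b</ic>`) does not match and is emitted as ordinary text, tags included.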
95
+
96
+ async function convertTxtToDocx(txtPath, outputPath) {
97
+ console.log(`📖 Reading TXT file: ${txtPath}`);
98
+ const content = await fs.readFile(txtPath, 'utf-8');
99
+ const lines = content.split('\n');
100
+
101
+ const paragraphs = [];
102
+ let inCodeBlock = false;
103
+ let codeLines = [];
104
+
105
+ for (let i = 0; i < lines.length; i++) {
106
+ const line = lines[i];
107
+
108
+ // Skip empty lines unless in code block
109
+ if (!line.trim() && !inCodeBlock) {
110
+ paragraphs.push(new Paragraph({ text: '' }));
111
+ continue;
112
+ }
113
+
114
+ // Handle code blocks <c>...</c>
115
+ if (line.trim().startsWith('<c>')) {
116
+ inCodeBlock = true;
117
+ codeLines = [];
118
+ const firstLine = line.replace(/^<c>\s*/, '');
119
+ if (firstLine && !firstLine.startsWith('</c>')) {
120
+ codeLines.push(firstLine);
121
+ }
122
+ continue;
123
+ }
124
+
125
+ if (line.trim().endsWith('</c>')) {
126
+ const lastLine = line.replace(/<\/c>\s*$/, '');
127
+ if (lastLine) codeLines.push(lastLine);
128
+
129
+ // Add code block as paragraph(s)
130
+ if (codeLines.length > 0) {
131
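+ paragraphs.push(new Paragraph({
132
+ // font and size are run-level options, so they are set on TextRuns, one per code line
133
+ children: codeLines.map((codeLine, idx) => new TextRun({
134
+ text: codeLine, font: 'Courier New', size: 20, break: idx > 0 ? 1 : 0 })),
135
+ shading: { fill: 'F5F5F5', type: 'clear' },
136
+ spacing: { before: 200, after: 200 }
137
+ }));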
138
+ }
139
+
140
+ inCodeBlock = false;
141
+ codeLines = [];
142
+ continue;
143
+ }
144
+
145
+ if (inCodeBlock) {
146
+ codeLines.push(line);
147
+ continue;
148
+ }
149
+
150
+ // Handle figure tags <f>...</f>
151
+ if (line.trim().startsWith('<f>')) {
152
+ paragraphs.push(new Paragraph({
153
+ children: [new TextRun({
154
+ text: line.trim(),
155
+ color: '0066CC',
156
+ bold: true
157
+ })],
158
+ spacing: { before: 200, after: 100 }
159
+ }));
160
+ continue;
161
+ }
162
+
163
+ // Handle table tags <t>...</t>
164
+ if (line.trim().startsWith('<t>')) {
165
+ paragraphs.push(new Paragraph({
166
+ children: [new TextRun({
167
+ text: line.trim(),
168
+ color: '009688',
169
+ bold: true
170
+ })],
171
+ spacing: { before: 200, after: 100 }
172
+ }));
173
+ continue;
174
+ }
175
+
176
+ // Handle LaTeX display tags <l>...</l>
177
+ if (line.trim().startsWith('<l>')) {
178
+ paragraphs.push(new Paragraph({
179
+ children: [new TextRun({
180
+ text: line.trim(),
181
+ color: '9C27B0',
182
+ bold: true
183
+ })],
184
+ alignment: AlignmentType.CENTER,
185
+ spacing: { before: 200, after: 200 }
186
+ }));
187
+ continue;
188
+ }
189
+
190
+ // Handle headings
191
+ const heading = detectHeadingLevel(line);
192
+ if (heading) {
193
+ const headingLevels = {
194
+ 1: HeadingLevel.HEADING_1,
195
+ 2: HeadingLevel.HEADING_2,
196
+ 3: HeadingLevel.HEADING_3,
197
+ 4: HeadingLevel.HEADING_4,
198
+ 5: HeadingLevel.HEADING_5,
199
+ 6: HeadingLevel.HEADING_6
200
+ };
201
+
202
+ paragraphs.push(new Paragraph({
203
+ text: heading.text,
204
+ heading: headingLevels[heading.level],
205
+ spacing: { before: 400, after: 200 }
206
+ }));
207
+ continue;
208
+ }
209
+
210
+ // Handle list items
211
+ if (line.trim().startsWith('- ')) {
212
+ const text = line.trim().substring(2);
213
+ paragraphs.push(new Paragraph({
214
+ children: parseInlineFormatting(text),
215
+ bullet: { level: 0 },
216
+ spacing: { before: 100, after: 100 }
217
+ }));
218
+ continue;
219
+ }
220
+
221
+ // Handle numbered lists
222
+ const numberedMatch = line.trim().match(/^(\d+)\.\s+(.+)$/);
223
+ if (numberedMatch) {
224
+ const text = numberedMatch[2];
225
+ paragraphs.push(new Paragraph({
226
+ children: parseInlineFormatting(text),
227
+ numbering: { reference: 'default-numbering', level: 0 },
228
+ spacing: { before: 100, after: 100 }
229
+ }));
230
+ continue;
231
+ }
232
+
233
+ // Handle blockquotes
234
+ if (line.trim().startsWith('> ')) {
235
+ const text = line.trim().substring(2);
236
+ paragraphs.push(new Paragraph({
237
+ children: parseInlineFormatting(text),
238
+ italics: true,
239
+ indent: { left: 720 },
240
+ spacing: { before: 200, after: 200 }
241
+ }));
242
+ continue;
243
+ }
244
+
245
+ // Regular paragraph
246
+ if (line.trim()) {
247
+ paragraphs.push(new Paragraph({
248
+ children: parseInlineFormatting(line.trim()),
249
+ spacing: { before: 100, after: 100 }
250
+ }));
251
+ }
252
+ }
253
+
254
+ console.log(`📝 Creating DOCX with ${paragraphs.length} paragraphs...`);
255
+
256
+ const doc = new Document({
257
+ sections: [{
258
+ properties: {},
259
+ children: paragraphs
260
+ }]
261
+ });
262
+
263
+ console.log(`💾 Writing DOCX to: ${outputPath}`);
264
+ const buffer = await Packer.toBuffer(doc);
265
+ await fs.writeFile(outputPath, buffer);
266
+
267
+ console.log(`✅ DOCX created successfully!`);
268
+ }
269
+
270
+ async function main() {
271
+ const cwd = process.cwd();
272
+ const args = parseArgs(process.argv);
273
+
274
+ const inputPath = args.input || resolve(cwd, 'dist', 'the-smol-training-playbook-the-secrets-to-building-world-class-llms.txt');
275
+ const outputPath = args.output || inputPath.replace(/\.txt$/i, '.docx');
276
+
277
+ // Check if input exists
278
+ try {
279
+ await fs.access(inputPath);
280
+ } catch {
281
+ console.error(`❌ Error: Input file not found: ${inputPath}`);
282
+ console.error(' Run "npm run export:txt" first to generate the TXT file.');
283
+ process.exit(1);
284
+ }
285
+
286
+ await convertTxtToDocx(inputPath, outputPath);
287
+
288
+ // Also copy to public folder
289
+ const publicPath = outputPath.replace('/dist/', '/public/');
290
+ try {
291
+ await fs.mkdir(resolve(cwd, 'public'), { recursive: true });
292
+ await fs.copyFile(outputPath, publicPath);
293
+ console.log(`✅ DOCX copied to: ${publicPath}`);
294
+ } catch (e) {
295
+ console.warn('Unable to copy DOCX to public/:', e?.message || e);
296
+ }
297
+ }
298
+
299
+ main().catch((err) => {
300
+ console.error('❌ Error:', err.message);
301
+ console.error(err);
302
+ process.exit(1);
303
+ });
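In the conversion loop above, a line that both starts with `<c>` and ends with `</c>` is handled by the opening branch only, so a one-line block such as `<c>x = 1</c>` leaves `inCodeBlock` set until a later line happens to end with `</c>`. The following sketch shows one way to group lines so that one-line blocks close immediately. It is an illustration under that reading of the loop, not the commit's code, and `splitCodeBlocks` is a hypothetical name.

```javascript
// Hypothetical sketch: group lines into code blocks and ordinary lines,
// checking for the closing </c> on the same line that opened the block.
function splitCodeBlocks(lines) {
  const result = [];
  let codeLines = null; // non-null while inside an open <c> block
  for (const line of lines) {
    let rest = line;
    if (codeLines === null) {
      if (!rest.trim().startsWith('<c>')) {
        result.push({ type: 'line', text: line });
        continue;
      }
      codeLines = [];
      rest = rest.replace(/^\s*<c>/, '');
    }
    if (rest.trim().endsWith('</c>')) {
      rest = rest.replace(/<\/c>\s*$/, '');
      if (rest) codeLines.push(rest);
      result.push({ type: 'code', text: codeLines.join('\n') });
      codeLines = null;
    } else {
      codeLines.push(rest);
    }
  }
  // An unterminated block is still emitted rather than silently dropped.
  if (codeLines !== null) result.push({ type: 'code', text: codeLines.join('\n') });
  return result;
}

console.log(splitCodeBlocks(['<c>x = 1</c>', 'A paragraph after the code.']));
```

The optional ` | DESCRIPTION` suffix is left attached to the code text here, as it is in the script; separating it would need a rule for code that itself contains ` | `.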
app/scripts/export-pdf.mjs CHANGED
@@ -246,6 +246,18 @@ iframe, embed, object { width: 100% !important; max-width: 100% !important; heig
246
  .html-embed, .html-embed__card { max-width: 100% !important; width: 100% !important; }
247
  .html-embed__card > div[id^="frag-"] { width: 100% !important; max-width: 100% !important; }
248
 
249
  /* Banner centering */
250
  .hero .points { mix-blend-mode: normal !important; }
251
  .hero-banner, .hero .hero-banner, [class*="hero-banner"] {
@@ -282,8 +294,8 @@ iframe, embed, object { width: 100% !important; max-width: 100% !important; heig
282
  width: auto !important;
283
  height: auto !important;
284
  max-width: 100% !important;
285
- /* Limit height to fit on a single page (~250mm = 945px at 96dpi, minus margins) */
286
- max-height: 800px !important;
287
  display: block !important;
288
  object-fit: contain !important;
289
  margin-left: auto !important;
@@ -727,8 +739,8 @@ async function main() {
727
 
728
  const browser = await chromium.launch({ headless: true });
729
  try {
730
- // Use 2x scale factor for retina-quality screenshots
731
- const deviceScaleFactor = 2;
732
  const context = await browser.newContext({
733
  deviceScaleFactor
734
  });
 
246
  .html-embed, .html-embed__card { max-width: 100% !important; width: 100% !important; }
247
  .html-embed__card > div[id^="frag-"] { width: 100% !important; max-width: 100% !important; }
248
 
249
+ /* Wide mode: remove blur/mask effects for print */
250
+ .wide, .html-embed--wide {
251
+ -webkit-mask: none !important;
252
+ mask: none !important;
253
+ background: transparent !important;
254
+ padding: 0 !important;
255
+ width: 100% !important;
256
+ margin-left: 0 !important;
257
+ transform: none !important;
258
+ border-radius: 0 !important;
259
+ }
260
+
261
  /* Banner centering */
262
  .hero .points { mix-blend-mode: normal !important; }
263
  .hero-banner, .hero .hero-banner, [class*="hero-banner"] {
 
294
  width: auto !important;
295
  height: auto !important;
296
  max-width: 100% !important;
297
+ /* Limit height to fit on a single page (~269mm printable = ~1015px, with margin) */
298
+ max-height: 950px !important;
299
  display: block !important;
300
  object-fit: contain !important;
301
  margin-left: auto !important;
 
739
 
740
  const browser = await chromium.launch({ headless: true });
741
  try {
742
+ // Use 4x scale factor for high-DPI screenshots
743
+ const deviceScaleFactor = 4;
744
  const context = await browser.newContext({
745
  deviceScaleFactor
746
  });
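The pixel figures in the two `max-height` comments follow from the CSS convention of 96 px per inch (25.4 mm). A small sketch of that arithmetic, with `mmToCssPx` as a hypothetical helper name:

```javascript
// CSS defines 1in = 96px and 1in = 25.4mm, so px = mm / 25.4 * 96.
const CSS_PX_PER_INCH = 96;
const MM_PER_INCH = 25.4;

function mmToCssPx(mm) {
  return (mm / MM_PER_INCH) * CSS_PX_PER_INCH;
}

console.log(Math.round(mmToCssPx(250))); // 945, the figure in the old comment
console.log(Math.round(mmToCssPx(269))); // 1017, close to the "~1015px" in the new comment
```

The new `max-height: 950px` therefore leaves roughly 67 px of slack below the 269 mm printable height quoted in the comment.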
app/scripts/export-txt.mjs ADDED
@@ -0,0 +1,527 @@
1
+ #!/usr/bin/env node
2
+
3
+ /**
4
+ * Export article to TXT format for book publishing
5
+ *
6
+ * This script exports the article to a simple text format with custom tags:
7
+ * - <f> NAME | ANCHOR | DESCRIPTION </f> for figures/images
8
+ * - <t> NAME | ANCHOR | DESCRIPTION </t> for tables
9
+ * - <c> CODE | DESCRIPTION </c> for code blocks
10
+ * - <ic> CODE </ic> for inline code
11
+ * - <l> katex-number </l> for LaTeX formulas (references exported PNGs)
12
+ *
13
+ * Usage:
14
+ * node scripts/export-txt.mjs
15
+ * npm run export:txt
16
+ *
17
+ * Output: dist/<slug-of-title>.txt (or dist/<filename>.txt with --filename=...)
18
+ */
19
+
20
+ import { spawn } from 'node:child_process';
21
+ import { setTimeout as delay } from 'node:timers/promises';
22
+ import { chromium } from 'playwright';
23
+ import { resolve } from 'node:path';
24
+ import { promises as fs } from 'node:fs';
25
+ import process from 'node:process';
26
+
27
+ async function run(command, args = [], options = {}) {
28
+ return new Promise((resolvePromise, reject) => {
29
+ const child = spawn(command, args, { stdio: 'inherit', shell: false, ...options });
30
+ child.on('error', reject);
31
+ child.on('exit', (code) => {
32
+ if (code === 0) resolvePromise(undefined);
33
+ else reject(new Error(`${command} ${args.join(' ')} exited with code ${code}`));
34
+ });
35
+ });
36
+ }
37
+
38
+ async function waitForServer(url, timeoutMs = 60000) {
39
+ const start = Date.now();
40
+ while (Date.now() - start < timeoutMs) {
41
+ try {
42
+ const res = await fetch(url);
43
+ if (res.ok) return;
44
+ } catch { }
45
+ await delay(500);
46
+ }
47
+ throw new Error(`Server did not start in time: ${url}`);
48
+ }
49
+
50
+ function parseArgs(argv) {
51
+ const out = {};
52
+ for (const arg of argv.slice(2)) {
53
+ if (!arg.startsWith('--')) continue;
54
+ const [k, v] = arg.replace(/^--/, '').split('=');
55
+ out[k] = v === undefined ? true : v;
56
+ }
57
+ return out;
58
+ }
59
+
60
+ function slugify(text) {
61
+ return String(text || '')
62
+ .normalize('NFKD')
63
+ .replace(/\p{Diacritic}+/gu, '')
64
+ .toLowerCase()
65
+ .replace(/[^a-z0-9]+/g, '-')
66
+ .replace(/^-+|-+$/g, '')
67
+ .slice(0, 120) || 'article';
68
+ }
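As a usage example, the function is repeated here so the snippet runs on its own. The title string is an assumption chosen for illustration; it happens to slugify to the default input filename hard-coded in `export-docx.mjs`.

```javascript
// Copy of slugify from export-txt.mjs, repeated so this example is self-contained.
function slugify(text) {
  return String(text || '')
    .normalize('NFKD')                 // split accented letters into base + combining mark
    .replace(/\p{Diacritic}+/gu, '')   // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')       // collapse every other run of characters into one hyphen
    .replace(/^-+|-+$/g, '')           // trim leading and trailing hyphens
    .slice(0, 120) || 'article';       // fall back to "article" for empty results
}

// Assumed title, for illustration only.
console.log(slugify('The Smol Training Playbook: The Secrets to Building World-Class LLMs'));
console.log(slugify('Élan Vital'));
console.log(slugify(''));
```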
69
+
70
+ /**
71
+ * Clean text content: remove extra whitespace, normalize line breaks
72
+ */
73
+ function cleanText(text) {
74
+ return String(text || '')
75
+ .replace(/\s+/g, ' ')
76
+ .trim();
77
+ }
78
+
79
+ /**
80
+ * Strip HTML tags from text
81
+ */
82
+ function stripHtml(html) {
83
+ return String(html || '')
84
+ .replace(/<[^>]*>/g, '')
85
+ .replace(/&nbsp;/g, ' ')
86
+ .replace(/&lt;/g, '<')
87
+ .replace(/&gt;/g, '>')
88
+ .replace(/&quot;/g, '"')
89
+ .replace(/&#39;/g, "'")
90
+ .replace(/&amp;/g, '&') // decoded last so "&amp;lt;" becomes "&lt;", not "<"
91
+ .trim();
92
+ }
93
+
94
+ /**
95
+ * Convert heading level to markdown syntax
96
+ */
97
+ function headingToMarkdown(level, text) {
98
+ const hashes = '#'.repeat(Math.min(level, 6));
99
+ return `${hashes} ${text}`;
100
+ }
101
+
102
+ /**
103
+ * Extract and convert article content to TXT format
104
+ */
105
+ async function extractArticleContent(page) {
106
+ return await page.evaluate(() => {
107
+ const output = [];
108
+ let globalCounter = 0; // Global counter for all visual elements (matches screenshot script)
109
+ const katexMap = new Map(); // Track unique katex formulas for referencing
110
+
111
+ // Helper: clean text
112
+ const cleanText = (text) => String(text || '').replace(/\s+/g, ' ').trim();
113
+
114
+ // Helper: strip HTML
115
+ const stripHtml = (html) => {
116
+ const div = document.createElement('div');
117
+ div.innerHTML = html;
118
+ return cleanText(div.textContent || '');
119
+ };
120
+
121
+ // Helper: get element ID or generate anchor
122
+ const getAnchor = (el) => {
123
+ if (el.id) return el.id;
124
+ // Try to find ID in parent figure
125
+ const figure = el.closest('figure');
126
+ if (figure?.id) return figure.id;
127
+ return '';
128
+ };
129
+
130
+ // Helper: parse caption to extract name and description
131
+ const parseCaptionText = (captionText, type = 'Figure') => {
132
+ if (!captionText) return { name: '', description: '' };
133
+
134
+ // Try to match patterns like:
135
+ // "Figure 1: Description"
136
+ // "Table 2: Description"
137
+ // "Fig. 3: Description"
138
+ const patterns = [
139
+ new RegExp(`^(${type}\\s*\\d+[a-z]?)\\s*[:\\-–—]\\s*(.+)$`, 'i'),
140
+ new RegExp(`^(Fig\\.?\\s*\\d+[a-z]?)\\s*[:\\-–—]\\s*(.+)$`, 'i'),
141
+ new RegExp(`^(Table\\s*\\d+[a-z]?)\\s*[:\\-–—]\\s*(.+)$`, 'i'),
142
+ ];
143
+
144
+ for (const pattern of patterns) {
145
+ const match = captionText.match(pattern);
146
+ if (match) {
147
+ return { name: match[1].trim(), description: match[2].trim() };
148
+ }
149
+ }
150
+
151
+ // No pattern found, entire text is description
152
+ return { name: '', description: captionText.trim() };
153
+ };
154
+
155
+ // Process main content
156
+ const main = document.querySelector('main');
157
+ if (!main) return 'Error: main element not found';
158
+
159
+ // Helper: get all visual elements in DOM order (same as screenshot script)
160
+ const allVisualElements = Array.from(main.querySelectorAll('.html-embed, .table-scroll > table, .image-wrapper, figure, .katex-display'));
161
+ const elementIndexMap = new Map();
162
+
163
+ // Pre-process: assign global indices to visual elements
164
+ allVisualElements.forEach((el, idx) => {
165
+ elementIndexMap.set(el, idx + 1);
166
+ });
167
+
168
+ // Walk through all child nodes
169
+ const processNode = (node) => {
170
+ const tag = node.tagName?.toLowerCase();
171
+
172
+ // Headings
173
+ if (/^h[1-6]$/.test(tag)) {
174
+ const level = parseInt(tag[1]);
175
+ const text = cleanText(node.textContent);
176
+ const hashes = '#'.repeat(level);
177
+ output.push(`\n${hashes} ${text}\n`);
178
+ return;
179
+ }
180
+
181
+ // Paragraphs
182
+ if (tag === 'p') {
183
+ const text = node.textContent?.trim();
184
+ if (text) {
185
+ // Process inline elements within paragraph
186
+ let processedText = '';
187
+ const processInline = (n) => {
188
+ if (n.nodeType === Node.TEXT_NODE) {
189
+ processedText += n.textContent;
190
+ } else if (n.tagName === 'CODE' && !n.closest('pre')) {
191
+ // Inline code
192
+ const code = cleanText(n.textContent);
193
+ processedText += `<ic>${code}</ic>`;
194
+ } else if (n.classList?.contains('katex')) {
195
+ // Inline katex - wrap in <il> tags
196
+ const formula = cleanText(n.textContent || '');
197
+ processedText += `<il>${formula}</il>`;
198
+ } else if (n.childNodes) {
199
+ n.childNodes.forEach(processInline);
200
+ }
201
+ };
202
+
203
+ node.childNodes.forEach(processInline);
204
+ output.push(processedText.trim() + '\n');
205
+ }
206
+ return;
207
+ }
208
+
209
+ // Display math (KaTeX)
210
+ if (node.classList?.contains('katex-display')) {
211
+ const globalIndex = elementIndexMap.get(node);
212
+ if (globalIndex) {
213
+ output.push(`<l>katex-${globalIndex}</l>\n`);
214
+ }
215
+ return;
216
+ }
217
+
218
+ // Code blocks
219
+ if (tag === 'pre') {
220
+ const code = node.querySelector('code');
221
+ if (code) {
222
+ const codeText = code.textContent || '';
223
+ const language = code.className.match(/language-(\w+)/)?.[1] || '';
224
+
225
+ // Try to find description from parent or next sibling
226
+ let description = '';
227
+ const figure = node.closest('figure');
228
+ if (figure) {
229
+ const caption = figure.querySelector('figcaption');
230
+ if (caption) description = stripHtml(caption.innerHTML);
231
+ }
232
+
233
+ if (description) {
234
+ output.push(`<c>${codeText.trim()} | ${description}</c>\n`);
235
+ } else {
236
+ output.push(`<c>${codeText.trim()}</c>\n`);
237
+ }
238
+ }
239
+ return;
240
+ }
241
+
242
+ // Tables
243
+ if (tag === 'table') {
244
+ // Check if this table is in a .table-scroll container (visual element)
245
+ const tableScroll = node.closest('.table-scroll');
246
+ const globalIndex = tableScroll ? elementIndexMap.get(node) : null;
247
+
248
+ // Skip tables that are not tracked visual elements (not inside .table-scroll)
249
+ if (!globalIndex) {
250
+ return;
251
+ }
252
+
253
+ const figure = node.closest('figure');
254
+ let name = '';
255
+ let description = '';
256
+ let anchor = '';
257
+
258
+ if (figure) {
259
+ anchor = getAnchor(figure);
260
+ const caption = figure.querySelector('figcaption');
261
+ if (caption) {
262
+ const captionText = stripHtml(caption.innerHTML);
263
+ const parsed = parseCaptionText(captionText, 'Table');
264
+ name = parsed.name;
265
+ description = parsed.description;
266
+ }
267
+ }
268
+
269
+ // If no name found, generate one with global index (matching filename format)
270
+ if (!name) {
271
+ name = `table-${globalIndex}`;
272
+ }
273
+
274
+ // Build the tag
275
+ const parts = [name];
276
+ if (anchor) parts.push(anchor);
277
+ if (description) parts.push(description);
278
+
279
+ output.push(`<t>${parts.join(' | ')}</t>\n`);
280
+
281
+ // Extract table as simple text representation
282
+ const rows = Array.from(node.querySelectorAll('tr'));
283
+ const tableText = rows.map(row => {
284
+ const cells = Array.from(row.querySelectorAll('th, td'));
285
+ return cells.map(cell => cleanText(cell.textContent)).join(' | ');
286
+ }).join('\n');
287
+
288
+ output.push(tableText + '\n\n');
289
+ return;
290
+ }
291
+
292
+ // Figures (images, embeds)
293
+ if (tag === 'figure') {
294
+ const img = node.querySelector('img');
295
+ const htmlEmbed = node.querySelector('.html-embed, .html-embed--screenshot');
296
+ const imageWrapper = node.querySelector('.image-wrapper');
297
+ const caption = node.querySelector('figcaption');
298
+
299
+ // Skip if it's not really a figure (no img, no embed, no caption)
300
+ if (!img && !htmlEmbed && !imageWrapper && !caption) return;
301
+
302
+ // Try to find the global index from the visual element
303
+ const visualElement = htmlEmbed || imageWrapper || node;
304
+ const globalIndex = elementIndexMap.get(visualElement);
305
+
306
+ if (!globalIndex) return; // Skip if not tracked
307
+
308
+ let name = '';
309
+ let anchor = getAnchor(node);
310
+ let description = '';
311
+
312
+ if (caption) {
313
+ const captionText = stripHtml(caption.innerHTML);
314
+ const parsed = parseCaptionText(captionText, 'Figure');
315
+ name = parsed.name;
316
+ description = parsed.description;
317
+ }
318
+
319
+ // Get image alt text as fallback for description
320
+ if (!description && img?.alt) {
321
+ description = img.alt;
322
+ }
323
+
324
+ // If no name found in caption, generate one with global index (matching filename format)
325
+ if (!name) {
326
+ // Determine type for naming (matches screenshot script naming)
327
+ const type = htmlEmbed ? 'embed' : 'image';
328
+ name = `${type}-${globalIndex}`;
329
+ }
330
+
331
+ // Build the tag: <f> NAME | ANCHOR | DESCRIPTION </f>
332
+ const parts = [name];
333
+ if (anchor) parts.push(anchor);
334
+ if (description) parts.push(description);
335
+
336
+ output.push(`<f>${parts.join(' | ')}</f>\n\n`);
337
+ return;
338
+ }
339
+
340
+ // Lists
341
+ if (tag === 'ul' || tag === 'ol') {
342
+ const items = Array.from(node.querySelectorAll(':scope > li'));
343
+ items.forEach((item, idx) => {
344
+ const bullet = tag === 'ul' ? '-' : `${idx + 1}.`;
345
+ const text = cleanText(item.textContent);
346
+ output.push(`${bullet} ${text}\n`);
347
+ });
348
+ output.push('\n');
349
+ return;
350
+ }
351
+
352
+ // Blockquotes
353
+ if (tag === 'blockquote') {
354
+ const text = cleanText(node.textContent);
355
+ output.push(`> ${text}\n\n`);
356
+ return;
357
+ }
358
+
359
+ // Notes (Note component and Sidenote)
360
+ if (node.classList?.contains('note') || node.classList?.contains('sidenote')) {
361
+ const title = node.querySelector('.note__title, .note-title')?.textContent || '';
362
+ const content = cleanText(node.textContent);
363
+
364
+ if (title) {
365
+ output.push(`<n>${title} | ${content}</n>\n\n`);
366
+ } else {
367
+ output.push(`<n>${content}</n>\n\n`);
368
+ }
369
+ return;
370
+ }
371
+
372
+ // Recurse through children for unhandled elements
373
+ if (node.children && node.children.length > 0 && !['pre', 'code', 'table', 'figure'].includes(tag)) {
374
+ try {
375
+ Array.from(node.children).forEach(processNode);
376
+ } catch (e) {
377
+ console.error('Error processing children:', e);
378
+ }
379
+ }
380
+ };
381
+
382
+ // Process all direct children of main
383
+ Array.from(main.children).forEach(processNode);
384
+
385
+ // Add metadata about visual elements
386
+ const katexCount = Array.from(main.querySelectorAll('.katex-display')).length;
387
+ if (katexCount > 0) {
388
+ output.push(`\n\n<!-- Visual elements are numbered globally in DOM order (1, 2, 3...) to match exported screenshots -->\n`);
389
+ output.push(`<!-- KaTeX formulas: ${katexCount} formulas exported as N-katex.png where N is the global index -->\n`);
390
+ }
391
+
392
+ return output.join('');
393
+ });
394
+ }
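The caption handling inside `extractArticleContent` can be exercised outside the browser. Below is a standalone sketch of the `parseCaptionText` helper with the same three patterns; the dash characters are written as `\u2013` and `\u2014` escapes, and the sample captions are made up for illustration.

```javascript
// Standalone sketch of parseCaptionText: split "Figure 1: Description" style
// captions into a name and a description.
function parseCaptionText(captionText, type = 'Figure') {
  if (!captionText) return { name: '', description: '' };
  const separator = '\\s*[:\\-\\u2013\\u2014]\\s*'; // colon, hyphen, en dash or em dash
  const patterns = [
    new RegExp(`^(${type}\\s*\\d+[a-z]?)${separator}(.+)$`, 'i'),
    new RegExp(`^(Fig\\.?\\s*\\d+[a-z]?)${separator}(.+)$`, 'i'),
    new RegExp(`^(Table\\s*\\d+[a-z]?)${separator}(.+)$`, 'i'),
  ];
  for (const pattern of patterns) {
    const match = captionText.match(pattern);
    if (match) return { name: match[1].trim(), description: match[2].trim() };
  }
  // No recognised prefix: the whole caption is treated as the description.
  return { name: '', description: captionText.trim() };
}

console.log(parseCaptionText('Figure 1: Training loss over time', 'Figure'));
console.log(parseCaptionText('Table 2 - Hyperparameters', 'Table'));
console.log(parseCaptionText('Training loss over time', 'Figure'));
```

A caption without a separator after the number, such as `Figure 1 Training loss`, falls through to the last branch, and the script then generates a name like `image-N` or `embed-N` from the global index.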
395
+
396
+ async function main() {
397
+ const cwd = process.cwd();
398
+ const args = parseArgs(process.argv);
399
+
400
+ let outFileBase = args.filename || 'article';
401
+ outFileBase = outFileBase.replace(/\.txt$/i, '');
402
+
403
+ // Build only if dist/ does not exist
404
+ const distDir = resolve(cwd, 'dist');
405
+ let hasDist = false;
406
+ try {
407
+ const st = await fs.stat(distDir);
408
+ hasDist = st && st.isDirectory();
409
+ } catch { }
410
+
411
+ if (!hasDist) {
412
+ console.log('> Building Astro site…');
413
+ await run('npm', ['run', 'build']);
414
+ } else {
415
+ console.log('> Skipping build (dist/ exists)…');
416
+ }
417
+
418
+ console.log('> Starting Astro preview…');
419
+ // Capture stdout to detect the actual port used
420
+ let capturedPort = 8080;
421
+ const preview = spawn('npm', ['run', 'preview'], {
422
+ cwd,
423
+ stdio: ['ignore', 'pipe', 'pipe'],
424
+ detached: true
425
+ });
426
+
427
+ // Listen for port in output
428
+ preview.stdout.on('data', (data) => {
429
+ const output = data.toString();
430
+ process.stdout.write(output);
431
+ const match = output.match(/http:\/\/localhost:(\d+)/);
432
+ if (match) {
433
+ capturedPort = parseInt(match[1]);
434
+ }
435
+ });
436
+
437
+ preview.stderr.on('data', (data) => {
438
+ process.stderr.write(data);
439
+ });
440
+
441
+ const previewExit = new Promise((resolvePreview) => {
442
+ preview.on('close', (code, signal) => resolvePreview({ code, signal }));
443
+ });
444
+
445
+ // Wait a bit for the server to start and output the port
446
+ await delay(3000);
447
+ const baseUrl = `http://localhost:${capturedPort}/`;
448
+
449
+ try {
450
+ await waitForServer(baseUrl, 60000);
451
+ console.log('> Server ready, extracting content…');
452
+
453
+ const browser = await chromium.launch({ headless: true });
454
+ try {
455
+ const context = await browser.newContext();
456
+ const page = await context.newPage();
457
+
458
+ // Set viewport
459
+ await page.setViewportSize({ width: 1200, height: 1400 });
460
+
461
+ // Load page (use 'load' instead of 'networkidle' to avoid timeout on heavy pages)
462
+ await page.goto(baseUrl, { waitUntil: 'load', timeout: 60000 });
463
+
464
+ // Wait for content to be ready
465
+ await page.waitForTimeout(3000);
466
+
467
+ // Wait for main content to be present
468
+ await page.waitForSelector('main', { timeout: 10000 });
469
+
470
+ // Get article title for filename
471
+ if (!args.filename) {
472
+ const title = await page.evaluate(() => {
473
+ const h1 = document.querySelector('h1.hero-title');
474
+ const t = h1 ? h1.textContent : document.title;
475
+ return (t || '').replace(/\s+/g, ' ').trim();
476
+ });
477
+ outFileBase = slugify(title);
478
+ }
479
+
480
+ console.log('> Extracting article content…');
481
+ const txtContent = await extractArticleContent(page);
482
+
483
+ // Write output
484
+ const outPath = resolve(cwd, 'dist', `${outFileBase}.txt`);
485
+ await fs.writeFile(outPath, txtContent, 'utf-8');
486
+ console.log(`✅ TXT exported: ${outPath}`);
487
+
488
+ // Copy to public folder
489
+ const publicPath = resolve(cwd, 'public', `${outFileBase}.txt`);
490
+ try {
491
+ await fs.mkdir(resolve(cwd, 'public'), { recursive: true });
492
+ await fs.copyFile(outPath, publicPath);
493
+ console.log(`✅ TXT copied to: ${publicPath}`);
494
+ } catch (e) {
495
+ console.warn('Unable to copy TXT to public/:', e?.message || e);
496
+ }
497
+
498
+ } finally {
499
+ await browser.close();
500
+ }
501
+ } finally {
502
+ // Clean shutdown
503
+ try {
504
+ if (process.platform !== 'win32') {
505
+ try { process.kill(-preview.pid, 'SIGINT'); } catch { }
506
+ }
507
+ try { preview.kill('SIGINT'); } catch { }
508
+ await Promise.race([previewExit, delay(3000)]);
509
+
510
+ if (!preview.killed) {
511
+ try {
512
+ if (process.platform !== 'win32') {
513
+ try { process.kill(-preview.pid, 'SIGKILL'); } catch { }
514
+ }
515
+ try { preview.kill('SIGKILL'); } catch { }
516
+ } catch { }
517
+ await Promise.race([previewExit, delay(1000)]);
518
+ }
519
+ } catch { }
520
+ }
521
+ }
522
+
523
+ main().catch((err) => {
524
+ console.error('❌ Error:', err.message);
525
+ console.error(err);
526
+ process.exit(1);
527
+ });
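
Note on the port detection above: the stdout listener can be read as a small pure function. The sketch below is only an illustration; `parsePreviewPort` is a hypothetical name that does not exist in the commit. It assumes the preview server prints a `http://localhost:<port>` URL and reuses the 8080 default from `main()`.

```javascript
// Hypothetical helper (not part of the commit): extract the preview port
// from a chunk of server output, falling back to a default when no
// http://localhost:<port> URL is present.
function parsePreviewPort(output, fallback = 8080) {
  const match = String(output).match(/http:\/\/localhost:(\d+)/);
  // Pass an explicit radix so the digits are always read as decimal.
  return match ? Number.parseInt(match[1], 10) : fallback;
}

console.log(parsePreviewPort('Local  http://localhost:4321/'));
console.log(parsePreviewPort('no url in this chunk'));
```

Keeping the parsing separate from the `spawn` wiring makes it possible to check the regex without starting a server.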
app/scripts/notion-importer/mdx-converter.mjs CHANGED
@@ -670,6 +670,33 @@ function addSpacingAroundComponents(content) {
670
  return processedContent;
671
  }
672
 
673
  /**
674
  * Fix smart quotes (curly quotes) and replace them with straight quotes
675
  * @param {string} content - Markdown content
@@ -732,6 +759,9 @@ async function processMdxContent(content, pageId = null, notionToken = null, out
732
  // Fix smart quotes first
733
  processedContent = fixSmartQuotes(processedContent);
734
 
735
  // Process external images first (before other transformations)
736
  if (outputDir) {
737
  // Create a temporary external images directory in the output folder
 
670
  return processedContent;
671
  }
672
 
673
+ /**
674
+ * Escape angle brackets before numbers to prevent MDX parsing errors
675
+ * In MDX, <30B would be interpreted as a JSX element, but element names can't start with numbers
676
+ * @param {string} content - Markdown content
677
+ * @returns {string} - Content with escaped angle brackets
678
+ */
679
+ function escapeAngleBracketsBeforeNumbers(content) {
680
+ console.log(' 🔧 Escaping angle brackets before numbers...');
681
+
682
+ let fixedCount = 0;
683
+
684
+ // Replace every < that is immediately followed by a digit with &lt;
685
+ // Note: the regex below does not skip code blocks or HTML tags; it matches any "<" directly followed by a digit
686
+ const processed = content.replace(/(<)(\d)/g, (match, bracket, digit) => {
687
+ fixedCount++;
688
+ return `&lt;${digit}`;
689
+ });
690
+
691
+ if (fixedCount > 0) {
692
+ console.log(` ✅ Escaped ${fixedCount} angle bracket(s) before numbers`);
693
+ } else {
694
+ console.log(' ℹ️ No angle brackets before numbers found');
695
+ }
696
+
697
+ return processed;
698
+ }
699
+
700
  /**
701
  * Fix smart quotes (curly quotes) and replace them with straight quotes
702
  * @param {string} content - Markdown content
 
759
  // Fix smart quotes first
760
  processedContent = fixSmartQuotes(processedContent);
761
 
762
+ // Escape angle brackets before numbers (e.g., <30B -> &lt;30B)
763
+ processedContent = escapeAngleBracketsBeforeNumbers(processedContent);
764
+
765
  // Process external images first (before other transformations)
766
  if (outputDir) {
767
  // Create a temporary external images directory in the output folder
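
Note on `escapeAngleBracketsBeforeNumbers`: the regex in the commit rewrites every `<` followed by a digit, including inside code. If code spans should be left alone, one possible approach is to split the content on code segments first. The sketch below is an assumption, not the commit's implementation: `escapeAngleBracketsOutsideCode` is a hypothetical name, and it only recognizes triple-backtick fences and single-backtick inline code.

```javascript
// Hypothetical variant (not part of the commit): escape "<" before a digit,
// but leave fenced code blocks and inline code spans untouched.
function escapeAngleBracketsOutsideCode(content) {
  // The capture group makes split() keep the code segments in the result,
  // so prose lands at even indices and code at odd indices.
  const parts = content.split(/(```[\s\S]*?```|`[^`\n]*`)/g);
  return parts
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/<(?=\d)/g, '&lt;')))
    .join('');
}

console.log(escapeAngleBracketsOutsideCode('models <30B and `x <3`'));
```

The lookahead `(?=\d)` leaves the digit in place, so no replacement callback is needed. Indented code blocks and tilde fences are not handled in this sketch.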
app/scripts/screenshot-elements.mjs CHANGED
@@ -2,11 +2,16 @@ import { chromium } from 'playwright';
2
  import { mkdir } from 'fs/promises';
3
  import { join } from 'path';
4
 
5
  const URL = 'http://localhost:4321/?viz=true';
6
  const OUTPUT_DIR = './screenshots';
7
  const SELECTORS = ['.html-embed', '.table-scroll > table', '.image-wrapper', '.katex-display'];
8
- const DEVICE_SCALE_FACTOR = 2; // Retina quality
9
  const BASE_VIEWPORT = { width: 1200, height: 800 };
 
10
 
11
  const slugify = (value) =>
12
  String(value || '')
@@ -20,6 +25,7 @@ async function main() {
20
  await mkdir(OUTPUT_DIR, { recursive: true });
21
 
22
  console.log('🚀 Launching browser...');
 
23
  const browser = await chromium.launch({ headless: true });
24
  const context = await browser.newContext({
25
  deviceScaleFactor: DEVICE_SCALE_FACTOR,
@@ -97,7 +103,7 @@ async function main() {
97
  });
98
 
99
  const slug = slugify(label);
100
- const baseName = `${i + 1}-${type}${slug ? `--${slug}` : ''}`;
101
  const filename = `${baseName}.png`;
102
  const filepath = join(OUTPUT_DIR, filename);
103
 
@@ -108,7 +114,7 @@ async function main() {
108
  }
109
 
110
  if (type !== 'table' && type !== 'katex') {
111
- await element.evaluate((el) => {
112
  const stash = (node) => {
113
  if (!node || !(node instanceof HTMLElement)) return;
114
  node.dataset.__prevStyle = node.getAttribute('style') ?? '';
@@ -131,19 +137,65 @@ async function main() {
131
  // Aggressive cleanup only for banners
132
  const all = el.querySelectorAll('*');
133
  all.forEach((node) => stash(node));
134
 
135
  const svgRects = el.querySelectorAll('svg rect');
136
- svgRects.forEach((rect) => {
137
- rect.setAttribute('rx', '0');
138
- rect.setAttribute('ry', '0');
139
- rect.setAttribute('stroke', 'none');
 
140
  });
141
  }
142
- });
143
  }
144
 
145
  if (type === 'table') {
146
- const cloneId = await element.evaluate((el, idx) => {
147
  const existing = document.getElementById(`__table-clone-wrapper-${idx}`);
148
  if (existing) existing.remove();
149
 
@@ -176,12 +228,18 @@ async function main() {
176
  clone.style.minWidth = '0';
177
  clone.style.maxWidth = 'none';
178
  clone.style.tableLayout = 'auto';
 
 
 
179
 
180
  const cells = clone.querySelectorAll('th, td');
181
  cells.forEach(cell => {
182
  cell.style.width = 'auto';
183
  cell.style.minWidth = '0';
184
  cell.style.maxWidth = 'none';
 
 
 
185
  });
186
 
187
  tableScroll.appendChild(clone);
@@ -191,7 +249,7 @@ async function main() {
191
  document.body.appendChild(wrapper);
192
 
193
  return clone.id;
194
- }, i);
195
 
196
  const wrapperSelector = `#__table-clone-wrapper-${i}`;
197
  const cloneSelector = `#${cloneId}`;
@@ -213,7 +271,8 @@ async function main() {
213
 
214
  await page.locator(cloneSelector).screenshot({
215
  path: filepath,
216
- type: 'png'
 
217
  });
218
 
219
  await page.evaluate((selector) => {
@@ -221,7 +280,7 @@ async function main() {
221
  if (el) el.remove();
222
  }, wrapperSelector);
223
  } else if (type === 'katex') {
224
- const cloneId = await element.evaluate((el, idx) => {
225
  const existing = document.getElementById(`__katex-clone-wrapper-${idx}`);
226
  if (existing) existing.remove();
227
 
@@ -243,12 +302,22 @@ async function main() {
243
  clone.style.width = 'max-content';
244
  clone.style.maxWidth = 'none';
245
  clone.style.margin = '0';
246
 
247
  wrapper.appendChild(clone);
248
  document.body.appendChild(wrapper);
249
 
250
  return clone.id;
251
- }, i);
252
 
253
  const wrapperSelector = `#__katex-clone-wrapper-${i}`;
254
  const cloneSelector = `#${cloneId}`;
@@ -270,7 +339,8 @@ async function main() {
270
 
271
  await page.locator(cloneSelector).screenshot({
272
  path: filepath,
273
- type: 'png'
 
274
  });
275
 
276
  await page.evaluate((selector) => {
@@ -280,7 +350,8 @@ async function main() {
280
  } else {
281
  await element.screenshot({
282
  path: filepath,
283
- type: 'png'
 
284
  });
285
  }
286
 
@@ -316,7 +387,7 @@ async function main() {
316
  });
317
 
318
  await page.waitForTimeout(150);
319
- await element.screenshot({ path: openFilepath, type: 'png' });
320
  console.log(` ✅ ${openFilename}`);
321
 
322
  await selectHandle.evaluate((el) => {
 
2
  import { mkdir } from 'fs/promises';
3
  import { join } from 'path';
4
 
5
+ // Parse CLI arguments
6
+ const args = process.argv.slice(2);
7
+ const TRANSPARENT = args.includes('--transparent');
8
+
9
  const URL = 'http://localhost:4321/?viz=true';
10
  const OUTPUT_DIR = './screenshots';
11
  const SELECTORS = ['.html-embed', '.table-scroll > table', '.image-wrapper', '.katex-display'];
12
+ const DEVICE_SCALE_FACTOR = 4; // 4x for high-quality print
13
  const BASE_VIEWPORT = { width: 1200, height: 800 };
14
+ const FILENAME_SUFFIX = TRANSPARENT ? '-transparent' : '';
15
 
16
  const slugify = (value) =>
17
  String(value || '')
 
25
  await mkdir(OUTPUT_DIR, { recursive: true });
26
 
27
  console.log('🚀 Launching browser...');
28
+ if (TRANSPARENT) console.log('🔲 Transparent mode enabled (omitBackground: true)');
29
  const browser = await chromium.launch({ headless: true });
30
  const context = await browser.newContext({
31
  deviceScaleFactor: DEVICE_SCALE_FACTOR,
 
103
  });
104
 
105
  const slug = slugify(label);
106
+ const baseName = `${i + 1}-${type}${slug ? `--${slug}` : ''}${FILENAME_SUFFIX}`;
107
  const filename = `${baseName}.png`;
108
  const filepath = join(OUTPUT_DIR, filename);
109
 
 
114
  }
115
 
116
  if (type !== 'table' && type !== 'katex') {
117
+ await element.evaluate((el, isTransparent) => {
118
  const stash = (node) => {
119
  if (!node || !(node instanceof HTMLElement)) return;
120
  node.dataset.__prevStyle = node.getAttribute('style') ?? '';
 
137
  // Aggressive cleanup only for banners
138
  const all = el.querySelectorAll('*');
139
  all.forEach((node) => stash(node));
140
+ }
141
+
142
+ // Also target d3-loss-curves (banner component)
143
+ const lossCurves = el.querySelector('.d3-loss-curves');
144
+ if (lossCurves) {
145
+ lossCurves.style.background = 'transparent';
146
+ lossCurves.style.border = 'none';
147
+ lossCurves.style.borderRadius = '0';
148
+ }
149
+
150
+ // In transparent mode, neutralize backgrounds but preserve UI elements
151
+ if (isTransparent) {
152
+ // Step 1: Save computed backgrounds of UI elements we want to preserve
153
+ const uiSelectors = '.legend, [class*="legend"], .tooltip, [class*="tooltip"], .d3-tooltip, select, button, input, [class*="swatch"], [class*="label"]';
154
+ const uiElements = el.querySelectorAll(uiSelectors);
155
+ const savedStyles = new Map();
156
+ uiElements.forEach((uiEl) => {
157
+ const computed = window.getComputedStyle(uiEl);
158
+ savedStyles.set(uiEl, {
159
+ background: computed.background,
160
+ backgroundColor: computed.backgroundColor
161
+ });
162
+ });
163
+
164
+ // Step 2: Apply transparency to EVERYTHING
165
+ el.style.setProperty('background', 'transparent', 'important');
166
+ el.style.setProperty('background-color', 'transparent', 'important');
167
+ el.style.setProperty('background-image', 'none', 'important');
168
+
169
+ const allElements = el.querySelectorAll('*');
170
+ allElements.forEach((node) => {
171
+ if (node instanceof HTMLElement) {
172
+ node.style.setProperty('background', 'transparent', 'important');
173
+ node.style.setProperty('background-color', 'transparent', 'important');
174
+ node.style.setProperty('background-image', 'none', 'important');
175
+ }
176
+ });
177
 
178
+ // Step 3: Restore UI elements backgrounds
179
+ savedStyles.forEach((styles, uiEl) => {
180
+ if (styles.backgroundColor && styles.backgroundColor !== 'rgba(0, 0, 0, 0)') {
181
+ uiEl.style.setProperty('background-color', styles.backgroundColor, 'important');
182
+ }
183
+ });
184
+
185
+ // Target SVG rect elements that look like backgrounds
186
  const svgRects = el.querySelectorAll('svg rect');
187
+ svgRects.forEach((rect, idx) => {
188
+ const fill = (rect.getAttribute('fill') || '').toLowerCase();
189
+ if (idx === 0 || fill === 'white' || fill.startsWith('#fff') || fill.includes('255, 255, 255')) {
190
+ rect.setAttribute('fill', 'none');
191
+ }
192
  });
193
  }
194
+ }, TRANSPARENT);
195
  }
196
 
197
  if (type === 'table') {
198
+ const cloneId = await element.evaluate((el, idx, isTransparent) => {
199
  const existing = document.getElementById(`__table-clone-wrapper-${idx}`);
200
  if (existing) existing.remove();
201
 
 
228
  clone.style.minWidth = '0';
229
  clone.style.maxWidth = 'none';
230
  clone.style.tableLayout = 'auto';
231
+ if (isTransparent) {
232
+ clone.style.background = 'transparent';
233
+ }
234
 
235
  const cells = clone.querySelectorAll('th, td');
236
  cells.forEach(cell => {
237
  cell.style.width = 'auto';
238
  cell.style.minWidth = '0';
239
  cell.style.maxWidth = 'none';
240
+ if (isTransparent) {
241
+ cell.style.background = 'transparent';
242
+ }
243
  });
244
 
245
  tableScroll.appendChild(clone);
 
249
  document.body.appendChild(wrapper);
250
 
251
  return clone.id;
252
+ }, i, TRANSPARENT);
253
 
254
  const wrapperSelector = `#__table-clone-wrapper-${i}`;
255
  const cloneSelector = `#${cloneId}`;
 
271
 
272
  await page.locator(cloneSelector).screenshot({
273
  path: filepath,
274
+ type: 'png',
275
+ omitBackground: TRANSPARENT
276
  });
277
 
278
  await page.evaluate((selector) => {
 
280
  if (el) el.remove();
281
  }, wrapperSelector);
282
  } else if (type === 'katex') {
283
+ const cloneId = await element.evaluate((el, idx, isTransparent) => {
284
  const existing = document.getElementById(`__katex-clone-wrapper-${idx}`);
285
  if (existing) existing.remove();
286
 
 
302
  clone.style.width = 'max-content';
303
  clone.style.maxWidth = 'none';
304
  clone.style.margin = '0';
305
+ if (isTransparent) {
306
+ clone.style.background = 'transparent';
307
+ // Neutralize white backgrounds in katex elements
308
+ const allElements = clone.querySelectorAll('*');
309
+ allElements.forEach((node) => {
310
+ if (node instanceof HTMLElement) {
311
+ node.style.background = 'transparent';
312
+ }
313
+ });
314
+ }
315
 
316
  wrapper.appendChild(clone);
317
  document.body.appendChild(wrapper);
318
 
319
  return clone.id;
320
+ }, i, TRANSPARENT);
321
 
322
  const wrapperSelector = `#__katex-clone-wrapper-${i}`;
323
  const cloneSelector = `#${cloneId}`;
 
339
 
340
  await page.locator(cloneSelector).screenshot({
341
  path: filepath,
342
+ type: 'png',
343
+ omitBackground: TRANSPARENT
344
  });
345
 
346
  await page.evaluate((selector) => {
 
350
  } else {
351
  await element.screenshot({
352
  path: filepath,
353
+ type: 'png',
354
+ omitBackground: TRANSPARENT
355
  });
356
  }
357
 
 
387
  });
388
 
389
  await page.waitForTimeout(150);
390
+ await element.screenshot({ path: openFilepath, type: 'png', omitBackground: TRANSPARENT });
391
  console.log(` ✅ ${openFilename}`);
392
 
393
  await selectHandle.evaluate((el) => {
app/src/content/article.mdx CHANGED
The diff for this file is too large to render. See raw diff
 
app/src/pages/dataviz.astro CHANGED
@@ -247,7 +247,7 @@ const visualsWithMeta = visuals.map((item: any) => {
247
  <p class="header-desc">{item.desc || item.caption}</p>
248
  )}
249
  {item.anchorId && (
250
- <a href={`/#${item.anchorId}`} class="header-link" target="_blank" rel="noopener">
251
  View in article →
252
  </a>
253
  )}
 
247
  <p class="header-desc">{item.desc || item.caption}</p>
248
  )}
249
  {item.anchorId && (
250
+ <a href={`/#${item.anchorId}`} class="header-link">
251
  View in article →
252
  </a>
253
  )}
app/src/pages/index.astro CHANGED
@@ -197,6 +197,36 @@ const licence =
197
  } catch {}
198
  })();
199
  </script>
200
  <script type="module" src="/scripts/color-palettes.js"></script>
201
 
202
  <script src="https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js"></script>
 
197
  } catch {}
198
  })();
199
  </script>
200
+ <!-- Hash Router for HF Spaces compatibility -->
201
+ <script is:inline>
202
+ (() => {
203
+ // Routes map: #/route -> actual page path
204
+ const routes = {
205
+ '/dataviz': '/dataviz',
206
+ '/trackio': '/trackio',
207
+ };
208
+
209
+ function handleHashRoute() {
210
+ const hash = window.location.hash;
211
+ // Only handle hashes that start with #/ (route pattern)
212
+ if (!hash.startsWith('#/')) return;
213
+
214
+ const route = hash.slice(1); // Remove the # prefix
215
+ const targetPath = routes[route];
216
+
217
+ if (targetPath) {
218
+ // Redirect to the actual page
219
+ window.location.href = targetPath;
220
+ }
221
+ }
222
+
223
+ // Check on page load
224
+ handleHashRoute();
225
+
226
+ // Also listen for hash changes (in case user navigates via hash)
227
+ window.addEventListener('hashchange', handleHashRoute);
228
+ })();
229
+ </script>
230
  <script type="module" src="/scripts/color-palettes.js"></script>
231
 
232
  <script src="https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js"></script>
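
Note on the hash router: the lookup inside `handleHashRoute` can be expressed as a pure function, which makes the rule easier to see. Only hashes starting with `#/` are treated as routes, so ordinary in-page anchors are ignored. The sketch below is illustrative; `resolveHashRoute` is a hypothetical name that does not appear in the commit.

```javascript
// Hypothetical pure helper mirroring the inline router's lookup logic.
// Returns the target path for a "#/route" hash, or null when the hash is a
// plain anchor or an unknown route.
function resolveHashRoute(hash, routes) {
  if (typeof hash !== 'string' || !hash.startsWith('#/')) return null;
  const route = hash.slice(1); // drop the leading "#"
  return Object.prototype.hasOwnProperty.call(routes, route) ? routes[route] : null;
}

const exampleRoutes = { '/dataviz': '/dataviz', '/trackio': '/trackio' };
console.log(resolveHashRoute('#/dataviz', exampleRoutes));
console.log(resolveHashRoute('#figure-1', exampleRoutes));
```

In the browser, the result would be assigned to `window.location.href` only when it is not `null`, which matches the `if (targetPath)` guard in the inline script.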
app/yarn.lock CHANGED
The diff for this file is too large to render. See raw diff