HDembinski committed Jan 7, 2025
1 parent 7f07a07 commit d15d861
Showing 5 changed files with 97 additions and 112 deletions.
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -11,6 +11,7 @@ __pycache__
.vscode
_site
posts/scraped
posts/scrape.py

# pixi environments
.pixi
Expand Down
38 changes: 38 additions & 0 deletions pixi.lock

Generated file, diff not rendered.

2 changes: 2 additions & 0 deletions pixi.toml
Expand Up @@ -19,3 +19,5 @@ pip = ">=24.3.1,<25"

[pypi-dependencies]
markdownify = ">=0.14.1, <0.15"
playwright = ">=1.49.1, <2"
ollama = "*"
110 changes: 56 additions & 54 deletions posts/parsing_webpages_with_llm.ipynb
Expand Up @@ -32,20 +32,30 @@
"source": [
"## Reading a dynamic web pages and convert HTML to Markdown\n",
"\n",
"The code for this part was written by ChatGPT. At least on Windows, the Playwright code cannot be run inside a Jupyter notebook, so I had to use a script. Here is the content of the strict, which downloads the dynamic HTML and converts it to Markdown and saves the Markdown files in the subdirectory `scraped`.\n",
"The code for this part was written by ChatGPT. At least on Windows, the Playwright code cannot be run inside a Jupyter notebook, so I had to use a script. Here is the content of the strict, which downloads the dynamic HTML and converts it to Markdown and saves the Markdown files in the subdirectory `scraped`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"from pathlib import Path\n",
"\n",
"```py\n",
"script = r\"\"\"\n",
"from playwright.sync_api import sync_playwright\n",
"from markdownify import markdownify as md\n",
"from pathlib import Path\n",
"\n",
"urls = \"\"\"\n",
"urls = '''\n",
"https://inspirehep.net/literature/1889335\n",
"https://inspirehep.net/literature/2512593\n",
"https://inspirehep.net/literature/2017107\n",
"https://inspirehep.net/literature/2687746\n",
"https://inspirehep.net/literature/1928162\n",
"\"\"\"\n",
"https://inspirehep.net/literature/2727838\n",
"'''\n",
"\n",
"urls = [x.strip() for x in urls.split(\"\\n\") if x and not x.isspace()]\n",
"\n",
Expand Down Expand Up @@ -84,7 +94,13 @@
"\n",
"\n",
"scrape_to_markdown(urls, \"scraped\")\n",
"```"
"\"\"\"\n",
"\n",
"if not Path(\"scraped\").exists():\n",
" with open(\"scrape.py\", \"w\", encoding=\"utf-8\") as f:\n",
" f.write(script)\n",
"\n",
" subprocess.run([\"python\", \"scrape.py\"])"
]
},
{
Expand Down Expand Up @@ -165,20 +181,6 @@
"The converted Markdown contains mistakes, where the conversion process garbled up the structure of the document. Let's see whether the LLM can make sense of this raw text. We want it to extract the authors, the journal data, the title, and the DOI."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import ollama\n",
"from pathlib import Path\n",
"\n",
"input_dir = Path(\"scraped\")\n",
"\n",
"documents = [fn.open(encoding=\"utf-8\").read() for fn in input_dir.glob(\"*.md\")]"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand All @@ -196,43 +198,42 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import ollama\n",
"from pathlib import Path\n",
"\n",
"input_dir = Path(\"scraped\")\n",
"\n",
"documents = [fn.open(encoding=\"utf-8\").read() for fn in input_dir.glob(\"*.md\")]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0: Roel Aaij, JHEP 01 (2022) 166, \"Measurement of prompt charged-particle production in pp collisions at s=13 TeV\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)\n",
"0.1: Roel Aaij, JHEP 01 (2022) 166, \"Measurement of prompt charged-particle production in pp collisions at s=13 TeV\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)\n",
"0.2: Roel Aaij, JHEP 01 (2022) 166, \"Measurement of prompt charged-particle production in pp collisions at s=13 TeV\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).\n",
"1.0: Flavia Gesualdi, Hans Dembinski, Kenji Shinozaki, Daniel A. Supanitsky, Tanguy Pierog, PoS ICRC2021 (2021) 473, \"On the muon scale of air showers and its application to the AGASA data\", [10.22323/1.395.0473](https://doi.org/10.22323/1.395.0473)\n",
"1.1: Flavia Gesualdi, Hans Dembinski, Kenji Shinozaki, Daniel A. Supanitsky, Tanguy Pierog, PoS ICRC2021 (2021) 473, \"On the muon scale of air showers and its application to the AGASA data\", [10.22323/1.395.0473](https://doi.org/10.22323/1.395.0473).\n",
"1.2: Flavia Gesualdi, Hans Dembinski, Kenji Shinozaki, Daniel A. Supanitsky, Tanguy Pierog, PoS ICRC2021 (2021) 473, \"On the muon scale of air showers and its application to the AGASA data\", [10.22323/1.395.0473](https://doi.org/10.22323/1.395.0473).\n",
"2.0: Tanguy Pierog, Sebastian Baur, Hans Dembinski, Matías Perlin, Ralf Ulrich, PoS ICRC2021 (2021) 469, \"When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle\", [10.22323/1.395.0469](https://doi.org/10.22323/1.395.0469).\n",
"2.1: Tanguy Pierog, Sebastian Baur, Hans Dembinski, Matías Perlin, Ralf Ulrich, PoS ICRC2021 (2021) 469, \"When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle\", [10.22323/1.395.0469](https://doi.org/10.22323/1.395.0469).\n",
"2.2: Tanguy Pierog, Sebastian Baur, Hans Dembinski, Matías Perlin, Ralf Ulrich, PoS ICRC2021 (2021) 469, \"When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle\", [10.22323/1.395.0469](https://doi.org/10.22323/1.395.0469).\n",
"3.0: Hans Dembinski, Matthew Kenzie, Christoph Langenbruch, Michael Schmelling, Nucl.Instrum.Meth.A 1040 (2022) 167270, \"Custom Orthogonal Weight functions (COWs) for event classification\", [10.1016/j.nima.2022.167270](https://doi.org/10.1016/j.nima.2022.167270)\n",
"3.1: Hans Dembinski, Matthew Kenzie, Christoph Langenbruch, Michael Schmelling, Nucl.Instrum.Meth.A 1040 (2022) 167270, \"Custom Orthogonal Weight functions (COWs) for event classification\", [10.1016/j.nima.2022.167270](https://doi.org/10.1016/j.nima.2022.167270).\n",
"3.2: Hans Dembinski, Matthew Kenzie, Christoph Langenbruch, Michael Schmelling, Nucl.Instrum.Meth.A 1040 (2022) 167270, \"Custom Orthogonal Weight functions (COWs) for event classification\", [10.1016/j.nima.2022.167270](https://doi.org/10.1016/j.nima.2022.167270).\n",
"4.0: Johannes Albrecht, Lorenzo Cazon, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astrophys.Space Sci. 367 (2022) 3, \"The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider\", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5).\n",
"4.1: Johannes Albrecht, Lorenzo Cazon, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astrophys.Space Sci., \"The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider\", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)\n",
"4.2: Johannes Albrecht, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astrophys.Space Sci., \"The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider\", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)\n",
"5.0: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, \"A new maximum-likelihood method for template fits\", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).\n",
"5.1: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, \"A new maximum-likelihood method for template fits\", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).\n",
"5.2: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, \"A new maximum-likelihood method for template fits\", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z)\n",
"6.0: David Maurin, Markus Ahlers, Hans Dembinski, Andreas Haungs, Pierre-Simon Mangeard, Eur. Phys. J. C 83 (2023) 971, \"A cosmic-ray database update: CRDB v4.1\", [10.1140/epjc/s10052-023-12092-8](https://doi.org/10.1140/epjc/s10052-023-12092-8).\n",
"6.1: David Maurin, Markus Ahlers, Hans Dembinski, Andreas Haungs, Pierre-Simon Mangeard, Eur. Phys. J. C 83 (2023) 971, \"A cosmic-ray database update: CRDB v4.1\", [10.1140/epjc/s10052-023-12092-8](https://doi.org/10.1140/epjc/s10052-023-12092-8).\n",
"6.2: David Maurin, Markus Ahlers, Hans Dembinski, Andreas Haungs, Pierre-Simon Mangeard, Eur.Phys.J.C 83 (2023) 971, \"A cosmic-ray database update: CRDB v4.1\", [10.1140/epjc/s10052-023-12092-8](https://doi.org/10.1140/epjc/s10052-023-12092-8).\n",
"7.0: Hans Dembinski, Anatoli Fedynitch, Anton Prosekin, PoS ICRC2023 (2023) 189, \"Chromo: An event generator frontend for particle and astroparticle physics\", [10.22323/1.444.0189](https://doi.org/10.22323/1.444.0189).\n",
"7.1: Hans Dembinski, Anatoli Fedynitch, Anton Prosekin, PoS ICRC2023 (2023) 189, \"Chromo: An event generator frontend for particle and astroparticle physics\", [10.22323/1.444.0189](https://doi.org/10.22323/1.444.0189).\n",
"7.2: Hans Dembinski, Anatoli Fedynitch, Anton Prosekin, PoS ICRC2023 (2023) 189, \"Chromo: An event generator frontend for particle and astroparticle physics\", [10.22323/1.444.0189](https://doi.org/10.22323/1.444.0189).\n",
"8.0: L. Cazon, H.P. Dembinski, G. Parente, F. Riehn, A.A. Watson, PoS ICRC2023 (2023) 431, \"The muon measurements of Haverah Park and their connection to the muon puzzle\", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).\n",
"8.1: L. Cazon, H.P. Dembinski, G. Parente, F. Riehn, A.A. Watson, PoS ICRC2023 (2023) 431, \"The muon measurements of Haverah Park and their connection to the muon puzzle\", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431)\n",
"8.2: L. Cazon, H.P. Dembinski, G. Parente, F. Riehn, A.A. Watson, PoS ICRC2023 (2023) 431, \"The muon measurements of Haverah Park and their connection to the muon puzzle\", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).\n",
"9.0: Hans Dembinski, Michael Schmelling, arXiv:2110.00294, \"Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments\", [arxiv.org/abs/2110.00294](https://arxiv.org/abs/2110.00294).\n",
"9.1: Hans Dembinski, Michael Schmelling, arXiv:2110.00294, \"Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).\n",
"9.2: Hans Dembinski, Michael Schmelling, arXiv:2110.00294, \"Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments\", [https://arxiv.org/abs/2110.00294](https://arxiv.org/abs/2110.00294)\n"
"0.0: Roel Aaij et al., JHEP 01 (2022) 166, \"Measurement of prompt charged-particle production in pp collisions at s=13 TeV\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)\n",
"0.1: Roel Aaij et al., JHEP 01 (2022) 166, \"Measurement of prompt charged-particle production in pp collisions at s=13 TeV\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)\n",
"0.2: Roel Aaij et al., JHEP 01 (2022) 166, \"Measurement of prompt charged-particle production in pp collisions at s=13 TeV\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)\n",
"1.0: Johannes Albrecht, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astropart.Space Sci. 367 (2022) 3, 27, \"The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider\", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)\n",
"1.1: Johanes Albrecht, Lorenzo Cazon, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astrophys.Space Sci., \"The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider\", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)\n",
"1.2: Johannes Albrecht et al., Astrophys.Space Sci. 367 (2022) 3, \"The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider\", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)\n",
"2.0: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, \"A new maximum-likelihood method for template fits\", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).\n",
"2.1: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, \"A new maximum-likelihood method for template fits\", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).\n",
"2.2: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, \"A new maximum-likelihood method for template fits\", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).\n",
"3.0: L. Cazon et al., PoS ICRC2023 (2023) 431, \"The muon measurements of Haverah Park and their connection to the muon puzzle\", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).\n",
"3.1: L. Cazon et al., PoS ICRC2023 (2023) 431, \"The muon measurements of Haverah Park and their connection to the muon puzzle\", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).\n",
"3.2: L. Cazon et al., PoS ICRC2023 (2023) 431, \"The muon measurements of Haverah Park and their connection to the muon puzzle\", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).\n",
"4.0: Hans Dembinski, Michael Schmelling, \"Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments\", [arXiv:2110.00294](https://arxiv.org/abs/2110.00294)\n",
"4.1: Hans Dembinski, Michael Schmelling, arXiv:2110.00294, \"Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments\", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)\n",
"4.2: Hans Dembinski, Michael Schmelling, \"Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments\", arXiv:2110.00294\n"
]
}
],
Expand Down Expand Up @@ -266,9 +267,10 @@
" d = doc[:doc.index(\"###\")]\n",
" prompt = prompt_template.format(text=d)\n",
" for trial in range(3):\n",
" response = ollama.generate(model='llama3-chatqa', prompt=prompt, options={\"temperature\": 0.3})\n",
" # as a tiny bit of post-processing we replace newlines with spaces\n",
" text = response.response.replace('\\n', '')\n",
" # a low temperate seems to make the output more reliable\n",
" response = ollama.generate(model='llama3-chatqa', prompt=prompt, options={\"temperature\": 0.3, \"seed\": trial})\n",
" # tiny bit of post-processing: replace newlines with spaces, trim whitespace\n",
" text = response.response.replace('\\n', '').strip()\n",
" print(f\"{idoc}.{trial}: {text}\")"
]
},
Expand Down
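The printed output in this notebook shows that the three trials per document do not always agree: some answers end with a period, and some drop the journal details or the DOI. The following is a minimal sketch, not part of the commit, of how one might pick the most frequent answer per document. `normalize` and `consensus` are hypothetical helper names, and the sample strings are made up for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Collapse runs of whitespace and drop a trailing period."""
    return " ".join(answer.split()).rstrip(".")


def consensus(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and its share of all trials."""
    counts = Counter(normalize(a) for a in answers)
    best, n = counts.most_common(1)[0]
    return best, n / len(answers)


# Hypothetical trial outputs for one document (not taken from the notebook).
trials = [
    "A. Author, J. Example 1 (2022) 1.",
    "A. Author, J. Example 1 (2022) 1",
    "A. Author et al., J. Example 1 (2022) 1",
]

best, share = consensus(trials)
print(best, share)
```

A share below 1 flags documents where the model was inconsistent and a manual check is probably worthwhile.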
58 changes: 0 additions & 58 deletions posts/scrape_to_markdown.py

This file was deleted.
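With `posts/scrape_to_markdown.py` deleted, the notebook itself writes the scraping script to `scrape.py` and runs it in a subprocess, but only if the `scraped` directory does not exist yet. The pattern can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: `run_script_if_missing` is a hypothetical helper, `sys.executable` and `check=True` are swapped in for the plain `"python"` call, and the demo script is a stand-in that only creates the directory.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_if_missing(script: str, script_path: Path, output_dir: Path) -> bool:
    """Write `script` to `script_path` and run it, unless `output_dir` exists.

    Returns True if the script was run, False if it was skipped.
    """
    if output_dir.exists():
        return False
    script_path.write_text(script, encoding="utf-8")
    # Assumption: sys.executable selects the interpreter running this code,
    # whereas the notebook calls plain "python". check=True raises on failure.
    subprocess.run(
        [sys.executable, str(script_path)],
        check=True,
        cwd=script_path.parent,  # relative paths in the script resolve here
    )
    return True


# Hypothetical stand-in for the scraping script: it only creates the directory.
demo_script = 'from pathlib import Path\nPath("scraped").mkdir()\n'

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    first = run_script_if_missing(demo_script, work / "scrape.py", work / "scraped")
    second = run_script_if_missing(demo_script, work / "scrape.py", work / "scraped")
    print(first, second)
```

The second call is skipped because the first run created the output directory, which mirrors how re-executing the notebook cell avoids scraping again.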
