{"id":470125,"date":"2025-08-09T09:00:21","date_gmt":"2025-08-09T09:00:21","guid":{"rendered":"http:\/\/savepearlharbor.com\/?p=470125"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=470125","title":{"rendered":"<span>Docling in Working with Texts, Languages, and Knowledge<\/span>"},"content":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/44d\/87a\/b11\/44d87ab11dd5e0183bc6b64314d47799.png\" alt=\"Docling in Working with Texts, Languages, and Knowledge\" title=\"Docling in Working with Texts, Languages, and Knowledge\" width=\"2048\" height=\"855\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/44d\/87a\/b11\/44d87ab11dd5e0183bc6b64314d47799.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/44d\/87a\/b11\/44d87ab11dd5e0183bc6b64314d47799.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/p>\n<div><figcaption>Docling in Working with Texts, Languages, and Knowledge<\/figcaption><\/div>\n<\/figure>\n<p>Hi everyone. 
In the context of our research project, we were solving the problem of automating academic submission workflows, which led us to discover a platform called\u00a0<strong>Docling<\/strong>.<\/p>\n<p>Together, we explore the role of Docling in reshaping how research data can be represented, reused, and reasoned over in both human and machine-readable formats.<\/p>\n<p>As part of the development of a scientific research project \u201cAdvanced Scientific Research Projects\u201d (ASRP) aimed at creating a new-generation academic journal named\u00a0<a href=\"https:\/\/asrp.science\/en\" rel=\"noopener noreferrer nofollow\">ASRP.science<\/a>, we encountered a number of methodological and technical challenges. One of the key issues was the automation of parsing academic research documents in order to streamline the submission, processing, and archiving of materials provided by researchers. Modern digital publications increasingly require tools that not only accept text submissions but also structure knowledge, annotate content, handle metadata, terminology, and interlinked concepts, and ensure that outputs are machine-readable for downstream use.<\/p>\n<p>During the initial stages of analysis and prototyping, we formulated the following goal:<\/p>\n<blockquote>\n<p><em>Develop or adapt a platform capable of effectively handling linguistically and conceptually annotated texts, compatible with academic publication formats, and supporting export and interoperability with LLM pipelines and knowledge bases.<\/em><\/p>\n<\/blockquote>\n<p>While exploring available solutions, we came across an open-source project called\u00a0<strong>Docling<\/strong>, designed for linguists, researchers, and digital humanists. 
Although originally built for working with language and text in a linguistic context, Docling turned out to be surprisingly well-aligned with our needs.<\/p>\n<p><strong>Why Docling Was a Relevant Choice:<\/strong><\/p>\n<ul>\n<li>\n<p><strong>Structured data format (JSON):<\/strong>\u00a0Docling stores data in a structured, lightweight JSON format, which integrates easily into research pipelines and software development workflows. This made it easy for us to feed Docling outputs into other tools.<\/p>\n<\/li>\n<li>\n<p><strong>Graph- and tree-based knowledge representation:<\/strong>\u00a0It supports graph and tree representations of knowledge, crucial for semantic parsing and linking concepts. A research paper in Docling isn\u2019t just text; it becomes a network of interrelated nodes (e.g., sections, definitions, examples).<\/p>\n<\/li>\n<li>\n<p><strong>Flexible corpora creation:<\/strong>\u00a0We can create flexible corpora, including lexical databases or grammatical descriptions, within the same environment. This was useful for building a \u201clanguage archive\u201d of terms and definitions encountered in our papers.<\/p>\n<\/li>\n<li>\n<p><strong>Visualization and extensibility:<\/strong>\u00a0Docling offers visualization capabilities (tree views, tables) and is modular\/extensible. We could visualize argument structures or sentence parse trees, then extend the platform with custom scripts.<\/p>\n<\/li>\n<li>\n<p><strong>Interface with AI systems:<\/strong>\u00a0Perhaps most interestingly, Docling can serve as an interface for interaction with AI systems, including LLMs. 
The structured outputs (in JSON) can be fed into machine learning pipelines or used to improve prompt engineering by providing context in a structured way.<\/p>\n<\/li>\n<\/ul>\n<p>In the following sections, we will explore the architecture of Docling, its core features, example use cases, and the platform\u2019s potential for adaptation within our research infrastructure.<\/p>\n<h3>Docling in Academic Submission Workflows<\/h3>\n<p><strong>\ud83e\udde0 Key Points to Cover:<\/strong><\/p>\n<ul>\n<li>\n<p><strong>Automatic structuring of research content:<\/strong>\u00a0Docling allows users to break down complex research documents into modular, interlinked nodes (e.g. hypotheses, arguments, citations, definitions) instead of treating a document as one big block of text. For example, in our project pipeline we use Docling\u2019s API to parse an uploaded PDF into a graph of\u00a0<em>nodes<\/em>. Each node might correspond to a section heading, a paragraph, a bibliography entry, etc., all linked together by the document\u2019s structure. Programmatically, this looks like:<\/p>\n<\/li>\n<\/ul>\n<pre><code class=\"python\">from docling.document_converter import DocumentConverter\n\n# Parse a PDF file into a DoclingDocument and iterate over its nodes\nconverter = DocumentConverter()\ndocling_doc = converter.convert(\"sample_article.pdf\").document\nfor node, _ in docling_doc.iterate_items():\n    data = node.model_dump()  # convert DocItem to a Python dict\n    print(f\"{data['label']}: {data['text'][:50]}...\")<\/code><\/pre>\n<p><em>Code example: Using Docling to convert a PDF into structured nodes.<\/em>\u00a0In this snippet, each\u00a0<code>node<\/code>\u00a0is a Docling\u00a0<strong>DocItem<\/strong>\u00a0with a\u00a0<code>label<\/code>\u00a0(type of content, e.g. Title, Heading, Paragraph) and\u00a0<code>text<\/code>\u00a0content. This kind of automated segmentation means a submitted paper is not just a single blob of text; it\u2019s a hierarchy of interrelated parts. 
For instance, running the above on a sample article prints something like:<\/p>\n<pre><code class=\"python\">Title: Forecasting Social, Geopolitical, and Economic Events Using the 'Banchenko-Technology'\nParagraph: The unconscious can be understood as an entity subject to certain laws and in a dynamic relationship with consciousness...<\/code><\/pre>\n<p>Each line represents a node (with its type and an excerpt of text). Under the hood, Docling has taken the document and broken it into pieces, classifying them (e.g., that first node was recognized as a Title).<\/p>\n<ul>\n<li>\n<p><strong>Text + metadata hybrid:<\/strong>\u00a0Each node in Docling can contain text (the content of that segment, such as a paragraph or example sentence)\u00a0<em>and<\/em>\u00a0associated metadata. Metadata might include the author, source, tags, time period, related terms, or any custom fields you define. This is ideal for academic articles that need to be parsed, sorted, and indexed. In our pipeline, for instance, we could attach metadata like page numbers or confidence scores to each node. Docling\u2019s data model (built on Pydantic) makes it easy to handle these metadata fields as Python objects. (In code, we simply call\u00a0<code>node.model_dump()<\/code>\u00a0to get a JSON-ready dict of all of a node\u2019s fields, including text and any annotations.)<\/p>\n<\/li>\n<li>\n<p><strong>Knowledge graph of the paper:<\/strong>\u00a0A submitted paper is not just a static PDF \u2014 once in Docling, it becomes a mini knowledge graph. Sections, arguments, and concepts are connected and queryable. For example, a\u00a0<em>Conclusion<\/em>\u00a0section node might have links to multiple\u00a0<em>Result<\/em>\u00a0section nodes that it references. Docling inherently supports linking any node to any other, allowing us to represent relationships like \u201cFigure X illustrates Theory Y\u201d or \u201cDefinition A is used in Section B\u201d. 
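<\/p>\n<p>As a minimal illustration of this idea (the node IDs and relation names below are invented for the sketch, not Docling\u2019s exact schema), such links can be thought of as typed triples:<\/p>\n<pre><code class=\"python\"># Illustrative only: typed links between document nodes as
# (source, relation, target) triples; the IDs and relation
# names are invented for this sketch, not a Docling API.
links = [
    ('figure_x', 'illustrates', 'theory_y'),
    ('definition_a', 'used-in', 'section_b'),
    ('conclusion_1', 'references', 'result_2'),
]

# Query the graph: which nodes does the conclusion draw on?
targets = [t for s, rel, t in links if s == 'conclusion_1']
print(targets)  # ['result_2']<\/code><\/pre>\n<p>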
The result is that the linear document transforms into a network of information. Reviewers (or algorithms) can trace the logic of an argument through this tree\/graph structure, instead of being confined to linear reading. This was a huge plus for us in thinking about machine-assisted peer review.<\/p>\n<\/li>\n<li>\n<p><strong>Better peer-review preparation:<\/strong>\u00a0Because of the structured, node-based approach, reviewers can traverse the argumentation structure more intuitively. For example, they can quickly isolate the central hypothesis node and see all evidence nodes linked to it, rather than hunting through the text. This tree-like structure supports critical thinking and logic validation by making the paper\u2019s knowledge graph explicit.<\/p>\n<\/li>\n<li>\n<p><strong>Interoperability with research platforms:<\/strong>\u00a0Docling\u2019s JSON structure can be easily integrated into other tools:<\/p>\n<ul>\n<li>\n<p><strong>Research management tools (like Zotero):<\/strong>\u00a0We can export bibliographic metadata or structured references from Docling and import them into citation managers.<\/p>\n<\/li>\n<li>\n<p><strong>Semantic indexing engines:<\/strong>\u00a0Because each piece of content is a node with metadata, we can feed the collection into a semantic search or indexing system. For instance, one could dump all Docling nodes into an Elasticsearch index, enabling fine-grained search (find all occurrences of a certain concept, or all evidence supporting a given claim).<\/p>\n<\/li>\n<li>\n<p><strong>Machine learning pipelines:<\/strong>\u00a0Perhaps most significantly, the clean JSON output can feed ML pipelines for automated analysis or review support. We experimented with an AI agent that consumes Docling-structured data to extract insights. For example, given a Docling-parsed document, our AI agent can answer questions like\u00a0<em>\u201cWhat are the author affiliations?\u201d<\/em>\u00a0by traversing the JSON structure. 
We simply provide the agent with the structured text and ask for what we need. In code, it looked like this:<\/p>\n<pre><code class=\"python\">import json\n\nquery = \"Extract all metadata from the document and return a single JSON object.\"\ninput_data = {\n    \"messages\": [(\"user\", query)],\n    \"file_path\": \"sample_article.pdf\"\n}\n# Invoke the LLM agent with the structured document as context\n# (awaited inside an async FastAPI handler, since ainvoke is a coroutine)\nresult_state = await agent_executor.ainvoke(input_data)\nfinal_answer = result_state[\"messages\"][-1].content\nmetadata = json.loads(final_answer)<\/code><\/pre>\n<p><em>Code example: Using a LangChain-powered agent to query a Docling document.<\/em>\u00a0In our FastAPI service, we pass the user\u2019s query and the document\u2019s file path into a LangChain agent (<code>agent_executor<\/code>). The agent has been configured with tools that interface with Docling \u2013 for example, one tool can get the document\u2019s title, another can get the authors, etc., all by utilizing the Docling-parsed content under the hood. The agent\u2019s final answer (here,\u00a0<code>final_answer<\/code>) is a JSON string. We then parse it into a Python dict\u00a0<code>metadata<\/code>. This\u00a0<strong>metadata JSON<\/strong>\u00a0is the result of the LLM analyzing the Docling structure of the document. It might look something like:<\/p>\n<pre><code class=\"python\">{\n    \"title\": \"Forecasting Social, Geopolitical, and Economic Events Using the 'Banchenko-Technology'\",\n    \"authors\": [\n        \"Denis Banchenko\",\n        \"Mykhailo Kapustin\"\n    ],\n    \"abstract\": \"This article presents a study on the interdependence between subjective experience gained through lucid dreaming and objectively observable processes dependent on collective consciousness. Relying on theoretical research in the field of consciousness, collective conscious and unconscious, and experimental data, assumptions about the nature of such interrelation have been made. A concept is proposed for discussion on the formation of structures capable of utilizing such phenomena for controlled types of activities such as: making economic-political decisions, managing investments, and shaping social transformations. The article introduces a digital system for managing event forecasts and market trends developed by BlackRock Corporation. Various aspects of the market\u2014overvalued companies, potential risks, the impact of political projects on the financial world, as well as possible areas of future financial crises \u2014 are under the purview of the artificial intelligence used within BlackRock.\",\n    \"doi\": \"10.33425\/2690-8077.1119\",\n    \"keywords\": [\n        \"states\",\n        \"predictions\",\n        \"forecasting\",\n        \"event\",\n        \"synchronization\"\n    ]\n}<\/code><\/pre>\n<p><em>Example output:<\/em>\u00a0Here the agent has identified key fields from the document \u2014 title, authors, abstract, DOI, and keywords. This JSON was generated by the LLM after it utilized Docling to navigate the document\u2019s structure. We could directly feed this structured output into downstream systems: for example, a database of article metadata, or a training corpus for an ML model.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Reusable components:<\/strong>\u00a0Once a term, concept, or quote is added as a node in Docling, it can be reused across documents or become part of a larger corpus. This is ideal for journals with recurring themes or researchers building a cross-referenced library of concepts. 
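<\/p>\n<p>A rough sketch of the idea (the node shapes and paper IDs below are invented for illustration): collect definition nodes from several papers and merge repeated terms into shared glossary entries that record every source.<\/p>\n<pre><code class=\"python\"># Illustrative sketch: merge repeated term definitions from several
# papers into shared glossary entries. Node shapes are hypothetical.
papers = {
    'paper_a': [{'label': 'Definition', 'term': 'digital divide'}],
    'paper_b': [{'label': 'Definition', 'term': 'digital divide'}],
}

glossary = {}
for paper_id, nodes in papers.items():
    for node in nodes:
        if node['label'] == 'Definition':
            entry = glossary.setdefault(node['term'], {'sources': []})
            entry['sources'].append(paper_id)

print(glossary['digital divide']['sources'])  # ['paper_a', 'paper_b']<\/code><\/pre>\n<p>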
In our use, if multiple papers defined the same technical term, we could merge those into a single lexical node referenced by all occurrences \u2014 effectively creating a dynamic glossary.<\/p>\n<\/li>\n<\/ul>\n<p><strong>\ud83c\udfaf Goal:<\/strong>\u00a0Demonstrate how Docling transforms the way we submit and process academic work \u2014 from static PDFs to dynamic, structured, and interoperable knowledge objects.<\/p>\n<h3>Docling\u2019s Internal Structure: JSON, Nodes, and Knowledge Graphs<\/h3>\n<p>\ud83d\udd0d\u00a0<strong>Key Points to Cover:<\/strong><\/p>\n<ul>\n<li>\n<p><strong>Everything is a Node:<\/strong>\u00a0In Docling, both linguistic elements and text segments are stored as nodes in a graph. A word, a sentence, a comment, a lexical entry \u2014 all are treated as objects that can be linked. This uniform \u201cnode\u201d abstraction means even metadata or annotations can be nodes. For example, an author name could be a node that links to a bibliography entry node or an affiliation node. In a linguistic context, a morpheme or gloss is a node that links to others (like a word node or a meaning node). For a research paper, we treated sections and paragraphs as nodes.<\/p>\n<\/li>\n<li>\n<p><strong>JSON-Based Format:<\/strong>\u00a0Docling stores all data in clean, human-readable JSON. Each object\/node has fields such as an\u00a0<code>id<\/code>\u00a0(unique identifier),\u00a0<code>type<\/code>\u00a0or\u00a0<code>label<\/code>\u00a0(what kind of node it is), the\u00a0<code>text<\/code>\u00a0content, and potentially references to other nodes (via IDs in a\u00a0<code>links<\/code>\u00a0array). This structure is ideal for versioning, exporting, or feeding into NLP\/LLM pipelines, because it\u2019s both human-readable and machine-readable. 
For instance, a simplified node representation might look like:<\/p>\n<pre><code class=\"python\">{\n    \"id\": \"node123\",\n    \"type\": \"text\",\n    \"label\": \"Sentence\",\n    \"text\": \"The quick brown fox jumps over the lazy dog.\",\n    \"links\": [\"node124\", \"node125\"]\n}<\/code><\/pre>\n<p><em>Example:<\/em>\u00a0The above JSON could represent a sentence node with two links (perhaps linking to nodes 124 and 125 which could be its translation or related notes). In practice, Docling\u2019s JSON might have additional fields (like metadata or provenance info), but the principle is that everything is stored as JSON objects. In our project, we leveraged this by storing entire documents as collections of JSON nodes. Because it\u2019s just JSON, we could easily\u00a0<strong>serialize<\/strong>\u00a0or\u00a0<strong>deserialize<\/strong>\u00a0the data, send it over an API, or store it in Git. In fact, Docling\u2019s Python models use Pydantic, so we often used\u00a0<code>model_dump()<\/code>\u00a0and\u00a0<code>model_dump_json()<\/code>\u00a0to get JSON serializations of nodes or whole documents. This made integration with other systems trivial.<\/p>\n<\/li>\n<li>\n<p><strong>Graph Relationships:<\/strong>\u00a0Relationships between nodes can encode various structures:<\/p>\n<ul>\n<li>\n<p><strong>Parent-child<\/strong>\u00a0(hierarchical structure): e.g., a\u00a0<em>paragraph<\/em>\u00a0node might have child nodes for each\u00a0<em>sentence<\/em>, or a\u00a0<em>chapter<\/em>\u00a0node might have children for\u00a0<em>sections<\/em>. In treebanking or syntactic analysis, parent-child links capture the parse tree.<\/p>\n<\/li>\n<li>\n<p><strong>Semantic links:<\/strong>\u00a0e.g., relations like &#171;related-to&#187;, &#171;supports&#187;, &#171;contradicts&#187;. For a scholarly article, you might link a\u00a0<em>Methodology<\/em>\u00a0section node to a\u00a0<em>Data<\/em>\u00a0node with a relation &#171;uses data from&#187;. 
Or link a\u00a0<em>Conclusion<\/em>\u00a0node to a\u00a0<em>Hypothesis<\/em>\u00a0node with &#171;supports&#187; if the conclusion supports the initial hypothesis.<\/p>\n<\/li>\n<li>\n<p><strong>Alignment links:<\/strong>\u00a0e.g., linking parallel texts or translations. In linguistics, this could link a sentence node in English to its Spanish translation node. In our academic context, we didn\u2019t use this as much, but one could imagine linking an original article node to a node containing an AI-generated summary, for instance.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Custom Annotations:<\/strong>\u00a0Users can define their own annotation layers as needed. Docling isn\u2019t limited to a fixed schema. Some examples of annotations one might include:<\/p>\n<ul>\n<li>\n<p>Grammatical tags (for linguistic corpora, e.g., marking noun, verb, tense on nodes representing words)<\/p>\n<\/li>\n<li>\n<p>Glosses or translations of terms<\/p>\n<\/li>\n<li>\n<p>Comments or notes (for peer review or personal notes, attached to any node)<\/p>\n<\/li>\n<li>\n<p>Metadata like source, author, year, confidence scores, etc.<\/p>\n<\/li>\n<\/ul>\n<p>In our use case, we annotated nodes with provenance information \u2014 each DocItem carried a reference to the page number of the PDF it came from, via a\u00a0<code>prov<\/code>\u00a0field. This way, if our AI agent extracted a quote or a fact, we knew exactly which page of the original PDF it was from (which is important for verification). Because Docling\u2019s data model allowed arbitrary fields, we simply added a\u00a0<code>prov: {page_no: X}<\/code>\u00a0field to each node when parsing.<\/p>\n<\/li>\n<li>\n<p><strong>Multimodal Content Support:<\/strong>\u00a0Though focused on text, Docling supports attachments like audio or images and can link these to transcript nodes. For example, a corpus of oral histories might have audio files linked to text transcripts as nodes. 
While our project dealt mainly with PDFs and text, this feature is great for digital humanists or linguists working with spoken language data \u2014 you can keep the media and the transcription aligned in the graph.<\/p>\n<\/li>\n<li>\n<p><strong>Export Capabilities:<\/strong>\u00a0You can export data from Docling in multiple formats:<\/p>\n<ul>\n<li>\n<p><strong>Complete project JSON:<\/strong>\u00a0The entire corpus or project can be dumped as one JSON (or a folder of JSON files), which is perfect for backups or interchanging data with other systems.<\/p>\n<\/li>\n<li>\n<p><strong>Filtered subsets:<\/strong>\u00a0For instance, you might export just the lexicon nodes, or just a particular branch of the tree (subtree export) if you only want a portion of the data.<\/p>\n<\/li>\n<li>\n<p><strong>CSV:<\/strong>\u00a0Tabular export for lexicons or structured data (e.g., you could export a list of all example sentences with their translations in a CSV).<\/p>\n<\/li>\n<li>\n<p><strong>Static HTML:<\/strong>\u00a0Useful for archiving or sharing, Docling can generate a read-only HTML view of your project (for example, a nicely formatted tree or a dictionary).<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Visualization Options:<\/strong>\u00a0Docling includes built-in tree and table views. Trees help linguists see sentence structure or help a researcher visually map the logic of an argument. Tables make it easy to scan lexical entries or grammatical paradigms. 
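<\/p>\n<p>A tree view of this kind is easy to approximate in code; the following sketch (with invented node data and a\u00a0<code>parent<\/code>\u00a0field, not Docling\u2019s exact model) prints an argument hierarchy as indented text:<\/p>\n<pre><code class=\"python\"># Sketch: render parent-child nodes as an indented tree.
# The node dict and 'parent' field are illustrative only.
nodes = {
    'doc': {'label': 'Article', 'parent': None},
    'hyp': {'label': 'Hypothesis', 'parent': 'doc'},
    'ev1': {'label': 'Evidence', 'parent': 'hyp'},
    'ev2': {'label': 'Evidence', 'parent': 'hyp'},
}

def print_tree(node_id, depth=0):
    print('  ' * depth + nodes[node_id]['label'])
    for child_id, node in nodes.items():
        if node['parent'] == node_id:
            print_tree(child_id, depth + 1)

print_tree('doc')<\/code><\/pre>\n<p>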
We found the tree view especially insightful when mapping the structure of arguments in a paper\u2014seeing a visual tree of how evidence nodes branched from hypothesis nodes, for example, helped in both human understanding and designing AI prompts.<\/p>\n<\/li>\n<\/ul>\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/72b\/844\/7aa\/72b8447aae8732fad82d42fce21c852e.webp\" alt=\"Docling Pipeline from docling paper\" width=\"1400\" height=\"678\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/72b\/844\/7aa\/72b8447aae8732fad82d42fce21c852e.webp 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/72b\/844\/7aa\/72b8447aae8732fad82d42fce21c852e.webp 781w\" loading=\"lazy\" decode=\"async\"\/><\/p>\n<div><figcaption>Docling Pipeline from docling paper<\/figcaption><\/div>\n<\/figure>\n<p>\ud83e\udde0\u00a0<strong>Goal:<\/strong>\u00a0Emphasize that Docling isn&#8217;t just a storage system \u2014 it&#8217;s a modular, machine-readable representation of structured thought, designed for downstream applications in AI, linguistics, and academic publishing. The data model\u2019s transparency (JSON and nodes) means it\u2019s both\u00a0<strong>human-readable<\/strong>\u00a0for collaboration and\u00a0<strong>machine-readable<\/strong>\u00a0for computation.<\/p>\n<h3>Use Case: Docling in Action<\/h3>\n<p>\ud83e\uddea\u00a0<strong>Example Scenario: Research Paper on \u201cDigital Inequality\u201d<\/strong><\/p>\n<p>Imagine a researcher working on a study titled\u00a0<em>&#171;Digital Inequality in Rural Education Systems.&#187;<\/em>\u00a0Rather than submitting it as a static PDF, the researcher imports the document into Docling to structure it as a dynamic knowledge object. 
Here\u2019s how that might play out:<\/p>\n<p><strong>\ud83e\udde9 Step-by-Step Breakdown:<\/strong><\/p>\n<ul>\n<li>\n<p><strong>Import and Segmentation:<\/strong>\u00a0The original text is ingested into Docling and segmented into its logical parts \u2013 say, introduction, hypotheses, methodology, findings, conclusion, etc. Each paragraph (or even each sentence, depending on granularity) becomes a node in the Docling environment. For instance, the introduction might be a parent node that contains child nodes for each paragraph.<\/p>\n<\/li>\n<li>\n<p><strong>Semantic Tagging and Annotation:<\/strong>\u00a0Key terms like\u00a0<em>\u201cdigital divide,\u201d \u201cinfrastructure,\u201d \u201cpolicy intervention,\u201d<\/em>\u00a0and\u00a0<em>\u201csocioeconomic status\u201d<\/em>\u00a0are identified and tagged. In Docling, you might create nodes for each of these concepts and then link every mention of them throughout the document. These terms become clickable threads running through the study. So if\u00a0<em>\u201cdigital divide\u201d<\/em>\u00a0is mentioned in both the introduction and the findings, both instances link back to the same concept node\u00a0<em>Digital Divide<\/em>\u00a0in the knowledge graph, forming semantic connections across the paper.<\/p>\n<\/li>\n<li>\n<p><strong>Argument Mapping:<\/strong>\u00a0The central hypothesis \u2014\u00a0<em>\u201cAccess to digital infrastructure strongly correlates with academic performance in rural regions\u201d<\/em>\u00a0\u2014 is represented as a central node (perhaps of type\u00a0<em>Hypothesis<\/em>). Supporting evidence nodes (data tables, citations of prior studies, survey results) are added as child nodes underneath, visually connected in a reasoning tree. In a Docling tree view, you would literally see the hypothesis at the center with branches to each piece of evidence supporting it. 
If there are counter-arguments or conflicting data, those could be nodes as well, linked with a relationship like\u00a0<em>contradicts<\/em>\u00a0or\u00a0<em>challenges<\/em>.<\/p>\n<\/li>\n<li>\n<p><strong>Peer Collaboration:<\/strong>\u00a0Docling\u2019s collaborative features let a co-author or reviewer join in. A co-author might add comments to a specific argument chain, e.g. suggesting an alternative interpretation of the survey results node. Another researcher could even link this study\u2019s corpus to a separate Docling corpus on urban digital policy, drawing connections between ideas across projects (for example, linking the concept node\u00a0<em>Digital Infrastructure<\/em>\u00a0in this rural study to a similar node in the urban study). This cross-project linking can enrich both projects\u2019 context.<\/p>\n<\/li>\n<li>\n<p><strong>Export and Integration:<\/strong>\u00a0Once the paper is structured, it can be exported for various uses. The entire graph of the paper can be\u00a0<strong>exported as a JSON file<\/strong>\u00a0for use in an LLM training pipeline or other AI analysis. (In our case, we actually did this for a test: after structuring a paper, we exported the metadata and content to a JSON and fed it to a language model to see if it could generate a summary. The structured JSON input made the summaries more accurate because the model could see the labeled sections and relationships, not just raw text.) A simplified HTML view of the paper\u2019s graph can also be generated for a public repository or website, allowing readers to interact with the content without using Docling directly. Key metadata (like the fields shown in the earlier JSON example: title, authors, keywords, etc.) can be extracted and fed into indexing engines, so that the article is findable in a digital library with all its details.<\/p>\n<\/li>\n<li>\n<p><strong>Outcome:<\/strong>\u00a0The study is no longer a flat document. 
It\u2019s now:<\/p>\n<ul>\n<li>\n<p><strong>Searchable:<\/strong>\u00a0Because every piece of information is a node with explicit content and metadata, you can query the corpus (e.g., find all evidence nodes that support a certain type of claim, or search within conclusions across a corpus of papers).<\/p>\n<\/li>\n<li>\n<p><strong>Interactive:<\/strong>\u00a0Readers or reviewers can click through the graph of nodes. One can jump from reading a sentence to seeing related data or definitions instantly.<\/p>\n<\/li>\n<li>\n<p><strong>Modular:<\/strong>\u00a0Pieces of this paper can be detached or remixed. For example, the literature review section (as a subtree of nodes) could be reused or compared with the literature review of another paper on a similar topic.<\/p>\n<\/li>\n<li>\n<p><strong>Integrated with a broader ecosystem of knowledge:<\/strong>\u00a0Since the content is structured and linked, it can connect to external corpora or knowledge bases. Our imagined\u00a0<em>Digital Inequality<\/em>\u00a0corpus could link to data sets, to policy documents, or to educational resources, making the paper a living part of a knowledge network rather than an isolated PDF on someone\u2019s hard drive.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>\ud83c\udfaf\u00a0<strong>Goal:<\/strong>\u00a0Illustrate how Docling transforms a traditional academic article into a living, connected artifact \u2014 one that supports richer interpretation, reuse, collaboration, and AI-driven exploration.<\/p>\n<h3>Connection to LLMs and AI<\/h3>\n<p>Docling\u2019s structured approach to text makes it particularly powerful in the age of AI and large language models:<\/p>\n<ul>\n<li>\n<p><strong>Structured corpora for training and analysis:<\/strong>\u00a0Large Language Models (LLMs) learn better from structured, annotated corpora. 
Docling can provide a trove of well-structured text: for example, a treebank of sentences for a low-resource language, or a graph of arguments and evidence from academic papers. Instead of feeding an LLM raw text, we could feed it JSON from Docling that explicitly labels sections, relationships, and metadata. This could improve fine-tuning processes or prompt engineering, as the model can be guided by the structure (e.g., \u201cuse the Conclusion nodes of papers as training data for summarization\u201d). The clear delineation of pieces (titles, abstracts, etc.) means we can easily assemble training sets of just those pieces.<\/p>\n<\/li>\n<li>\n<p><strong>Docling as a reasoning interface for LLMs:<\/strong>\u00a0One intriguing idea is to use Docling in combination with an LLM to perform logical reasoning or question-answering. Since Docling provides a knowledge graph of a document (or multiple documents), an LLM can navigate that graph via tools. In our project\u2019s prototype, we set up an LLM-based agent with tools that knew how to fetch parts of the Docling graph (like \u201cget_title\u201d, \u201cget_authors\u201d, or even \u201cfind_node containing X\u201d). The agent didn\u2019t have to read the entire text blindly; it could ask for specific pieces via those tools. This MRKL (Modular Reasoning, Knowledge and Language) approach allowed the LLM to, for example, first retrieve the abstract node, then look for all nodes labeled as\u00a0<em>Conclusion<\/em>, and then form an answer. Docling essentially acted as the\u00a0<strong>database<\/strong>\u00a0and\u00a0<strong>interface<\/strong>\u00a0through which the LLM reasoned about the content. The result was more accurate and traceable, because we knew which nodes the AI consulted for an answer.<\/p>\n<\/li>\n<li>\n<p><strong>Integration pipelines (JSON export to LLM input):<\/strong>\u00a0We\u2019ve shown an example of exporting a Docling project to JSON and using it as an input for an LLM. 
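<\/p>\n<p>As a minimal sketch of that pipeline (the node shape follows the simplified JSON examples above; the sample data is invented), one can filter an exported node list by label before handing it to a model:<\/p>\n<pre><code class=\"python\">import json

# Assumed export shape: a list of node dicts with 'label' and 'text'
# fields, as in the simplified examples above; the data is illustrative.
exported = json.dumps([
    {'id': 'n1', 'label': 'Title', 'text': 'Digital Inequality in Rural Education Systems'},
    {'id': 'n2', 'label': 'Conclusion', 'text': 'Access correlates with performance.'},
])

nodes = json.loads(exported)
# Keep only the pieces we want the model to see, e.g. conclusion nodes.
conclusions = [n['text'] for n in nodes if n['label'] == 'Conclusion']
print(conclusions)  # ['Access correlates with performance.']<\/code><\/pre>\n<p>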
It\u2019s worth noting that this integration doesn\u2019t have to be custom \u2014 one could imagine a future where Docling has a direct plugin to feed data to an LLM or where an LLM tool is built into Docling for querying the corpus. In our use, we manually orchestrated it with Python code and LangChain, but the principle is general: <em>Docling provides structured knowledge; LLMs provide reasoning and language generation.<\/em>\u00a0Together they can form an AI that is grounded in a curated knowledge base.<\/p>\n<\/li>\n<\/ul>\n<p>In summary, Docling\u2019s compatibility with AI workflows comes from its\u00a0<strong>structured, open format<\/strong>. JSON outputs and an object model mean you can easily write a script to take all nodes of type \u201cExample Sentence\u201d and send them to a translation model, or use an LLM to fill in missing links in the graph (imagine an AI that suggests new connections between nodes or auto-populates metadata based on content). This symbiosis of Docling and LLMs holds a lot of promise for both academic research and advanced NLP applications.<\/p>\n<h3>Expansion and Publishing<\/h3>\n<p>Finally, let\u2019s consider how Docling scales up to collaborative projects, other tools, and educational settings beyond just individual research use.<\/p>\n<h4>Collaboration Features<\/h4>\n<ul>\n<li>\n<p><strong>Multi-user editing:<\/strong>\u00a0Docling\u2019s structure makes it easy for multiple researchers to work on the same corpus simultaneously. Because each piece of data is discrete (as nodes), collaborators can add or edit different parts of the project without stepping on each other\u2019s toes. For instance, one person can be annotating examples in the corpus while another is writing commentary nodes elsewhere. 
(On a technical note, since everything is JSON, using version control like Git allows merging changes from different contributors relatively easily.)<\/p>\n<\/li>\n<li>\n<p><strong>Version control:<\/strong>\u00a0Because data is stored as text-based JSON files, changes can be tracked using Git or similar systems. Every edit to a node shows up as a diff. In our experience, we put our Docling project under Git, and we could see line-by-line what changed in the JSON after each editing session. This transparency is fantastic for academic teams who need to maintain a clear history of how data (or analyses) evolve over time.<\/p>\n<\/li>\n<li>\n<p><strong>Peer-review mode:<\/strong>\u00a0Imagine a mode where a reviewer or editor can leave structured comments directly linked to specific nodes (rather than margin notes on a PDF). Docling can support this by treating reviewer comments as nodes linked to the sections or sentences they pertain to. This way, a peer review becomes part of the knowledge graph \u2013 the comment is anchored to the exact point of critique, and the author can address it, even track its resolution through versions.<\/p>\n<\/li>\n<\/ul>\n<h4>Integration with Other Tools<\/h4>\n<ul>\n<li>\n<p><strong>Zotero:<\/strong>\u00a0Citation managers like Zotero could integrate with Docling. For example, you could export all references from a Docling project (since they might be nodes of type\u00a0<em>Reference<\/em>) to a format Zotero understands, or vice versa, import from Zotero to Docling. This helps keep your Docling corpus and your bibliography manager in sync.<\/p>\n<\/li>\n<li>\n<p><strong>ELAN and FLEx:<\/strong>\u00a0Many linguists use tools like ELAN (for annotating audio\/video transcripts) and FLEx (FieldWorks Language Explorer for building dictionaries and grammars). Docling can serve as a unifying platform by importing data from these tools. 
If you have an ELAN transcript of an interview, you could import it into Docling\u2019s graph structure to link parts of the transcript to linguistic notes or translations. FLEx lexicons could be imported as Docling lexicon nodes. Conversely, Docling\u2019s data could be exported to these formats if needed for analysis in those tools.<\/p>\n<\/li>\n<li>\n<p><strong>Obsidian &amp; Notion:<\/strong>\u00a0Personal knowledge management systems like Obsidian or Notion thrive on linking notes \u2013 something Docling does inherently. With Docling\u2019s lightweight JSON exports or possible API endpoints, you could sync data to an Obsidian vault or a Notion database. For example, each Docling node could become a note in Obsidian (with backlinks corresponding to Docling links). This would bridge academic research data and personal notes, allowing researchers to move seamlessly between a Docling corpus and their broader note-taking system.<\/p>\n<\/li>\n<\/ul>\n<h4>Educational Potential<\/h4>\n<ul>\n<li>\n<p><strong>Interactive assignments:<\/strong>\u00a0Docling can be used in the classroom. Instead of giving students a static text, a teacher could give them a Docling corpus of an article or a short story. Students could be asked to explore it non-linearly \u2013 for instance, follow the argument nodes and identify where the argument might have weaknesses, or find all definitions of key terms via the graph. This trains a more critical, exploratory reading habit.<\/p>\n<\/li>\n<li>\n<p><strong>Custom teaching corpora:<\/strong>\u00a0In language learning or linguistics education, an instructor can build a tailored corpus (a set of sentences, a mini-dictionary, etc.) in Docling for the class. Students can then use Docling to see the structure of sentences (treebanks), check glosses of words, hear audio linked to text, etc. 
It becomes a hands-on learning platform.<\/p>\n<\/li>\n<li>\n<p><strong>Case studies:<\/strong>\u00a0An academic article ingested into Docling can become an interactive learning module. For example, a history paper in Docling might allow students to click on a date and see a timeline, or click on a reference and see the full source. Each section of the paper could link to background information nodes, dataset nodes, or multimedia. Essentially, Docling can turn a linear text into a multimedia knowledge experience, which is great for teaching complex material.<\/p>\n<\/li>\n<\/ul>\n<p>\ud83e\udde0\u00a0<strong>Goal:<\/strong>\u00a0Show that Docling is not just a technical parsing tool, but also a collaboration platform, an educational resource, and an archival framework for future-oriented research workflows. Its uses extend from solo researchers structuring their notes, to teams building shared knowledge graphs, to teachers creating interactive content for students.<\/p>\n<h3>Conclusion: Why Docling Matters<\/h3>\n<p><strong>\u2705 What We Especially Liked<\/strong><\/p>\n<ul>\n<li>\n<p><em>Modular structure and JSON backbone:<\/em>\u00a0The platform\u2019s reliance on simple JSON and discrete nodes makes it incredibly flexible and interoperable. We were able to integrate Docling into our existing workflows with minimal friction, and we know we can always export our data and use it elsewhere if needed. Machine-readability by default is a huge plus in an era of data-driven research.<\/p>\n<\/li>\n<li>\n<p><em>Graph-based interface:<\/em>\u00a0Navigating complex research relationships and syntactic structures is intuitive with Docling\u2019s graph view. It\u2019s like having a mind-map of your document. 
This visual and structural approach is a refreshing change from the typical linear doc editor and unveils connections you might otherwise miss.<\/p>\n<\/li>\n<li>\n<p><em>Open-source and lightweight:<\/em>\u00a0Docling is open-source and not a massive enterprise system, which means it\u2019s easy to adopt, fork, or extend. We appreciate not being locked into a vendor and being able to contribute or customize the platform to fit our needs (for example, we wrote custom scripts on top of Docling\u2019s data model, which was feasible because the code and data format are accessible).<\/p>\n<\/li>\n<li>\n<p><em>Multidisciplinary scope:<\/em>\u00a0Docling appeals to linguists, AI researchers, digital humanists, and educators alike. Whether you\u2019re documenting an endangered language, parsing legal documents, or teaching a literature class, the core idea of node-based, annotated text is universally useful. It\u2019s rare to find a tool that can cater to such different audiences without feeling too narrow or too generic \u2014 Docling strikes a good balance.<\/p>\n<\/li>\n<\/ul>\n<p><strong>\ud83d\udee0\ufe0f What Could Be Improved<\/strong><\/p>\n<ul>\n<li>\n<p><em>Documentation:<\/em>\u00a0While the platform is conceptually elegant, its onboarding could benefit from more tutorials, templates, and real-world examples. We had to read through some code and community examples to fully grasp the best practices. For wider adoption, a gentle introduction (maybe a \u201cDocling for Dummies\u201d guide or more sample projects) would be valuable.<\/p>\n<\/li>\n<li>\n<p><em>User Interface (UI):<\/em>\u00a0The current UI, though functional, may feel a bit bare-bones or technical for less tech-savvy users. A more polished, user-friendly interface (without losing the advanced features) could broaden Docling\u2019s adoption. 
For instance, more drag-and-drop actions, or a guided mode for newbies, would help non-programmers embrace the tool.<\/p>\n<\/li>\n<li>\n<p><em>Localization and accessibility:<\/em>\u00a0Better internationalization (i18n) support would make Docling more inclusive for non-English or multi-lingual research communities. Also, ensuring the interface meets accessibility standards (for users with disabilities) could improve its utility in educational settings.<\/p>\n<\/li>\n<\/ul>\n<p><strong>\ud83e\udd14 Why We Chose Docling:<\/strong>\u00a0In the context of building a research publication platform, we needed a tool that could\u00a0<em>structure<\/em>\u00a0scientific thought, not just store it. Docling stood out as a rare blend of linguistic depth, technical openness, and adaptability to modern AI workflows. Its potential as a reasoning interface, not just a text editor, aligned closely with our goal of developing next-generation academic tools that treat knowledge as dynamic and connected.<\/p>\n<p><strong>\ud83d\udd2d What\u2019s Next:<\/strong>\u00a0Going forward, we plan to do a pilot implementation in our journal\u2019s submission pipeline, using Docling to parse and enrich selected articles. This means authors might submit a paper and get back a Docling representation alongside the usual PDF, which could then be used in peer review or in generating enhanced publication formats. We\u2019re also building a custom JSON-to-LLM pipeline: essentially, using Docling\u2019s export to feed an LLM that helps with semantic analysis (like automated highlights or consistency checks in a paper). 
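The JSON-to-LLM step mentioned above can be sketched with plain string assembly. The node fields and prompt wording below are illustrative assumptions, not Docling's actual schema or our production prompt; the point is only that a structured export lets us label each chunk of context before handing it to a model.

```python
# Hypothetical node dicts, shaped like a Docling JSON export (illustrative only).
nodes = [
    {"label": "Title", "text": "Sample Paper"},
    {"label": "Abstract", "text": "We study X."},
    {"label": "Conclusion", "text": "X holds under condition Y."},
]

def build_consistency_prompt(nodes: list[dict]) -> str:
    """Embed labeled nodes into a prompt asking an LLM for a consistency check."""
    context = "\n".join(f"[{n['label']}] {n['text']}" for n in nodes)
    return (
        "You are reviewing a structured article. Check whether the Conclusion "
        "is consistent with the Abstract.\n\n" + context
    )

prompt = build_consistency_prompt(nodes)
print(prompt)
```

The resulting prompt keeps the document's structure visible to the model, which is what makes checks like "does the Conclusion follow from the Abstract" addressable at all.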
Additionally, we are exploring the development of teaching modules based on Docling for training students in argument mapping and linguistic analysis \u2013 imagine a classroom where students collaboratively annotate a text in Docling and then query it with an AI assistant.<\/p>\n<p><strong>\ud83c\udfaf Final Thought:<\/strong>\u00a0Docling isn\u2019t just another digital tool \u2014 it\u2019s a philosophical shift in how we think about text, structure, and meaning. In a world increasingly driven by large language models and automated workflows, platforms like Docling help us preserve intention, logic, and nuance in our documents \u2014 one node at a time.<\/p>\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/4b9\/4b8\/486\/4b94b8486a7890df03dd0db43d027159.jpg\" alt=\"\u00a9 Photo of the author of this article \u2014 Mykhailo Mykhailovich Kapustin.\" title=\"\u00a9 Photo of the author of this article \u2014 Mykhailo Mykhailovich Kapustin.\" width=\"960\" height=\"1280\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/4b9\/4b8\/486\/4b94b8486a7890df03dd0db43d027159.jpg 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/4b9\/4b8\/486\/4b94b8486a7890df03dd0db43d027159.jpg 781w\" loading=\"lazy\" decode=\"async\"\/><\/p>\n<div><figcaption>\u00a9 Photo of the author of this article \u2014 Mykhailo Mykhailovich Kapustin.<\/figcaption><\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<p><!----><!----><\/div>\n<p><!----><!----><br \/> Link to the original article <a href=\"https:\/\/habr.com\/ru\/articles\/935584\/\"> https:\/\/habr.com\/ru\/articles\/935584\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div 
class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<figure class=\"full-width\">\n<div><figcaption>Docling in Working with Texts, Languages, and Knowledge<\/figcaption><\/div>\n<\/figure>\n<p>Hi everyone. In the context of our research project, we were solving the problem of automating academic submission workflows, which led us to discover a platform called\u00a0<strong>Docling<\/strong>.<\/p>\n<p>Together, we explore the role of Docling in reshaping how research data can be represented, reused, and reasoned over in both human and machine-readable formats.<\/p>\n<p>As part of the development of a scientific research project \u201cAdvanced Scientific Research Projects\u201d (ASRP) aimed at creating a new-generation academic journal named\u00a0<a href=\"https:\/\/asrp.science\/en\" rel=\"noopener noreferrer nofollow\">ASRP.science<\/a>, we encountered a number of methodological and technical challenges. One of the key issues was the automation of parsing academic research documents in order to streamline the submission, processing, and archiving of materials provided by researchers. 
Modern digital publications increasingly require tools that not only accept text submissions but also structure knowledge, annotate content, handle metadata, terminology, and interlinked concepts, and ensure that outputs are machine-readable for downstream use.<\/p>\n<p>During the initial stages of analysis and prototyping, we formulated the following goal:<\/p>\n<blockquote>\n<p><em>Develop or adapt a platform capable of effectively handling linguistically and conceptually annotated texts, compatible with academic publication formats, and supporting export and interoperability with LLM pipelines and knowledge bases.<\/em><\/p>\n<\/blockquote>\n<p>While exploring available solutions, we came across an open-source project called\u00a0<strong>Docling<\/strong>, designed for linguists, researchers, and digital humanists. Although originally built for working with language and text in a linguistic context, Docling turned out to be surprisingly well-aligned with our needs.<\/p>\n<p><strong>Why Docling Was a Relevant Choice:<\/strong><\/p>\n<ul>\n<li>\n<p><strong>Structured data format (JSON):<\/strong>\u00a0Docling stores data in a structured, lightweight JSON format, which integrates easily into research pipelines and software development workflows. This made it easy for us to feed Docling outputs into other tools.<\/p>\n<\/li>\n<li>\n<p><strong>Graph- and tree-based knowledge representation:<\/strong>\u00a0It supports graph and tree representations of knowledge, crucial for semantic parsing and linking concepts. A research paper in Docling isn\u2019t just text; it becomes a network of interrelated nodes (e.g., sections, definitions, examples).<\/p>\n<\/li>\n<li>\n<p><strong>Flexible corpora creation:<\/strong>\u00a0We can create flexible corpora, including lexical databases or grammatical descriptions, within the same environment. 
This was useful for building a \u201clanguage archive\u201d of terms and definitions encountered in our papers.<\/p>\n<\/li>\n<li>\n<p><strong>Visualization and extensibility:<\/strong>\u00a0Docling offers visualization capabilities (tree views, tables) and is modular\/extensible. We could visualize argument structures or sentence parse trees, then extend the platform with custom scripts.<\/p>\n<\/li>\n<li>\n<p><strong>Interface with AI systems:<\/strong>\u00a0Perhaps most interestingly, Docling can serve as an interface for interaction with AI systems, including LLMs. The structured outputs (in JSON) can be fed into machine learning pipelines or used to improve prompt engineering by providing context in a structured way.<\/p>\n<\/li>\n<\/ul>\n<p>In the following sections, we will explore the architecture of Docling, its core features, example use cases, and the platform\u2019s potential for adaptation within our research infrastructure.<\/p>\n<h3>Docling in Academic Submission Workflows<\/h3>\n<p><strong>\ud83e\udde0 Key Points to Cover:<\/strong><\/p>\n<ul>\n<li>\n<p><strong>Automatic structuring of research content:<\/strong>\u00a0Docling allows users to break down complex research documents into modular, interlinked nodes (e.g. hypotheses, arguments, citations, definitions) instead of treating a document as one big block of text. For example, in our project pipeline we use Docling\u2019s API to parse an uploaded PDF into a graph of\u00a0<em>nodes<\/em>. Each node might correspond to a section heading, a paragraph, a bibliography entry, etc., all linked together by the document\u2019s structure. 
Programmatically, this looks like:<\/p>\n<\/li>\n<\/ul>\n<pre><code class=\"python\">from docling.document_converter import DocumentConverter\n\n# Parse a PDF file into a DoclingDocument and iterate over its nodes\nconverter = DocumentConverter()\ndocling_doc = converter.convert(\"sample_article.pdf\").document\nfor node, _ in docling_doc.iterate_items():\n    data = node.model_dump()  # convert DocItem to a Python dict\n    print(f\"{data['label']}: {(data.get('text') or '')[:50]}...\")<\/code><\/pre>\n<p><em>Code example: Using Docling to convert a PDF into structured nodes.<\/em>\u00a0In this snippet, each\u00a0<code>node<\/code>\u00a0is a Docling\u00a0<strong>DocItem<\/strong>\u00a0with a\u00a0<code>label<\/code>\u00a0(type of content, e.g. Title, Heading, Paragraph) and\u00a0<code>text<\/code>\u00a0content. This kind of automated segmentation means a submitted paper is not just a single blob of text; it\u2019s a hierarchy of interrelated parts. For instance, running the above on a sample article prints something like:<\/p>\n<pre><code class=\"python\">Title: Forecasting Social, Geopolitical, and Economic Events Using the 'Banchenko-Technology'\nParagraph: The unconscious can be understood as an entity subject to certain laws and in a dynamic relationship with consciousness...<\/code><\/pre>\n<p>Each line represents a node (with its type and an excerpt of text). Under the hood, Docling has taken the document and broken it into pieces, classifying them (e.g., that first node was recognized as a Title).<\/p>\n<ul>\n<li>\n<p><strong>Text + metadata hybrid:<\/strong>\u00a0Each node in Docling can contain text (the content of that segment, such as a paragraph or example sentence)\u00a0<em>and<\/em>\u00a0associated metadata. Metadata might include the author, source, tags, time period, related terms, or any custom fields you define. This is ideal for academic articles that need to be parsed, sorted, and indexed. 
In our pipeline, for instance, we could attach metadata like page numbers or confidence scores to each node. Docling\u2019s data model (built on Pydantic) makes it easy to handle these metadata fields as Python objects. (In code, we simply call\u00a0<code>node.model_dump()<\/code>\u00a0to get a JSON-ready dict of all of a node\u2019s fields, including text and any annotations.)<\/p>\n<\/li>\n<li>\n<p><strong>Knowledge graph of the paper:<\/strong>\u00a0A submitted paper is not just a static PDF \u2014 once in Docling, it becomes a mini knowledge graph. Sections, arguments, and concepts are connected and queryable. For example, a\u00a0<em>Conclusion<\/em>\u00a0section node might have links to multiple\u00a0<em>Result<\/em>\u00a0section nodes that it references. Docling inherently supports linking any node to any other, allowing us to represent relationships like \u201cFigure X illustrates Theory Y\u201d or \u201cDefinition A is used in Section B\u201d. The result is that the linear document transforms into a network of information. Reviewers (or algorithms) can trace the logic of an argument through this tree\/graph structure, instead of being confined to linear reading. This was a huge plus for us in thinking about machine-assisted peer review.<\/p>\n<\/li>\n<li>\n<p><strong>Better peer-review preparation:<\/strong>\u00a0Because of the structured, node-based approach, reviewers can traverse the argumentation structure more intuitively. For example, they can quickly isolate the central hypothesis node and see all evidence nodes linked to it, rather than hunting through the text. 
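A minimal sketch of that traversal, using an illustrative nodes-plus-links representation (an assumption for the example, not Docling's actual schema): isolate the hypothesis node, then collect every node linked to it as support.

```python
# Illustrative graph: nodes keyed by id, plus (source, target, relation) links.
# The shape is an assumption for this sketch, not Docling's actual schema.
nodes = {
    "n1": {"label": "Hypothesis", "text": "Structured review improves accuracy."},
    "n2": {"label": "Evidence", "text": "Reviewers found errors faster."},
    "n3": {"label": "Evidence", "text": "Traceable answers cited exact nodes."},
    "n4": {"label": "Paragraph", "text": "Background discussion."},
}
links = [("n2", "n1", "supports"), ("n3", "n1", "supports")]

def evidence_for(node_id: str) -> list[str]:
    """Collect the text of all nodes linked to the given node as support."""
    return [nodes[src]["text"] for src, dst, rel in links
            if dst == node_id and rel == "supports"]

# Isolate the central hypothesis node, then pull its linked evidence.
hypothesis = next(nid for nid, n in nodes.items() if n["label"] == "Hypothesis")
print(evidence_for(hypothesis))
```

The reviewer's question "what supports this claim?" becomes a one-line graph query instead of a search through linear text.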
This tree-like structure supports critical thinking and logic validation by making the paper\u2019s knowledge graph explicit.<\/p>\n<\/li>\n<li>\n<p><strong>Interoperability with research platforms:<\/strong>\u00a0Docling\u2019s JSON structure can be easily integrated into other tools:<\/p>\n<ul>\n<li>\n<p><strong>Research management tools (like Zotero):<\/strong>\u00a0We can export bibliographic metadata or structured references from Docling and import them into citation managers.<\/p>\n<\/li>\n<li>\n<p><strong>Semantic indexing engines:<\/strong>\u00a0Because each piece of content is a node with metadata, we can feed the collection into a semantic search or indexing system. For instance, one could dump all Docling nodes into an Elasticsearch index, enabling fine-grained search (find all occurrences of a certain concept, or all evidence supporting a given claim).<\/p>\n<\/li>\n<li>\n<p><strong>Machine learning pipelines:<\/strong>\u00a0Perhaps most significantly, the clean JSON output can feed ML pipelines for automated analysis or review support. We experimented with an AI agent that consumes Docling-structured data to extract insights. For example, given a Docling-parsed document, our AI agent can answer questions like\u00a0<em>\u201cWhat are the author affiliations?\u201d<\/em>\u00a0by traversing the JSON structure. We simply provide the agent with the structured text and ask for what we need. 
In code, it looked like this:<\/p>\n<pre><code class=\"python\">import json\n\n# agent_executor: a LangChain agent configured elsewhere with Docling-backed tools\nquery = \"Extract all metadata from the document and return a single JSON object.\"\ninput_data = {\n    \"messages\": [(\"user\", query)],\n    \"file_path\": \"sample_article.pdf\"\n}\n\n# Invoke the LLM agent (inside an async FastAPI handler) with the structured document as context\nresult_state = await agent_executor.ainvoke(input_data)\nfinal_answer = result_state[\"messages\"][-1].content\nmetadata = json.loads(final_answer)<\/code><\/pre>\n<p><em>Code example: Using a LangChain-powered agent to query a Docling document.<\/em>\u00a0In our FastAPI service, we pass the user\u2019s query and the document\u2019s file path into a LangChain agent (<code>agent_executor<\/code>). The agent has been configured with tools that interface with Docling \u2013 for example, one tool can get the document\u2019s title, another can get the authors, etc., all by utilizing the Docling-parsed content under the hood. The agent\u2019s final answer (here,\u00a0<code>final_answer<\/code>) is a JSON string. We then parse it into a Python dict\u00a0<code>metadata<\/code>. This\u00a0<strong>metadata JSON<\/strong>\u00a0is the result of the LLM analyzing the Docling structure of the document. It might look something like:<\/p>\n<pre><code class=\"json\">{\n    \"title\": \"Forecasting Social, Geopolitical, and Economic Events Using the 'Banchenko-Technology'\",\n    \"authors\": [\n        \"Denis Banchenko\",\n        \"Mykhailo Kapustin\"\n    ],\n    \"abstract\": \"This article presents a study on the interdependence between subjective experience gained through lucid dreaming and objectively observable processes dependent on collective consciousness. Relying on theoretical research in the field of consciousness, collective conscious and unconscious, and experimental data, assumptions about the nature of such interrelation have been made. 
A concept is proposed for discussion on the formation of structures capable of utilizing such phenomena for controlled types of activities such as: making economic-political decisions, managing investments, and shaping social transformations. The article introduces a digital system for managing event forecasts and market trends developed by BlackRock<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-470125","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/470125","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=470125"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/470125\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=470125"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=470125"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=470125"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}