Getting started

This guide provides a brief introduction to authenticating with and using the API for document data extraction. Feel free to get in touch with us at support@datawollet.com if you have any questions.

Base URL

Environment   URL
Production    https://api.datawollet.com
Sandbox       https://api.sandbox.datawollet.com

The sandbox environment supports a subset of documents and is intended for development and testing. All examples in this guide use the production URL.

info

Documents that are not representative of real documents, whether in their logos, their layout, or because content is heavily redacted, are unlikely to work due to the nature of the neuro-symbolic AI used by DataWollet. Realistic examples with anonymised and substituted data (e.g. John Doe naming) are available on request.

Supported documents

The API accepts PDF, JPEG, and PNG files up to 4.5 MB. Supported document types include:

  • Bank statements
  • Utility bills (energy, water, broadband)
  • Council tax bills
  • Payslips
  • Mortgage statements and illustrations
  • Insurance proposals and schedules
  • Identity documents (passports, driving licences)

For PDFs with embedded text, the API uses that text content directly. Scanned images and image-only PDFs are processed with OCR, which requires the documents:ocr scope.

info

DataWollet automatically identifies pages in PDFs without embedded text content and will attempt to identify and correct for page rotation and skew. Scans with significant or varying skew, blurred content, and misaligned items are likely to produce poor results.
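The format and size limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the API: the names check_upload, MAX_UPLOAD_BYTES and ALLOWED_EXTENSIONS are hypothetical, and it assumes that "4.5 MB" means 4,500,000 bytes (the stricter reading) and that the file extension reflects the real file format.

```python
from pathlib import Path

# Assumption: "4.5 MB" is read as 4.5 * 1000 * 1000 bytes, the stricter interpretation.
MAX_UPLOAD_BYTES = 4_500_000

# File extensions for the accepted formats (PDF, JPEG, PNG).
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def check_upload(filename: str, size_in_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_in_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_in_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems
```

A check like this only catches obvious rejections early; the API remains the authority on what it accepts.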

Authentication

Obtain an access token using the OAuth2 client credentials grant. You'll need your client_id and client_secret, provided by DataWollet during onboarding.

curl -X POST https://api.datawollet.com/oauth/token \
-H "Content-Type: application/json" \
-d '{
"grant_type": "client_credentials",
"client_id": "your_client_id",
"client_secret": "your_client_secret"
}'

The response includes a Bearer token:

{
"access_token": "eyJhbGciOiJS...",
"token_type": "Bearer",
"expires_in": 3600,
"scope": "documents:initiate documents:write sessions:read sessions:write"
}

Include this token in subsequent requests:

Authorization: Bearer eyJhbGciOiJS...
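The same token request can be built from Python. This is a minimal sketch using only the standard library: build_token_request and TokenCache are hypothetical helper names, and the 60-second refresh margin is an arbitrary choice rather than an API requirement. The request is built but not sent; passing it to urllib.request.urlopen would perform the call.

```python
import json
import time
import urllib.request
from typing import Optional

TOKEN_URL = "https://api.datawollet.com/oauth/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the client credentials token request."""
    payload = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }
    return urllib.request.Request(
        TOKEN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


class TokenCache:
    """Hold a token and report whether it is still safe to reuse."""

    def __init__(self, refresh_margin_seconds: int = 60):
        # Hypothetical safety margin: treat the token as stale shortly before it expires.
        self.refresh_margin_seconds = refresh_margin_seconds
        self.access_token = None
        self.expires_at = 0.0

    def store(self, token_response: dict, now: Optional[float] = None) -> None:
        """Record the token and compute its expiry from expires_in (seconds)."""
        now = time.time() if now is None else now
        self.access_token = token_response["access_token"]
        self.expires_at = now + token_response["expires_in"]

    def is_fresh(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return (
            self.access_token is not None
            and now < self.expires_at - self.refresh_margin_seconds
        )

    def authorization_header(self) -> dict:
        return {"Authorization": f"Bearer {self.access_token}"}
```

Caching the token for its expires_in lifetime avoids requesting a new one for every call.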

Basic usage

The API offers two approaches depending on your use case: single document extraction and session-based extraction.

warning

Most endpoints require application/json payloads. Document endpoints, however, require multipart/form-data encoding, with the file sent as a form field named document. Requests to document endpoints that use other content types (such as application/json or raw binary) will be rejected.

Single document extraction

For one-off extraction with no need to persist data, use POST /documents/single. Send a document and receive extracted data immediately in the response. No session is required.

curl -X POST https://api.datawollet.com/documents/single \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: multipart/form-data" \
-F "document=@bank-statement.pdf"
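For clients that do not have a multipart helper, the encoding can be built by hand. The sketch below shows one way to produce a multipart/form-data body with the file under the document form field, using only the standard library. The function name encode_document is hypothetical, and the sketch assumes the filename contains no double quotes or line breaks.

```python
import uuid


def encode_document(filename: str, content: bytes, mime_type: str) -> tuple:
    """Encode one file as multipart/form-data under the form field "document".

    Returns (body, content_type): the request body and the matching
    Content-Type header value, including the boundary parameter.
    """
    # A random boundary that is very unlikely to occur inside the file content.
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"
```

The returned content type must be sent as the Content-Type header alongside the Authorization header, because the server needs the boundary parameter to split the body into parts.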

Session-based extraction

For processing multiple documents together, use sessions. This lets you build up knowledge graphs from several documents over time and fuse them into a single unified graph.

Session-based flow

1. Start a session

curl -X POST https://api.datawollet.com/session/start \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{}'

The response includes a sessionId and an expiresAt timestamp:

{
"sessionId": "sess_abc123",
"startedAt": "2025-01-15T10:00:00Z",
"expiresAt": "2025-01-15T11:00:00Z"
}

2. Append documents

Submit one or more documents to the session using POST /documents/append. Each request returns a document-specific knowledge graph immediately in the response, so you can begin working with the extracted data straight away.

curl -X POST https://api.datawollet.com/documents/append \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: multipart/form-data" \
-H "Session-ID: sess_abc123" \
-F "document=@bank-statement.pdf"

Repeat for additional documents (payslips, utility bills, etc.) to build up the session.

3. Query the graph

Use the graph endpoints to retrieve extracted data:

  • GET /graph/sources — list all documents in the session with summary data
  • GET /graph/sources/{requestId} — retrieve the full knowledge graph for a specific document

4. Fuse the graph

Once you've submitted all documents, call POST /graph/fuse to merge every document's knowledge graph into a single, unified graph. The fused graph deduplicates entities across documents and resolves relationships, giving you a consolidated view of the data.

curl -X POST https://api.datawollet.com/graph/fuse \
-H "Authorization: Bearer $TOKEN" \
-H "Session-ID: sess_abc123"

5. Terminate the session

When you're finished, terminate the session to prevent any further access to the data:

curl -X DELETE https://api.datawollet.com/session \
-H "Authorization: Bearer $TOKEN" \
-H "Session-ID: sess_abc123"
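The order of calls after a session has started can be summarised as data. This sketch only describes the requests and sends nothing. The function name plan_session_flow is hypothetical, and it assumes that the graph query endpoint takes the same Session-ID header as the append, fuse and terminate calls shown above, which this guide does not state explicitly.

```python
def plan_session_flow(token: str, session_id: str, filenames: list) -> list:
    """List the calls made after POST /session/start has returned a sessionId."""
    # Assumption: every in-session call carries both headers, including GET /graph/sources.
    headers = {"Authorization": f"Bearer {token}", "Session-ID": session_id}

    # Step 2: one append call per document (multipart/form-data, field "document").
    steps = [
        {"method": "POST", "path": "/documents/append", "headers": headers, "file": name}
        for name in filenames
    ]
    # Step 3: list the documents in the session.
    steps.append({"method": "GET", "path": "/graph/sources", "headers": headers})
    # Step 4: fuse the per-document graphs into one.
    steps.append({"method": "POST", "path": "/graph/fuse", "headers": headers})
    # Step 5: terminate the session.
    steps.append({"method": "DELETE", "path": "/session", "headers": headers})
    return steps
```

Keeping the plan as plain data makes the ordering easy to check before wiring it to a real HTTP client.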

Response structure

Both /documents/single and /documents/append return the same response shape. Here's a trimmed example:

{
"requestId": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
"clientDescription": {
"clientId": "qzfpenwbykaptwqapn",
"tenantId": "your-org",
"subjectId": null
},
"inputs": [
{
"filename": "bank-statement.pdf",
"mimeType": "application/pdf",
"sizeInBytes": 204800,
"checksum": "e3b0c44298fc1c149afbf4c8996fb924..."
}
],
"result": {
"status": {
"extraction": "COMPLETE",
"review": "REVIEW_NOT_REQUIRED",
"expectations": {
"errors": [],
"warnings": [],
"info": []
}
},
"document": {
"classIdentifier": "bank-statement",
"className": "Bank statement",
"classConfidence": 95,
"profilesMatched": [
{
"id": "bank-barclays-v1",
"title": "Barclays bank statement",
"version": "1.0.0"
}
]
},
"text": {
"source": "Embedded",
"content": "BARCLAYS BANK UK PLC..."
},
"nodes": [ ... ]
}
}

Extraction status and expectations

The result.status object tells you whether the extraction completed successfully and whether the results require a review.

  • extraction: always COMPLETE for synchronous endpoints (or an error is returned).
  • review: either REVIEW_NOT_REQUIRED or REVIEW_PENDING. A pending review indicates low classification confidence, fallback profile usage, or unsatisfied required expectations.

info

DataWollet has two pipelines: a fast pipeline based on symbolic AI, which serves the API, and a slow pipeline based on neuro-symbolic AI. The slow pipeline is used to analyse new types of document, when changes are detected, and when content is seen for the first time (e.g. new enumerations to be mapped to DataWollet's ontology). A pending review status indicates that the slow pipeline will look at the document. It does not guarantee that the document type will be added to the library; the document may be excluded as unsuitable.

The expectations object contains three arrays — errors, warnings, and info — each holding issues found during post-extraction validation:

  • Errors indicate unsatisfied required expectations or schema validation failures. These suggest the extracted data may be incomplete or unreliable.
  • Warnings indicate unsatisfied recommended expectations. The data is likely usable but may benefit from review.
  • Info items are advisory and highlight optional expectations that were not met.

Each issue includes a nodeType (e.g. ark:Transaction), an optional nodeId if it relates to a specific node, and a reason describing the issue. For example:

{
"errors": [],
"warnings": [
{
"nodeType": "ark:Transaction",
"reason": "Expected at least one transaction with a balance"
}
],
"info": []
}
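A client will usually turn the status and expectations into a decision. The sketch below shows one possible policy, not one prescribed by the API: the function names triage and describe_issue and the labels "ok", "check" and "reject" are hypothetical.

```python
def describe_issue(issue: dict) -> str:
    """Render one expectation issue as a single line of text."""
    # nodeId is optional, so only include it when present.
    node_id = issue.get("nodeId")
    location = f"{issue['nodeType']} ({node_id})" if node_id else issue["nodeType"]
    return f"{location}: {issue['reason']}"


def triage(response: dict) -> str:
    """Classify an extraction response as "ok", "check" or "reject"."""
    status = response["result"]["status"]
    expectations = status["expectations"]
    # Errors suggest the extracted data may be incomplete or unreliable.
    if expectations["errors"]:
        return "reject"
    # Warnings or a pending review suggest a human should look before use.
    if status["review"] == "REVIEW_PENDING" or expectations["warnings"]:
        return "check"
    return "ok"
```

Applied to a response carrying the warning shown above and no errors, this policy returns "check". Info items are ignored because they are advisory.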

Session expiry and data retention

Sessions have an expiry time (one hour by default, configurable up to seven days). Once a session expires, personal and sensitive data associated with it is irretrievable by design.

If you need more time, call POST /session/resume before the session expires to extend it:

curl -X POST https://api.datawollet.com/session/resume \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{ "sessionId": "sess_abc123" }'
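Because the resume call must be made before expiry, a client needs to know when a session is close to expiring. The following sketch compares the expiresAt timestamp from the session start response with the current time. The helper names and the five-minute margin are hypothetical choices, not API behaviour.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as "2025-01-15T11:00:00Z" into an aware datetime."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def should_resume(
    expires_at: str,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True when the session is still active but within `margin` of expiring."""
    now = now or datetime.now(timezone.utc)
    expiry = parse_timestamp(expires_at)
    # An already expired session cannot be resumed, so it returns False here.
    return now < expiry and expiry - now <= margin
```

A long-running job could call should_resume before each append and issue POST /session/resume when it returns True.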

For use cases where data must be retained beyond the session lifecycle — for example, associating documents with a case file or application that requires data to be appended and retrieved over several months — the right approach is to associate the session with an envelope. Envelopes are protected by an additional key that DataWollet does not retain, meaning the API cannot read from or write to an envelope when there is not an active session connected to it.

info

Envelopes are currently available as a preview. They allow a session to be created using data from a previous session, by using shared backing storage. They can also be used to share and transfer data between brokers and their clients, for example through a single case-specific envelope shared between the broker and their customer(s).

Understanding knowledge graphs

Extraction results are returned as a knowledge graph: an array of typed nodes, each with a unique IRI as its id. Nodes reference each other by IRI to express relationships between entities.

warning

Node IDs within a document response are scoped only to that document and are marked as draft IDs. Persistent identifiers are assigned when a fused graph is created. Depending on the client configuration, nodes from multiple sources may be collapsed together during graph fusion, and new nodes may be generated (e.g. a credit card account identified from bank transactions).

For example, extracting a bank statement might produce nodes like:

[
{
"id": "https://graph.datawollet.com/.draft-nodes/a1b2c3d4e5f6",
"type": "ark:Organisation",
"name": "Barclays Bank UK PLC"
},
{
"id": "https://graph.datawollet.com/.draft-nodes/f6e5d4c3b2a1",
"type": "ark:BankAccount",
"accountType": "Current",
"issuedBy": "https://graph.datawollet.com/.draft-nodes/a1b2c3d4e5f6"
},
{
"id": "https://graph.datawollet.com/.draft-nodes/1a2b3c4d5e6f",
"type": "ark:Statement",
"relatedTo": "https://graph.datawollet.com/.draft-nodes/f6e5d4c3b2a1",
"issuedBy": "https://graph.datawollet.com/.draft-nodes/a1b2c3d4e5f6",
"startDate": "2025-01-01",
"endDate": "2025-01-31"
},
{
"id": "https://graph.datawollet.com/.draft-nodes/6f5e4d3c2b1a",
"type": "ark:Transaction",
"sequence": "20250115001000000001",
"date": "2025-01-15",
"relatedTo": "https://graph.datawollet.com/.draft-nodes/f6e5d4c3b2a1",
"fullDescriptor": "B JONES (Faster Payments) Reference: RENT JAN",
"description": "B JONES (Faster Payments)",
"reference": "RENT JAN",
"methodCategory": "TRANSFER:FASTER",
"amount": {
"type": "ark:CurrencyValue",
"currency": "GBP",
"amount": "750.00",
"direction": "Debit"
}
}
]

Notice how nodes link to each other using their id IRIs — the ark:BankAccount references the ark:Organisation via issuedBy, and both the ark:Statement and ark:Transaction reference the account via relatedTo.
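Following these links in client code amounts to a lookup by id. The sketch below indexes the nodes array and resolves a reference property to its target node. The helper names are hypothetical, and the sign convention in signed_amount (debits negative) is an assumption made for illustration, not something the API defines.

```python
from decimal import Decimal
from typing import Optional


def index_nodes(nodes: list) -> dict:
    """Map each node's id (an IRI) to the node itself."""
    return {node["id"]: node for node in nodes}


def resolve(node: dict, prop: str, index: dict) -> Optional[dict]:
    """Follow a reference property such as issuedBy or relatedTo to its target node."""
    target = node.get(prop)
    # Only string values can be IRIs; nested objects such as amount are not references.
    return index.get(target) if isinstance(target, str) else None


def signed_amount(transaction: dict) -> Decimal:
    """Return the transaction amount as a Decimal, negative for debits."""
    # Assumption: any direction other than "Debit" is treated as money coming in.
    value = Decimal(transaction["amount"]["amount"])
    return -value if transaction["amount"]["direction"] == "Debit" else value
```

With the example above, resolving relatedTo on the ark:Transaction node yields the ark:BankAccount node, and resolving issuedBy on that account yields the ark:Organisation node. Since draft IDs are scoped to a single document response, an index built this way should not be reused across responses.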

Each node's type corresponds to a schema in the DataWollet data model (e.g. ark:Transaction, ark:Person, ark:BankAccount, ark:Employment). The full list of node types and their properties is available in the Schemas section of the API reference.