Getting started
This guide provides a brief introduction to authenticating with and using the API for document data extraction. Feel free to get in touch with us at support@datawollet.com if you have any questions.
Base URL
| Environment | URL |
|---|---|
| Production | https://api.datawollet.com |
| Sandbox | https://api.sandbox.datawollet.com |
The sandbox environment supports a subset of documents and is intended for development and testing. All examples in this guide use the production URL.
Documents that are not representative of real documents (in their logos or layout, or because content is heavily redacted) are unlikely to work, owing to the nature of the neuro-symbolic AI used by DataWollet. Realistic examples with anonymised and substituted data (e.g. John Doe naming) are available on request.
Supported documents
The API accepts PDF, JPEG, and PNG files up to 4.5 MB. Supported document types include:
- Bank statements
- Utility bills (energy, water, broadband)
- Council tax bills
- Payslips
- Mortgage statements and illustrations
- Insurance proposals and schedules
- Identity documents (passports, driving licences)
For PDFs with embedded text, that text is used directly. Scanned images and image-only PDFs are processed with OCR, which requires the documents:ocr scope.
DataWollet automatically identifies pages in PDFs without embedded text content and will attempt to identify and correct for page rotation and skew. Scans with significant or varying skew, blurred content, and misaligned items are likely to produce poor results.
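The format and size limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the DataWollet API: the helper names are hypothetical, and it assumes that "4.5 MB" means 4.5 × 1024 × 1024 bytes, which this guide does not specify.

```python
# Assumption: "4.5 MB" is read as binary megabytes; the guide does not say
# whether decimal or binary units are meant.
MAX_UPLOAD_BYTES = int(4.5 * 1024 * 1024)

# Leading bytes (file signatures) of the three accepted formats.
SIGNATURES = {
    "application/pdf": b"%PDF-",
    "image/jpeg": b"\xff\xd8\xff",
    "image/png": b"\x89PNG\r\n\x1a\n",
}


def detect_mime_type(data: bytes):
    """Return the MIME type implied by the leading bytes, or None."""
    for mime_type, signature in SIGNATURES.items():
        if data.startswith(signature):
            return mime_type
    return None


def check_upload(data: bytes) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if len(data) > MAX_UPLOAD_BYTES:
        problems.append(f"file is {len(data)} bytes; limit is {MAX_UPLOAD_BYTES}")
    if detect_mime_type(data) is None:
        problems.append("file is not a PDF, JPEG, or PNG")
    return problems
```

A check like this only catches obvious mistakes early. It says nothing about scan quality, skew, or whether the document type is supported.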
Authentication
Obtain an access token using the OAuth2 client credentials grant. You'll need your client_id and client_secret, provided by DataWollet during onboarding.
curl -X POST https://api.datawollet.com/oauth/token \
-H "Content-Type: application/json" \
-d '{
"grant_type": "client_credentials",
"client_id": "your_client_id",
"client_secret": "your_client_secret"
}'
The response includes a Bearer token:
{
"access_token": "eyJhbGciOiJS...",
"token_type": "Bearer",
"expires_in": 3600,
"scope": "documents:initiate documents:write sessions:read sessions:write"
}
Include this token in subsequent requests:
Authorization: Bearer eyJhbGciOiJS...
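Because the token response carries expires_in and scope, a client will usually want to remember when the token expires and which scopes were granted. The sketch below shows one way to do that; the AccessToken class, its method names, and the 30-second leeway are illustrative assumptions rather than anything defined by the API.

```python
import json
from dataclasses import dataclass


@dataclass
class AccessToken:
    value: str
    token_type: str
    expires_at: float  # seconds since the epoch, on the local clock
    scopes: frozenset

    def is_expired(self, now: float, leeway: float = 30.0) -> bool:
        """True once the token is within `leeway` seconds of expiry."""
        return now >= self.expires_at - leeway

    def authorization_header(self) -> dict:
        """Header to attach to subsequent requests."""
        return {"Authorization": f"{self.token_type} {self.value}"}


def parse_token_response(body: str, received_at: float) -> AccessToken:
    """Build an AccessToken from the JSON body returned by /oauth/token."""
    payload = json.loads(body)
    return AccessToken(
        value=payload["access_token"],
        token_type=payload["token_type"],
        expires_at=received_at + payload["expires_in"],
        scopes=frozenset(payload.get("scope", "").split()),
    )
```

Checking `"documents:ocr" in token.scopes` before submitting a scanned image is one use of the parsed scope set.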
Basic usage
The API offers two approaches depending on your use case: single document extraction and session-based extraction.
Most endpoints require application/json payloads. Document endpoints are the exception: they require multipart/form-data encoding, and the file must be sent as a form field named document.
Document endpoints reject requests that use any other content type (such as application/json or raw binary).
Single document extraction
For one-off extraction with no need to persist data, use POST /documents/single. Send a document and receive extracted data immediately in the response. No session is required.
curl -X POST https://api.datawollet.com/documents/single \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: multipart/form-data" \
-F "document=@bank-statement.pdf"
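For clients that do not use curl, the multipart/form-data body can be built by hand. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes the filename contains no double quotes or line breaks.

```python
import uuid


def encode_document_upload(filename: str, data: bytes, mime_type: str) -> tuple:
    """Encode one file as multipart/form-data under the form field `document`.

    Returns (content_type_header_value, body). The header value includes the
    boundary, which the server needs in order to parse the body.
    Assumption: `filename` contains no double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", head + data + tail
```

The returned header value goes in Content-Type and the bytes form the request body for POST /documents/single, alongside the Authorization header.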
Session-based extraction
For processing multiple documents together, use sessions. A session lets you build up knowledge graphs from several documents over time and fuse them into a single unified graph.
Session-based flow
1. Start a session
curl -X POST https://api.datawollet.com/session/start \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
The response includes a sessionId and an expiresAt timestamp:
{
"sessionId": "sess_abc123",
"startedAt": "2025-01-15T10:00:00Z",
"expiresAt": "2025-01-15T11:00:00Z"
}
2. Append documents
Submit one or more documents to the session using POST /documents/append. Each request returns a document-specific knowledge graph immediately in the response, so you can begin working with the extracted data straight away.
curl -X POST https://api.datawollet.com/documents/append \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: multipart/form-data" \
-H "Session-ID: sess_abc123" \
-F "document=@bank-statement.pdf"
Repeat for additional documents (payslips, utility bills, etc.) to build up the session.
3. Query the graph
Use the graph endpoints to retrieve extracted data:
- GET /graph/sources: list all documents in the session with summary data
- GET /graph/sources/{requestId}: retrieve the full knowledge graph for a specific document
4. Fuse the graph
Once you've submitted all documents, call POST /graph/fuse to merge every document's knowledge graph into a single, unified graph. The fused graph deduplicates entities across documents and resolves relationships, giving you a consolidated view of the data.
curl -X POST https://api.datawollet.com/graph/fuse \
-H "Authorization: Bearer $TOKEN" \
-H "Session-ID: sess_abc123"
5. Terminate the session
When you're finished, you can terminate the session, which prevents any further access to the data:
curl -X DELETE https://api.datawollet.com/session \
-H "Authorization: Bearer $TOKEN" \
-H "Session-ID: sess_abc123"
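The order of calls in steps 2 to 5 can be sketched as a list of planned requests. The code below only describes the requests and does not send them. PlannedRequest and both helpers are hypothetical names, and sending the Session-ID header on GET /graph/sources is an assumption, since this guide shows that header only for the append, fuse, and terminate calls.

```python
from typing import NamedTuple

BASE_URL = "https://api.datawollet.com"


class PlannedRequest(NamedTuple):
    method: str
    url: str
    headers: dict


def session_headers(token: str, session_id: str) -> dict:
    """Headers shared by every request made within a session."""
    return {"Authorization": f"Bearer {token}", "Session-ID": session_id}


def plan_session_requests(token: str, session_id: str, document_count: int) -> list:
    """List the requests for steps 2 to 5, in order, for one session."""
    headers = session_headers(token, session_id)
    # Step 2: one append call per document (multipart body not shown here).
    plan = [
        PlannedRequest("POST", f"{BASE_URL}/documents/append", headers)
        for _ in range(document_count)
    ]
    # Step 3: list sources. Assumption: this call also takes Session-ID.
    plan.append(PlannedRequest("GET", f"{BASE_URL}/graph/sources", headers))
    # Step 4: fuse the per-document graphs.
    plan.append(PlannedRequest("POST", f"{BASE_URL}/graph/fuse", headers))
    # Step 5: terminate the session.
    plan.append(PlannedRequest("DELETE", f"{BASE_URL}/session", headers))
    return plan
```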
Response structure
Both /documents/single and /documents/append return the same response shape. Here's a trimmed example:
{
"requestId": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
"clientDescription": {
"clientId": "qzfpenwbykaptwqapn",
"tenantId": "your-org",
"subjectId": null
},
"inputs": [
{
"filename": "bank-statement.pdf",
"mimeType": "application/pdf",
"sizeInBytes": 204800,
"checksum": "e3b0c44298fc1c149afbf4c8996fb924..."
}
],
"result": {
"status": {
"extraction": "COMPLETE",
"review": "REVIEW_NOT_REQUIRED",
"expectations": {
"errors": [],
"warnings": [],
"info": []
}
},
"document": {
"classIdentifier": "bank-statement",
"className": "Bank statement",
"classConfidence": 95,
"profilesMatched": [
{
"id": "bank-barclays-v1",
"title": "Barclays bank statement",
"version": "1.0.0"
}
]
},
"text": {
"source": "Embedded",
"content": "BARCLAYS BANK UK PLC..."
},
"nodes": [ ... ]
}
}
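A response of this shape can be reduced to the handful of fields most callers log or branch on. The sketch below assumes the response has already been parsed into a Python dict; summarise_extraction is a hypothetical helper, and the keys of the summary it returns are illustrative.

```python
def summarise_extraction(response: dict) -> dict:
    """Pick out the classification and size of an extraction response."""
    result = response["result"]
    document = result["document"]
    return {
        "requestId": response["requestId"],
        "class": document["classIdentifier"],
        "confidence": document["classConfidence"],
        "profiles": [p["id"] for p in document.get("profilesMatched", [])],
        "textSource": result["text"]["source"],
        "nodeCount": len(result.get("nodes", [])),
    }
```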
Extraction status and expectations
The result.status object tells you whether the extraction completed successfully and whether the results require a review.
- extraction: always COMPLETE for synchronous endpoints (otherwise an error is returned).
- review: either REVIEW_NOT_REQUIRED or REVIEW_PENDING. A pending review indicates low classification confidence, fallback profile usage, or unsatisfied required expectations.
DataWollet has two pipelines: a fast pipeline based on symbolic AI, which serves the API, and a slow pipeline based on neuro-symbolic AI. The slow pipeline is used to analyse new types of document, when changes are detected, and when content is seen for the first time (e.g. new enumerations to be mapped to DataWollet's ontology). A pending review status indicates that the slow pipeline will look at the document. It does not guarantee that the document type will be added to the library, as it may be excluded as unsuitable.
The expectations object contains three arrays — errors, warnings, and info — each holding issues found during post-extraction validation:
- Errors indicate unsatisfied required expectations or schema validation failures. These suggest the extracted data may be incomplete or unreliable.
- Warnings indicate unsatisfied recommended expectations. The data is likely usable but may benefit from review.
- Info items are advisory and highlight optional expectations that were not met.
Each issue includes a nodeType (e.g. ark:Transaction), an optional nodeId if it relates to a specific node, and a reason describing the issue. For example:
{
"errors": [],
"warnings": [
{
"nodeType": "ark:Transaction",
"reason": "Expected at least one transaction with a balance"
}
],
"info": []
}
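One way to act on the status object is a small triage function. The policy in this sketch is an illustrative assumption and not a DataWollet recommendation: any error rejects the result, a pending review or any warning asks for a manual check, and everything else is accepted. Both function names are hypothetical.

```python
# Singular labels for the three expectation arrays.
LEVEL_LABELS = {"errors": "error", "warnings": "warning", "info": "info"}


def triage(result: dict) -> str:
    """Classify an extraction result as 'reject', 'review', or 'accept'."""
    status = result["status"]
    expectations = status.get("expectations", {})
    if expectations.get("errors"):
        return "reject"
    if status.get("review") == "REVIEW_PENDING" or expectations.get("warnings"):
        return "review"
    return "accept"


def describe_issues(expectations: dict) -> list:
    """Flatten errors, warnings, and info into one-line messages."""
    lines = []
    for level, label in LEVEL_LABELS.items():
        for issue in expectations.get(level, []):
            # nodeId is optional, so fall back to the node type.
            target = issue.get("nodeId") or issue["nodeType"]
            lines.append(f"{label}: {target}: {issue['reason']}")
    return lines
```

Applied to the example above, `triage` returns "review" because of the single warning, even though the review field itself would typically be REVIEW_NOT_REQUIRED in that case.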
Session expiry and data retention
Sessions have an expiry time (one hour by default, configurable up to seven days). Once a session expires, personal and sensitive data associated with it is irretrievable by design.
If you need more time, call POST /session/resume before the session expires to extend it:
curl -X POST https://api.datawollet.com/session/resume \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{ "sessionId": "sess_abc123" }'
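Deciding when to call POST /session/resume comes down to comparing the session's expiresAt timestamp with the current time. The sketch below assumes timestamps are always UTC with a trailing Z, as in the examples in this guide; the helper names and the five-minute margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2025-01-15T11:00:00Z."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
    # so swap it for an explicit offset to support older versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def should_resume(
    expires_at: str,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """True when the session is still live but within `margin` of expiring."""
    now = now or datetime.now(timezone.utc)
    deadline = parse_timestamp(expires_at)
    return deadline - margin <= now < deadline
```

Once the deadline has passed the function returns False, since an expired session can no longer be extended.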
For use cases where data must be retained beyond the session lifecycle — for example, associating documents with a case file or application that requires data to be appended and retrieved over several months — the right approach is to associate the session with an envelope. Envelopes are protected by an additional key that DataWollet does not retain, meaning the API cannot read from or write to an envelope when there is not an active session connected to it.
Envelopes are currently available as a preview. They allow a session to be created using data from a previous session by means of shared backing storage. They can also be used to share and transfer data between brokers and their clients, for example through a single case-specific envelope shared between a broker and their customer(s).
Understanding knowledge graphs
Extraction results are returned as a knowledge graph: an array of typed nodes, each with a unique IRI as its id. Nodes reference each other by IRI to express relationships between entities.
Node IDs within a document response are scoped to that document only and are marked as draft IDs. Persistent identifiers are assigned when a fused graph is created. Depending on the client configuration, graph fusion may collapse nodes from multiple sources together and may generate new nodes (e.g. a credit card account identified from bank transactions).
For example, extracting a bank statement might produce nodes like:
[
{
"id": "https://graph.datawollet.com/.draft-nodes/a1b2c3d4e5f6",
"type": "ark:Organisation",
"name": "Barclays Bank UK PLC"
},
{
"id": "https://graph.datawollet.com/.draft-nodes/f6e5d4c3b2a1",
"type": "ark:BankAccount",
"accountType": "Current",
"issuedBy": "https://graph.datawollet.com/.draft-nodes/a1b2c3d4e5f6"
},
{
"id": "https://graph.datawollet.com/.draft-nodes/1a2b3c4d5e6f",
"type": "ark:Statement",
"relatedTo": "https://graph.datawollet.com/.draft-nodes/f6e5d4c3b2a1",
"issuedBy": "https://graph.datawollet.com/.draft-nodes/a1b2c3d4e5f6",
"startDate": "2025-01-01",
"endDate": "2025-01-31"
},
{
"id": "https://graph.datawollet.com/.draft-nodes/6f5e4d3c2b1a",
"type": "ark:Transaction",
"sequence": "20250115001000000001",
"date": "2025-01-15",
"relatedTo": "https://graph.datawollet.com/.draft-nodes/f6e5d4c3b2a1",
"fullDescriptor": "B JONES (Faster Payments) Reference: RENT JAN",
"description": "B JONES (Faster Payments)",
"reference": "RENT JAN",
"methodCategory": "TRANSFER:FASTER",
"amount": {
"type": "ark:CurrencyValue",
"currency": "GBP",
"amount": "750.00",
"direction": "Debit"
}
}
]
Notice how nodes link to each other using their id IRIs — the ark:BankAccount references the ark:Organisation via issuedBy, and both the ark:Statement and ark:Transaction reference the account via relatedTo.
Each node's type corresponds to a schema in the DataWollet data model (e.g. ark:Transaction, ark:Person, ark:BankAccount, ark:Employment). The full list of node types and their properties is available in the Schemas section of the API reference.
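Following IRI references between nodes is a dictionary lookup once the nodes are indexed by id. The sketch below works on the node array from the bank statement example; the function names are hypothetical, and total_debits assumes all transactions on the account share one currency.

```python
from decimal import Decimal


def index_nodes(nodes: list) -> dict:
    """Map each node's IRI to the node so references can be followed."""
    return {node["id"]: node for node in nodes}


def resolve(index: dict, node: dict, prop: str):
    """Follow an IRI-valued property to the node it points at, or None."""
    return index.get(node.get(prop))


def total_debits(nodes: list, account_id: str) -> Decimal:
    """Sum the debit transactions related to one account.

    Assumption: every transaction on the account uses the same currency.
    """
    total = Decimal("0")
    for node in nodes:
        if node["type"] != "ark:Transaction" or node.get("relatedTo") != account_id:
            continue
        amount = node["amount"]
        if amount["direction"] == "Debit":
            # Amounts arrive as strings, so Decimal avoids float rounding.
            total += Decimal(amount["amount"])
    return total
```

For instance, `resolve(index, account, "issuedBy")` returns the ark:Organisation node for an ark:BankAccount node. Remember that draft IDs are only meaningful within the document response they came from.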