Network Errors When Downloading Models Despite Egress Settings

Hi,

Because Foundry's built-in document parsing capabilities are limited, I am exploring the use of an external library called Docling.

According to the Docling documentation, it uses EasyOCR for OCR tasks and SmolDocling as the VLM when necessary. However, I consistently encounter network errors when the library attempts to download models, even though we have configured egress for the following sites:

  • objects.githubusercontent.com
  • huggingface.co
  • cdn-lfs-us-1.hf.co
  • github.com

I have also tried pre-downloading the required models, but they do not seem to be recognized or installed in the appropriate folders.
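
One way to narrow down the pre-download problem is to check, from inside the transform, whether the weight files are actually present in the directory EasyOCR reads from. The sketch below is only an illustration: the helper name `missing_model_files` is hypothetical, and the file names and the `~/.EasyOCR/model` location are assumptions based on EasyOCR's usual defaults, so they should be verified against the installed version.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed EasyOCR defaults (not confirmed for this environment):
# the default detector weights and the Latin-script recognizer weights.
ASSUMED_MODEL_DIR = Path.home() / ".EasyOCR" / "model"
ASSUMED_MODEL_FILES = ("craft_mlt_25k.pth", "latin_g2.pth")


def missing_model_files(
    model_dir: Path, expected: Iterable[str] = ASSUMED_MODEL_FILES
) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    return [name for name in expected if not (Path(model_dir) / name).is_file()]
```

Calling `missing_model_files(ASSUMED_MODEL_DIR)` inside the transform and logging the result shows whether the pre-downloaded weights ended up where the library looks. As far as I know, Docling's `EasyOcrOptions` exposes `model_storage_directory` and `download_enabled` fields that can point EasyOCR at a different directory and disable the download attempt, but please check this against the Docling version in use.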

Has anyone experienced a similar issue or found a solution to this problem? Any advice or guidance would be greatly appreciated.

Thank you in advance for your help!

code

from docling.datamodel.document import ConversionResult
from transforms.api import transform, TransformContext, Input, TransformInput, Output, TransformOutput, configure
from transforms.external.systems import external_systems, Source, ResolvedSource
import pandas as pd
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import DocumentStream
from io import BytesIO

@external_systems(
    github=Source("ri.magritte..source.f17cedfa-0ad9-4e68-9482-55cd375459fb")
)
@transform(
    input_docs=Input("ri.foundry.main.dataset.6d8591d4-33f8-4b52-9c59-5497b263d14e"),
    output=Output("ri.foundry.main.dataset.108ba082-f7c0-4f09-bf80-a62a84f1ed96")
)
def compute(
    ctx: TransformContext,
    github: ResolvedSource,
    input_docs: TransformInput,
    output: TransformOutput
):
    input_filesystem = input_docs.filesystem()
    file_list = input_filesystem.files("**/*")
    
    # Build the converter once: each new DocumentConverter re-initializes
    # its pipeline (including the OCR model) on first use.
    converter = DocumentConverter()

    results = []
    for file_info in file_list.collect():
        pdf_path = file_info.path
        with input_filesystem.open(pdf_path, "rb") as f:
            buf = BytesIO(f.read())
            source = DocumentStream(name=pdf_path, stream=buf)
            result: ConversionResult = converter.convert(source)
            processed_text = result.document.export_to_markdown()
            results.append({
                "file_path": pdf_path,
                "extracted_text": processed_text
            })
    
    df = pd.DataFrame(results)
    spark_df = ctx.spark_session.createDataFrame(df)
    output.write_dataframe(spark_df)

Error in AIP

The error you're encountering is related to a URLError caused by the message "Name or service not known". This generally indicates that the domain name being accessed cannot be resolved, possibly due to a DNS issue or the URL being incorrect.

Here's the summary and suggestion for fixing this issue:

Summary
The error arises from the attempt to access a URL, likely for downloading a model or another resource via EasyOCR, as indicated in the traceback at easyocr/easyocr.py, line 253. This URL cannot be resolved, which might be due to an incorrect domain, network configuration issue, or missing network connectivity within the environment.

Suggestion
  • Check the URL: Verify that the URL being used in the code is correct. This can usually be found in the easyocr.py file, specifically around where the download_and_unzip function is being called.
  • Network Configuration: Ensure that the necessary network access is granted if running in a restricted or isolated network environment.
  • DNS Settings: Confirm that DNS settings are correctly configured, allowing the server to resolve internet domain names.
  • Local Testing: If possible, try accessing the URL from a local environment to see if a similar issue arises, which will help determine if the problem is specific to the server's network configuration.
If these steps do not resolve the issue or you cannot identify the problematic configuration, it would be best to consult with Palantir support to further investigate and resolve this server/environment-specific issue.
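
The DNS check suggested above can be sketched with the standard library alone. This is a minimal illustration, not a Foundry API: the helper name `check_dns` and the injectable `resolver` parameter are hypothetical, and the host list simply repeats the egress entries from the original post. To be meaningful it has to run inside the same transform, with the same external source attached.

```python
import socket
from typing import Callable, Dict, Iterable

# Hosts copied from the egress configuration described in the post.
ALLOWLISTED_HOSTS = [
    "objects.githubusercontent.com",
    "huggingface.co",
    "cdn-lfs-us-1.hf.co",
    "github.com",
]


def check_dns(
    hosts: Iterable[str], resolver: Callable = socket.getaddrinfo
) -> Dict[str, bool]:
    """Map each hostname to True if it resolves, False otherwise."""
    results = {}
    for host in hosts:
        try:
            resolver(host, 443)
            results[host] = True
        except OSError:
            # socket.gaierror (the error in the summary) is a subclass of OSError.
            results[host] = False
    return results
```

If every allowlisted host resolves, the name that fails is probably one that is not on the list at all, for example a host reached through a redirect.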

Can you share the raw error trace rather than the AIP summary? The summary doesn’t contain the missing hostname.


Thank you for your reply.
Here is the error message, which does not include the specific URL.

The Docling documentation says that the EasyOCR models are downloaded from GitHub.

[module version: 3.44.0]

transforms.external.systems._redact_credentials_in_output.URLError: <urlopen error [Errno -2] Name or service not known>

Traceback (most recent call last):
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 1344, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1338, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1384, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1333, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1093, in _send_output
    self.send(msg)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1037, in send
    self.connect()
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1472, in connect
    super().connect()
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/http/client.py", line 1003, in connect
    self.sock = self._create_connection(
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/socket.py", line 841, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/socket.py", line 978, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transforms.external.systems._redact_credentials_in_output.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/_build.py", line 331, in run
    self._transform.compute(**kwargs, **parameters)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/api/_transform.py", line 318, in compute
    self(**kwargs)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/api/_transform.py", line 226, in __call__
    return self._compute_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/transforms/external/systems/_redact_credentials_in_output.py", line 24, in wrapper
    raise _redact_exception_chain(e, secrets)
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/transforms/external/systems/_redact_credentials_in_output.py", line 21, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/spark-transform.py", line 34, in compute
    result: ConversionResult = converter.convert(source)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/document_converter.py", line 237, in convert
    return next(all_res)
           ^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/document_converter.py", line 260, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/document_converter.py", line 295, in _convert
    for item in map(
                ^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/document_converter.py", line 342, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/document_converter.py", line 363, in _execute_pipeline
    pipeline = self._get_pipeline(in_doc.format)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/document_converter.py", line 325, in _get_pipeline
    self.initialized_pipelines[cache_key] = pipeline_class(
                                            ^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/pipeline/standard_pdf_pipeline.py", line 66, in __init__
    ocr_model = self.get_ocr_model(artifacts_path=artifacts_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/pipeline/standard_pdf_pipeline.py", line 154, in get_ocr_model
    return factory.create_instance(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/models/factories/base_factory.py", line 57, in create_instance
    return _cls(options=options, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/docling/models/easyocr_model.py", line 81, in __init__
    self.reader = easyocr.Reader(
                  ^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/easyocr/easyocr.py", line 92, in __init__
    detector_path = self.getDetectorPath(detect_network)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/easyocr/easyocr.py", line 253, in getDetectorPath
    download_and_unzip(self.detection_models[self.detect_network]['url'], self.detection_models[self.detect_network]['filename'], self.model_storage_directory, self.verbose)
  File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/easyocr/utils.py", line 628, in download_and_unzip
    urlretrieve(url, zip_path, reporthook=reporthook)
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 553, in error
    result = self._call_chain(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 745, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 515, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 532, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 1392, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/python/urllib/request.py", line 1347, in do_open
    raise URLError(err)
transforms.external.systems._redact_credentials_in_output.URLError: <urlopen error [Errno -2] Name or service not known>
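
The trace passes through `http_error_302` before the lookup fails, which suggests that the initial request succeeded and the unresolved name belongs to the redirect target rather than to the URL EasyOCR requested. A sketch for recording redirect hosts with the standard library follows. The names `RedirectRecorder` and `redirect_hosts` are hypothetical, and the actual redirect target has to be observed (in the transform, or from a machine with open network access) rather than assumed.

```python
import urllib.request
from typing import List
from urllib.parse import urlsplit


class RedirectRecorder(urllib.request.HTTPRedirectHandler):
    """Record the hostname of every redirect target before following it."""

    def __init__(self) -> None:
        self.hosts: List[str] = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Called before the redirected request is opened, so the host is
        # recorded even if its DNS lookup fails afterwards.
        self.hosts.append(urlsplit(newurl).hostname)
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def redirect_hosts(url: str, timeout: float = 10.0) -> List[str]:
    """Open url and return the hostnames it redirected to, in order."""
    recorder = RedirectRecorder()
    opener = urllib.request.build_opener(recorder)
    try:
        with opener.open(url, timeout=timeout):
            pass
    except OSError:
        # URLError is a subclass of OSError; the hosts seen so far are kept.
        pass
    return recorder.hosts
```

Passing the model URL that EasyOCR requests to `redirect_hosts` and logging the result should reveal the hostname that is missing from the egress policy; the last entry is the most likely candidate.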