What are the correct versions of transformers and vllm?

#8
by WEISHU - opened

I use transformers==4.57.1.dev0, and OmniDocBench v1.5 only scores 89.
In think mode the answer looks like:
'''

[[, , , ]]
paragraph_title
Section title introducing the concept of a component
[[, , , ]]
text
Reference to OMG OMG01 specification defining a component
[[, , , ]]
text
Definition of a component as a modular, deployable, replaceable part encapsulating implementation and exposing interfaces
[[, , , ]]
text
Object-oriented view stating a component contains collaborating classes
[[, , , ]]
text
A component includes processing logic, internal data structures, and an interface for invocation and data passing.

• What is a Component?

• OMG Unified Modeling Language Specification [OMG01] defines a component as

“… a modular, deployable, and replaceable part of a system that encapsulates implementation and exposes a set of interfaces.”

• OO view: a component contains a set of collaborating classes

• Conventional view: a component contains processing logic, the internal data structures that are required to implement the processing logic, and an interface that enables the component to be invoked and data to be passed to it.
'''

BAIDU org

@WEISHU We use a thousand-based coordinate format to represent grounding coordinates, such as <COORD_001> and <COORD_999>. However, in the current tokenizer, these coordinate tokens are defined as special tokens, and vLLM sets skip_special_tokens=true by default. As a result, the coordinate tokens are automatically skipped in the generated output.
As a temporary workaround, you can explicitly set skip_special_tokens=false. We expect to release a fix soon that will convert the coordinate tokens into regular tokens.
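Once the coordinate tokens are preserved, decoding still needs a post-processing step to turn the thousand-based <COORD_nnn> tokens back into pixel coordinates. A minimal sketch, assuming the tokens are normalized to a 0–999 grid and alternate x, y, x, y within each [[x1, y1, x2, y2]] box (the exact token layout is an assumption based on the description above, not confirmed in this thread):

```python
import re

def coords_to_pixels(text, width, height):
    """Convert thousand-based <COORD_nnn> tokens to pixel values.

    Assumes values are normalized to a 0-999 grid, so <COORD_500> maps to
    50% of the image width/height, and tokens alternate x, y, x, y within
    each [[x1, y1, x2, y2]] box.
    """
    tokens = [int(m) for m in re.findall(r"<COORD_(\d{3})>", text)]
    boxes = []
    for i in range(0, len(tokens) - 3, 4):
        x1, y1, x2, y2 = tokens[i:i + 4]
        boxes.append([
            round(x1 / 1000 * width),
            round(y1 / 1000 * height),
            round(x2 / 1000 * width),
            round(y2 / 1000 * height),
        ])
    return boxes

# Example: one grounding box on a 1000x2000 page
out = "[[<COORD_100>, <COORD_250>, <COORD_900>, <COORD_750>]]"
print(coords_to_pixels(out, 1000, 2000))  # [[100, 500, 900, 1500]]
```

With skip_special_tokens=true, the tokens are dropped before decoding, which is exactly why the output above degenerates to empty [[, , , ]] boxes.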

Thanks! Using the vLLM version and the prompt from the README, I get an 88.4 score on OmniDocBench v1.5. Is any follow-up processing missing?

@WEISHU Please refer to https://github.com/baidubce/qianfan-models-cookbook/blob/main/qianfan-ocr/document_parsing.ipynb, set min_dynamic_patch=8 and max_dynamic_patch=24, and use the prompt below:

PROMPT = """You are an AI assistant specialized in converting document images (one or multiple pages extracted from a PDF) into Markdown with high fidelity.

Your task is to accurately convert all visible content from the images into Markdown, strictly following the rules below. Do not add explanations, comments, or inferred content.

1. Pages:
- The input may contain one or multiple page images.
- Preserve the exact page order as provided.
- If there are multiple pages, separate pages using the marker:
  --- Page N ---
  (N starts from 1)
- If there is only one page, do NOT output any page separator.

2. Text Recognition:
- Accurately convert all visible text.
- No guessing, inference, paraphrasing, or correction.
- Preserve the original document structure, including headings, paragraphs, lists, captions, and footnotes.
- Completely REMOVE all header and footer text. Do not output page numbers, running titles, or repeated marginal content.

3. Reading Order:
- Follow a top-to-bottom, left-to-right reading order.
- For multi-column layouts, fully read the left column before the right column.
- Do not reorder content for semantic or logical clarity.

4. Mathematical Formulas:
- Convert all mathematical expressions to LaTeX.
- Inline formulas must use $...$.
- Display (block) formulas must use:
  $$
  ...
  $$
- Preserve symbols, spacing, and structure exactly.
- Do not invent, simplify, normalize, or correct formulas.

5. Tables:
- Convert all tables to HTML format.
- Wrap each table with <table> and </table>.
- Preserve row and column order, merged cells (rowspan, colspan), and empty cells.
- Do not restructure or reinterpret tables.

6. Images:
- Do NOT describe image content.
- Preserve images using the exact format:
  ![label](<box>[[x1, y1, x2, y2]]</box>)
- Allowed labels: image, chart, seal.
- Completely REMOVE all header_image and footer_image elements.
- Do not introduce new labels.
- Do not remove or merge remaining image elements.

7. Unreadable or Missing Content:
- If text, symbols, or table cells are unreadable, preserve their position and leave the content empty.
- Do not guess or fill in missing information.

8. Output Requirements:
- Output Markdown only.
- Preserve original layout, spacing, and structure as closely as possible.
- Ensure clear separation between elements using line breaks.
- Do not include any explanations, metadata, or comments."""
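The prompt above and the skip_special_tokens workaround from earlier in this thread can be combined in a single OpenAI-compatible request body for the vLLM server. This is only a sketch: the model name and image URL are placeholders, and skip_special_tokens is a vLLM-specific extension (sent via extra_body when using the openai Python client), not a standard OpenAI field:

```python
def build_parse_request(prompt: str, image_url: str, model: str) -> dict:
    """Build an OpenAI-compatible chat request body for a vLLM server.

    skip_special_tokens=False is the workaround from this thread: it keeps
    the <COORD_nnn> special tokens in the generated output so grounding
    coordinates are not silently dropped.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "temperature": 0.0,
        # vLLM-specific sampling extension; with the openai client, pass
        # this via extra_body rather than as a top-level argument.
        "skip_special_tokens": False,
    }

# Placeholders only; substitute the PROMPT above and your deployment's names.
body = build_parse_request("<PROMPT from above>", "file:///path/page1.png",
                           "your-model-name")
print(body["skip_special_tokens"])  # False
```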

Is the correct vLLM version definitely 0.10.2? Starting it results in the error: ValueError: Failed to apply InternVLProcessor on data={'text': '

Updating to 0.17.0 gives a new error: TypeError: InternVLProcessor.__init__() got an unexpected keyword argument 'skip_special_tokens'
