
Open-Source Project Explained: Self-Operating Computer Framework # Long-termism # Value

Value: build mind maps of the main function and the business-logic functions to aid understanding, and contribute them back to the open-source project as a PR. The hope is to help readers understand how this kind of IPA (intelligent process automation) works; since there is no comparable high-quality open-source project in China, I am translating, analyzing, and interpreting this one as a modest starting point for discussion. The mind maps will be produced with ERNIE Bot (文心一言) and its mind-map plugin.

Repository: OthersideAI/self-operating-computer: A framework to enable multimodal models to operate a computer. (github.com)


Table of Contents

Overall code structure

Core code logic

  capture_screen_with_cursor # capture the screen, cursor included
  capture_mini_screenshot_with_cursor # save a mini screenshot together with a grid overlay
  add_grid_to_image # overlay a grid on an image
  keyboard_type # simulate keyboard input programmatically
  search # simulate an OS-level search: press the Start key (Windows) or Command+Space (macOS), type the given text, then press Enter
  extract_json_from_string / convert_percent_to_decimal # extract JSON from a string; convert a percentage string to a decimal
  draw_label_with_background # nested helper of add_grid_to_image that draws a percentage label on a white background at each grid intersection (Linux and macOS)
  click_at_percentage # click a position given as percentages of the screen
  mouse_click # parse a CLICK action's data and delegate to click_at_percentage
  summarize # generate a workflow summary with a pretrained model (gpt-4-vision-preview or gemini-pro-vision), feeding it a screenshot
  parse_response # parse the model's reply (CLICK, TYPE, SEARCH, DONE) into a dict of response type plus associated data
  get_next_action_from_gemini_pro_vision # generate the next action with gemini-pro-vision from a screenshot
  get_next_action_from_openai # generate the next action with OpenAI's GPT-4 vision model from a screenshot
  accurate_mode_double_check # in accurate mode, re-prompt gpt-4-vision-preview with a mini screenshot centered on the last click to fine-tune the click position
  get_last_assistant_message # retrieve the most recent assistant message from the messages array
  get_next_action # generate the next action given the model, messages, objective, and accurate-mode flag
  format_accurate_mode_vision_prompt # build the accurate-mode vision prompt from the previous click coordinates and the screen size
  format_vision_prompt # build the vision prompt from the objective and the previous action
  format_summary_prompt # build the summary prompt from the objective (called from summarize)
  main # entry point of the Self-Operating Computer
  validation # verify that the model, accurate mode, and voice mode are configured correctly
  ModelNotRecognizedException # Exception subclass raised when an unrecognized model is requested

Prompt definitions in the code

Common constants

API settings for the large language models

Imported packages

  dotenv # loads environment variables
  Xlib # interacts with the X Window System
  prompt_toolkit # builds powerful interactive command-line (text-terminal UI) applications in Python
  PyAutoGUI # simple, cross-platform keyboard and mouse automation: mouse control, typing, message boxes, screenshots, element location
  Pydantic # schema definition and validation for data interfaces
  Pygetwindow # window operations from Python

Business logic

Architecture and modules

Overall code structure

Core code logic

capture_screen_with_cursor # capture the screen, cursor included

```
def capture_screen_with_cursor(file_path):
    user_platform = platform.system()

    if user_platform == "Windows":
        screenshot = pyautogui.screenshot()
        screenshot.save(file_path)
    elif user_platform == "Linux":
        # Use xlib to prevent scrot dependency for Linux
        screen = Xlib.display.Display().screen()
        size = screen.width_in_pixels, screen.height_in_pixels
        monitor_size["width"] = size[0]
        monitor_size["height"] = size[1]
        screenshot = ImageGrab.grab(bbox=(0, 0, size[0], size[1]))
        screenshot.save(file_path)
    elif user_platform == "Darwin":  # (Mac OS)
        # Use the screencapture utility to capture the screen with the cursor
        subprocess.run(["screencapture", "-C", file_path])
    else:
        print(f"The platform you're using ({user_platform}) is not currently supported")
```

capture_mini_screenshot_with_cursor # save a mini screenshot together with a grid overlay

```
def capture_mini_screenshot_with_cursor(
    file_path=os.path.join("screenshots", "screenshot_mini.png"), x=0, y=0
):
    user_platform = platform.system()

    if user_platform == "Linux":
        x = float(x[:-1])  # convert x from "50%" to 50.
        y = float(y[:-1])
        x = (x / 100) * monitor_size["width"]  # convert x from 50 to 0.5 * monitor_width
        y = (y / 100) * monitor_size["height"]

        # Define the coordinates for the rectangle
        x1, y1 = int(x - ACCURATE_PIXEL_COUNT / 2), int(y - ACCURATE_PIXEL_COUNT / 2)
        x2, y2 = int(x + ACCURATE_PIXEL_COUNT / 2), int(y + ACCURATE_PIXEL_COUNT / 2)

        screenshot = ImageGrab.grab(bbox=(x1, y1, x2, y2))
        screenshot = screenshot.resize(
            (screenshot.width * 2, screenshot.height * 2), Image.LANCZOS
        )  # upscale the image so it's easier to see and percentage marks more visible
        screenshot.save(file_path)

        screenshots_dir = "screenshots"
        grid_screenshot_filename = os.path.join(
            screenshots_dir, "screenshot_mini_with_grid.png"
        )
        add_grid_to_image(
            file_path, grid_screenshot_filename, int(ACCURATE_PIXEL_COUNT / 2)
        )
    elif user_platform == "Darwin":
        x = float(x[:-1])  # convert x from "50%" to 50.
        y = float(y[:-1])
        x = (x / 100) * monitor_size["width"]  # convert x from 50 to 0.5 * monitor_width
        y = (y / 100) * monitor_size["height"]

        x1, y1 = int(x - ACCURATE_PIXEL_COUNT / 2), int(y - ACCURATE_PIXEL_COUNT / 2)
        width = ACCURATE_PIXEL_COUNT
        height = ACCURATE_PIXEL_COUNT

        # Use the screencapture utility to capture the screen with the cursor
        rect = f"-R{x1},{y1},{width},{height}"
        subprocess.run(["screencapture", "-C", rect, file_path])

        screenshots_dir = "screenshots"
        grid_screenshot_filename = os.path.join(
            screenshots_dir, "screenshot_mini_with_grid.png"
        )
        add_grid_to_image(
            file_path, grid_screenshot_filename, int(ACCURATE_PIXEL_COUNT / 2)
        )
```

add_grid_to_image # overlay a grid on an image

```
def add_grid_to_image(original_image_path, new_image_path, grid_interval):
    """
    Add a grid to an image
    """
    # Load the image
    image = Image.open(original_image_path)

    # Create a drawing object
    draw = ImageDraw.Draw(image)

    # Get the image size
    width, height = image.size

    # Reduce the font size a bit
    font_size = int(grid_interval / 10)  # Reduced font size

    # Calculate the background size based on the font size
    bg_width = int(font_size * 4.2)  # Adjust as necessary
    bg_height = int(font_size * 1.2)  # Adjust as necessary

    # Function to draw text with a white rectangle background
    def draw_label_with_background(
        position, text, draw, font_size, bg_width, bg_height
    ):
        # Adjust the position based on the background size
        text_position = (position[0] + bg_width // 2, position[1] + bg_height // 2)

        # Draw the text background
        draw.rectangle(
            [position[0], position[1], position[0] + bg_width, position[1] + bg_height],
            fill="white",
        )

        # Draw the text
        draw.text(text_position, text, fill="black", font_size=font_size, anchor="mm")

    # Draw vertical lines and labels at every `grid_interval` pixels
    for x in range(grid_interval, width, grid_interval):
        line = ((x, 0), (x, height))
        draw.line(line, fill="blue")
        for y in range(grid_interval, height, grid_interval):
            # Calculate the percentage of the width and height
            x_percent = round((x / width) * 100)
            y_percent = round((y / height) * 100)
            draw_label_with_background(
                (x - bg_width // 2, y - bg_height // 2),
                f"{x_percent}%,{y_percent}%",
                draw,
                font_size,
                bg_width,
                bg_height,
            )

    # Draw horizontal lines - labels are already added with vertical lines
    for y in range(grid_interval, height, grid_interval):
        line = ((0, y), (width, y))
        draw.line(line, fill="blue")

    # Save the image with the grid
    image.save(new_image_path)
```

keyboard_type # simulate keyboard input programmatically

```
def keyboard_type(text):
    text = text.replace("\\n", "\n")
    for char in text:
        pyautogui.write(char)
    pyautogui.press("enter")
    return "Type: " + text
```

search # simulate an OS-level search: press the Start key (Windows) or Command+Space (macOS), type the given text, then press Enter

```
def search(text):
    if platform.system() == "Windows":
        pyautogui.press("win")
    elif platform.system() == "Linux":
        pyautogui.press("win")
    else:
        # Press and release Command and Space separately
        pyautogui.keyDown("command")
        pyautogui.press("space")
        pyautogui.keyUp("command")

    time.sleep(1)

    # Now type the text
    for char in text:
        pyautogui.write(char)

    pyautogui.press("enter")
    return "Open program: " + text
```

extract_json_from_string / convert_percent_to_decimal # extract JSON from a string; convert a percentage string to a decimal

```
def extract_json_from_string(s):
    # print("extracting json from string", s)
    try:
        # Find the start of the JSON structure
        json_start = s.find("{")
        if json_start == -1:
            return None

        # Extract the JSON part and convert it to a dictionary
        json_str = s[json_start:]
        return json.loads(json_str)
    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return None


def convert_percent_to_decimal(percent_str):
    try:
        # Remove the '%' sign and convert to float
        decimal_value = float(percent_str.strip("%"))

        # Convert to decimal (e.g., 20% -> 0.20)
        return decimal_value / 100
    except ValueError as e:
        print(f"Error converting percent to decimal: {e}")
        return None
```
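To make these two helpers concrete, here is a small usage sketch; the sample string is hypothetical, shaped like the CLICK responses defined by VISION_PROMPT further below:

```
# A hypothetical model reply in the CLICK format used by this project.
raw = 'CLICK { "x": "50%", "y": "60%", "description": "Search field" }'

# extract_json_from_string finds the first "{" and parses everything from there.
data = extract_json_from_string(raw)
print(data)  # {'x': '50%', 'y': '60%', 'description': 'Search field'}

print(convert_percent_to_decimal(data["x"]))  # 0.5
print(convert_percent_to_decimal("oops"))     # prints an error and returns None
```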

draw_label_with_background # nested helper inside add_grid_to_image that draws a percentage label over a white background rectangle at each grid intersection (used for screenshots on both Linux and macOS)

```
# Excerpt from add_grid_to_image above: the nested labeling helper.
def draw_label_with_background(
    position, text, draw, font_size, bg_width, bg_height
):
    # Adjust the position based on the background size
    text_position = (position[0] + bg_width // 2, position[1] + bg_height // 2)

    # Draw the text background
    draw.rectangle(
        [position[0], position[1], position[0] + bg_width, position[1] + bg_height],
        fill="white",
    )

    # Draw the text
    draw.text(text_position, text, fill="black", font_size=font_size, anchor="mm")
```

The grid-drawing loops that call this helper at every intersection are the ones already shown in add_grid_to_image above, so they are not repeated here.

click_at_percentage # click a position given as percentages of the screen

```
def click_at_percentage(
    x_percentage, y_percentage, duration=0.2, circle_radius=50, circle_duration=0.5
):
    # Get the size of the primary monitor
    screen_width, screen_height = pyautogui.size()

    # Calculate the x and y coordinates in pixels
    x_pixel = int(screen_width * float(x_percentage))
    y_pixel = int(screen_height * float(y_percentage))

    # Move to the position smoothly
    pyautogui.moveTo(x_pixel, y_pixel, duration=duration)

    # Circular movement
    start_time = time.time()
    while time.time() - start_time < circle_duration:
        angle = ((time.time() - start_time) / circle_duration) * 2 * math.pi
        x = x_pixel + math.cos(angle) * circle_radius
        y = y_pixel + math.sin(angle) * circle_radius
        pyautogui.moveTo(x, y, duration=0.1)

    # Finally, click
    pyautogui.click(x_pixel, y_pixel)
    return "Successfully clicked"
```

mouse_click # parse a CLICK action's data (percentage coordinates) and delegate the actual click to click_at_percentage

```
def mouse_click(click_detail):
    try:
        x = convert_percent_to_decimal(click_detail["x"])
        y = convert_percent_to_decimal(click_detail["y"])

        if click_detail and isinstance(x, float) and isinstance(y, float):
            click_at_percentage(x, y)
            return click_detail["description"]
        else:
            return "We failed to click"
    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return "We failed to click"
```
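For reference, a minimal sketch of the dict mouse_click expects, matching the CLICK payload that parse_response produces (the values are hypothetical, and running this really does move and click the mouse):

```
click_detail = {
    "x": "50%",   # percentage strings, exactly as emitted by the model
    "y": "60%",
    "description": "Click: Google Search field",
    "reason": "This will allow me to search",
}

# convert_percent_to_decimal turns "50%" into 0.5, then click_at_percentage
# maps that onto the real screen resolution before clicking.
result = mouse_click(click_detail)
print(result)  # the description on success, "We failed to click" otherwise
```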

summarize # generate a summary of the completed workflow with a pretrained model, feeding it a fresh screenshot as input; it can use either of two models, `gpt-4-vision-preview` or `gemini-pro-vision`

```
def summarize(model, messages, objective):
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)

        screenshot_filename = os.path.join(screenshots_dir, "summary_screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)

        summary_prompt = format_summary_prompt(objective)

        if model == "gpt-4-vision-preview":
            with open(screenshot_filename, "rb") as img_file:
                img_base64 = base64.b64encode(img_file.read()).decode("utf-8")

            summary_message = {
                "role": "user",
                "content": [
                    {"type": "text", "text": summary_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ],
            }
            # create a copy of messages and save to pseudo_messages
            messages.append(summary_message)

            response = client.chat.completions.create(
                model="gpt-4-vision-preview",
                messages=messages,
                max_tokens=500,
            )
            content = response.choices[0].message.content
        elif model == "gemini-pro-vision":
            model = genai.GenerativeModel("gemini-pro-vision")
            summary_message = model.generate_content(
                [summary_prompt, Image.open(screenshot_filename)]
            )
            content = summary_message.text

        return content
    except Exception as e:
        print(f"Error in summarize: {e}")
        return "Failed to summarize the workflow"
```

parse_response # parse the model's reply from the conversation. It recognizes the different response types (click, type text, search query, or done) and returns a dict containing a string for the response type plus the data associated with that type.

```
def parse_response(response):
    if response == "DONE":
        return {"type": "DONE", "data": None}
    elif response.startswith("CLICK"):
        # Adjust the regex to match the correct format
        click_data = re.search(r"CLICK \{ (.+) \}", response).group(1)
        click_data_json = json.loads(f"{{{click_data}}}")
        return {"type": "CLICK", "data": click_data_json}
    elif response.startswith("TYPE"):
        # Extract the text to type
        try:
            type_data = re.search(r"TYPE (.+)", response, re.DOTALL).group(1)
        except:
            type_data = re.search(r'TYPE "(.+)"', response, re.DOTALL).group(1)
        return {"type": "TYPE", "data": type_data}
    elif response.startswith("SEARCH"):
        # Extract the search query
        try:
            search_data = re.search(r'SEARCH "(.+)"', response).group(1)
        except:
            search_data = re.search(r"SEARCH (.+)", response).group(1)
        return {"type": "SEARCH", "data": search_data}

    return {"type": "UNKNOWN", "data": response}
```
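A quick sketch of the branches, using hypothetical model outputs in the formats that VISION_PROMPT defines:

```
print(parse_response("DONE"))
# {'type': 'DONE', 'data': None}

print(parse_response('CLICK { "x": "50%", "y": "60%", "description": "d", "reason": "r" }'))
# {'type': 'CLICK', 'data': {'x': '50%', 'y': '60%', 'description': 'd', 'reason': 'r'}}

print(parse_response('TYPE "hello world"'))
# {'type': 'TYPE', 'data': '"hello world"'}  (the first regex keeps the quotes)

print(parse_response("SEARCH Spotify"))
# {'type': 'SEARCH', 'data': 'Spotify'}  (falls back to the unquoted regex)

print(parse_response("gibberish"))
# {'type': 'UNKNOWN', 'data': 'gibberish'}
```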

get_next_action_from_gemini_pro_vision # generate the next action with the pretrained `gemini-pro-vision` model, capturing a screenshot and feeding it to the model as input.

```
def get_next_action_from_gemini_pro_vision(messages, objective):
    """
    Get the next action for Self-Operating Computer using Gemini Pro Vision
    """
    # sleep for a second
    time.sleep(1)
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)

        screenshot_filename = os.path.join(screenshots_dir, "screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)

        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_with_grid.png"
        )
        add_grid_to_image(screenshot_filename, new_screenshot_filename, 500)
        # sleep for a second
        time.sleep(1)

        previous_action = get_last_assistant_message(messages)
        vision_prompt = format_vision_prompt(objective, previous_action)

        model = genai.GenerativeModel("gemini-pro-vision")
        response = model.generate_content(
            [vision_prompt, Image.open(new_screenshot_filename)]
        )

        # create a copy of messages and save to pseudo_messages
        pseudo_messages = messages.copy()
        pseudo_messages.append(response.text)

        messages.append(
            {
                "role": "user",
                "content": "`screenshot.png`",
            }
        )
        content = response.text[1:]
        return content
    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return "Failed take action after looking at the screenshot"
```

get_next_action_from_openai # generate the next action with OpenAI's GPT-4 vision model, capturing a screenshot and feeding it to the model as input.

# (A step that get_next_action_from_gemini_pro_vision does not have:) finally, if `accurate_mode` is set to `True` and the response is a CLICK, it calls `accurate_mode_double_check` to run the model again over a mini screenshot centered on the proposed click, refining the coordinates into a more accurate result.

```
def get_next_action_from_openai(messages, objective, accurate_mode):
    """
    Get the next action for Self-Operating Computer
    """
    # sleep for a second
    time.sleep(1)
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)

        screenshot_filename = os.path.join(screenshots_dir, "screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)

        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_with_grid.png"
        )
        add_grid_to_image(screenshot_filename, new_screenshot_filename, 500)
        # sleep for a second
        time.sleep(1)

        with open(new_screenshot_filename, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")

        previous_action = get_last_assistant_message(messages)
        vision_prompt = format_vision_prompt(objective, previous_action)

        vision_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": vision_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
            ],
        }

        # create a copy of messages and save to pseudo_messages
        pseudo_messages = messages.copy()
        pseudo_messages.append(vision_message)

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=pseudo_messages,
            presence_penalty=1,
            frequency_penalty=1,
            temperature=0.7,
            max_tokens=300,
        )

        messages.append(
            {
                "role": "user",
                "content": "`screenshot.png`",
            }
        )

        content = response.choices[0].message.content

        if accurate_mode:
            if content.startswith("CLICK"):
                # Adjust pseudo_messages to include the accurate_mode_message
                click_data = re.search(r"CLICK \{ (.+) \}", content).group(1)
                click_data_json = json.loads(f"{{{click_data}}}")
                prev_x = click_data_json["x"]
                prev_y = click_data_json["y"]

                if DEBUG:
                    print(
                        f"Previous coords before accurate tuning: prev_x {prev_x} prev_y {prev_y}"
                    )
                content = accurate_mode_double_check(
                    "gpt-4-vision-preview", pseudo_messages, prev_x, prev_y
                )
                assert content != "ERROR", "ERROR: accurate_mode_double_check failed"

        return content
    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return "Failed take action after looking at the screenshot"
```

accurate_mode_double_check # in accurate mode, re-prompt the pretrained `gpt-4-vision-preview` model with an additional mini screenshot centered on the cursor, so it can further fine-tune the click location.

```
def accurate_mode_double_check(model, pseudo_messages, prev_x, prev_y):
    """
    Reprompt OAI with additional screenshot of a mini screenshot centered around the cursor
    for further finetuning of clicked location
    """
    print("[get_next_action_from_gemini_pro_vision] accurate_mode_double_check")
    try:
        screenshot_filename = os.path.join("screenshots", "screenshot_mini.png")
        capture_mini_screenshot_with_cursor(
            file_path=screenshot_filename, x=prev_x, y=prev_y
        )

        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_mini_with_grid.png"
        )

        with open(new_screenshot_filename, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")

        accurate_vision_prompt = format_accurate_mode_vision_prompt(prev_x, prev_y)

        accurate_mode_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": accurate_vision_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
            ],
        }

        pseudo_messages.append(accurate_mode_message)

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=pseudo_messages,
            presence_penalty=1,
            frequency_penalty=1,
            temperature=0.7,
            max_tokens=300,
        )

        content = response.choices[0].message.content
        return content  # missing in the pasted snippet; without it the success path returns None
    except Exception as e:
        print(f"Error reprompting model for accurate_mode: {e}")
        return "ERROR"
```

get_last_assistant_message # retrieve the most recent assistant message from the messages array

```
def get_last_assistant_message(messages):
    """
    Retrieve the last message from the assistant in the messages array.
    If the last assistant message is the first message in the array, return None.
    """
    for index in reversed(range(len(messages))):
        if messages[index]["role"] == "assistant":
            if index == 0:  # Check if the assistant message is the first in the array
                return None
            else:
                return messages[index]
    return None  # Return None if no assistant message is found
```
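A small sketch of its behavior with a hypothetical messages array:

```
messages = [
    {"role": "assistant", "content": USER_QUESTION},  # index 0 is deliberately ignored
    {"role": "user", "content": "Objective: open Spotify"},
    {"role": "assistant", "content": "Open program: Spotify"},
]

print(get_last_assistant_message(messages))
# {'role': 'assistant', 'content': 'Open program: Spotify'}

# With only the greeting present, the sole assistant message sits at index 0,
# so the function returns None (there is no real "previous action" yet).
print(get_last_assistant_message(messages[:2]))  # None
```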

get_next_action # generate the next action given the model, the messages array, the objective, and the accurate-mode flag.

```
def get_next_action(model, messages, objective, accurate_mode):
    if model == "gpt-4-vision-preview":
        content = get_next_action_from_openai(messages, objective, accurate_mode)
        return content
    elif model == "agent-1":
        return "coming soon"
    elif model == "gemini-pro-vision":
        content = get_next_action_from_gemini_pro_vision(
            messages, objective
        )
        return content

    raise ModelNotRecognizedException(model)
```

format_accurate_mode_vision_prompt # build the accurate-mode vision prompt from the previous click coordinates and the screen size

In short, this function uses the screen size and the coordinates of the previous click to build a formatted accurate-mode vision prompt, which is then fed to the GPT-4 model as input.

```
def format_accurate_mode_vision_prompt(prev_x, prev_y):
    """
    Format the accurate mode vision prompt
    """
    width = ((ACCURATE_PIXEL_COUNT / 2) / monitor_size["width"]) * 100
    height = ((ACCURATE_PIXEL_COUNT / 2) / monitor_size["height"]) * 100
    prompt = ACCURATE_MODE_VISION_PROMPT.format(
        prev_x=prev_x, prev_y=prev_y, width=width, height=height
    )
    return prompt
```

format_vision_prompt # build the vision prompt from the objective and the previous action

```
def format_vision_prompt(objective, previous_action):
    """
    Format the vision prompt
    """
    if previous_action:
        previous_action = f"Here was the previous action you took: {previous_action}"
    else:
        previous_action = ""

    prompt = VISION_PROMPT.format(objective=objective, previous_action=previous_action)
    return prompt
```

format_summary_prompt # build the summary prompt from the objective; called as a helper from the summarize function.

```
def format_summary_prompt(objective):
    """
    Format the summary prompt
    """
    prompt = SUMMARY_PROMPT.format(objective=objective)
    return prompt
```

1. It substitutes the objective into the `{objective}` placeholder of the `SUMMARY_PROMPT` template using `str.format`.

2. It returns the resulting prompt string.

In short, this function builds a formatted summary prompt from the objective, used as input to the GPT-4 or Gemini Pro Vision model.

main # the entry point of the Self-Operating Computer

```
def main(model, accurate_mode, terminal_prompt, voice_mode=False):
    """
    Main function for the Self-Operating Computer
    """
    mic = None
    # Initialize `WhisperMic`, if `voice_mode` is True
    validation(model, accurate_mode, voice_mode)

    if voice_mode:
        try:
            from whisper_mic import WhisperMic

            # Initialize WhisperMic if import is successful
            mic = WhisperMic()
        except ImportError:
            print(
                "Voice mode requires the 'whisper_mic' module. Please install it using 'pip install -r requirements-audio.txt'"
            )
            sys.exit(1)

    # Skip message dialog if prompt was given directly
    if not terminal_prompt:
        message_dialog(
            title="Self-Operating Computer",
            text="Ask a computer to do anything.",
            style=style,
        ).run()
    else:
        print("Running direct prompt...")

    print("SYSTEM", platform.system())
    # Clear the console
    if platform.system() == "Windows":
        os.system("cls")
    else:
        print("\033c", end="")

    if terminal_prompt:  # Skip objective prompt if it was given as an argument
        objective = terminal_prompt
    elif voice_mode:
        print(
            f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RESET} Listening for your command... (speak now)"
        )
        try:
            objective = mic.listen()
        except Exception as e:
            print(f"{ANSI_RED}Error in capturing voice input: {e}{ANSI_RESET}")
            return  # Exit if voice input fails
    else:
        print(f"{ANSI_GREEN}[Self-Operating Computer]\n{ANSI_RESET}{USER_QUESTION}")
        print(f"{ANSI_YELLOW}[User]{ANSI_RESET}")
        objective = prompt(style=style)

    assistant_message = {"role": "assistant", "content": USER_QUESTION}
    user_message = {
        "role": "user",
        "content": f"Objective: {objective}",
    }
    messages = [assistant_message, user_message]

    loop_count = 0

    while True:
        if DEBUG:
            print("[loop] messages before next action:\n\n\n", messages[1:])
        try:
            response = get_next_action(model, messages, objective, accurate_mode)
            action = parse_response(response)
            action_type = action.get("type")
            action_detail = action.get("data")
        except ModelNotRecognizedException as e:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
            )
            break
        except Exception as e:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
            )
            break

        if action_type == "DONE":
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BLUE} Objective complete {ANSI_RESET}"
            )
            summary = summarize(model, messages, objective)
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BLUE} Summary\n{ANSI_RESET}{summary}"
            )
            break

        if action_type != "UNKNOWN":
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA} [Act] {action_type} {ANSI_RESET}{action_detail}"
            )

        function_response = ""
        if action_type == "SEARCH":
            function_response = search(action_detail)
        elif action_type == "TYPE":
            function_response = keyboard_type(action_detail)
        elif action_type == "CLICK":
            function_response = mouse_click(action_detail)
        else:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] something went wrong :({ANSI_RESET}"
            )
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response\n{ANSI_RESET}{response}"
            )
            break

        print(
            f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA} [Act] {action_type} COMPLETE {ANSI_RESET}{function_response}"
        )

        message = {
            "role": "assistant",
            "content": function_response,
        }
        messages.append(message)

        loop_count += 1
        if loop_count > 15:
            break
```

validation # verify that the model, accurate mode, and voice mode are configured correctly

```
def validation(
    model,
    accurate_mode,
    voice_mode,
):
    if accurate_mode and model != "gpt-4-vision-preview":
        print("To use accuracy mode, please use gpt-4-vision-preview")
        sys.exit(1)

    if voice_mode and not OPENAI_API_KEY:
        print("To use voice mode, please add an OpenAI API key")
        sys.exit(1)

    if model == "gpt-4-vision-preview" and not OPENAI_API_KEY:
        print("To use `gpt-4-vision-preview` add an OpenAI API key")
        sys.exit(1)

    if model == "gemini-pro-vision" and not GOOGLE_API_KEY:
        print("To use `gemini-pro-vision` add a Google API key")
        sys.exit(1)
```

1. First, it checks whether `accurate_mode` is `True` while the model is not `"gpt-4-vision-preview"`; if so, it prints an error message and exits.

2. Next, it checks whether `voice_mode` is `True` without an OpenAI API key configured; if so, it prints an error message and exits.

3. Then, it checks whether the model is `"gpt-4-vision-preview"` without an OpenAI API key; if so, it prints an error message and exits.

4. Finally, it checks whether the model is `"gemini-pro-vision"` without a Google API key; if so, it prints an error message and exits.

In short, this function ensures that the model, accurate mode, and voice mode are properly configured before the program runs.

ModelNotRecognizedException # a subclass of the built-in `Exception`, raised when an unrecognized model is encountered. (The snippet below also carries the prompt_toolkit style definition and the ANSI color setup that follow it in the source file.)

```
class ModelNotRecognizedException(Exception):
    """Exception raised for unrecognized models."""

    def __init__(self, model, message="Model not recognized"):
        self.model = model
        self.message = message
        super().__init__(self.message)

    def __str__(self):
        return f"{self.message} : {self.model} "


# Define style
style = PromptStyle.from_dict(
    {
        "dialog": "bg:#88ff88",
        "button": "bg:#ffffff #000000",
        "dialog.body": "bg:#44cc44 #ffffff",
        "dialog shadow": "bg:#003800",
    }
)


# Check if on a windows terminal that supports ANSI escape codes
def supports_ansi():
    """
    Check if the terminal supports ANSI escape codes
    """
    plat = platform.system()
    supported_platform = plat != "Windows" or "ANSICON" in os.environ
    is_a_tty = hasattr(sys.stdout, "isatty") and sys.stdout.isatty()
    return supported_platform and is_a_tty


if supports_ansi():
    # Standard green text
    ANSI_GREEN = "\033[32m"
    # Bright/bold green text
    ANSI_BRIGHT_GREEN = "\033[92m"
    # Reset to default text color
    ANSI_RESET = "\033[0m"
    # ANSI escape code for blue text
    ANSI_BLUE = "\033[94m"  # This is for bright blue
    # Standard yellow text
    ANSI_YELLOW = "\033[33m"
    ANSI_RED = "\033[31m"
    # Bright magenta text
    ANSI_BRIGHT_MAGENTA = "\033[95m"
else:
    ANSI_GREEN = ""
    ANSI_BRIGHT_GREEN = ""
    ANSI_RESET = ""
    ANSI_BLUE = ""
    ANSI_YELLOW = ""
    ANSI_RED = ""
    ANSI_BRIGHT_MAGENTA = ""
```


Prompt definitions in the code

```
SUMMARY_PROMPT = """
You are a Self-Operating Computer. A user request has been executed. Present the results succinctly.
Include the following key contexts of the completed request:
1. State the original objective.
2. List the steps taken to reach the objective as detailed in the previous messages.
3. Reference the screenshot that was used.
Summarize the actions taken to fulfill the objective. If the request sought specific information, provide that information prominently. NOTE: Address directly any question posed by the user.
Remember: The user will not interact with this summary. You are solely reporting the outcomes.
Original objective: {objective}
Display the results clearly:
"""
```

This multi-line string is the Self-Operating Computer's summary prompt. It guides the model to produce a clear, concise report once a user request has been executed.

The prompt opens by casting the model as a Self-Operating Computer and asking it to present the results of the completed request succinctly, with the key context included.

It names three key points to cover:

1. The original objective of the user request.
2. The steps taken to reach that objective, as detailed in the previous messages.
3. A reference to the screenshot that was used.

It then asks the model to summarize the actions taken to fulfill the objective, to surface prominently any specific information the request asked for, and to answer directly any question the user posed.

Finally, it reminds the model that the user will not interact with this summary (it is purely a report of outcomes) and closes by inserting the original objective for reference.

```
USER_QUESTION = "Hello, I can help you with anything. What would you like done?"
```

```
ACCURATE_MODE_VISION_PROMPT = """
It looks like your previous attempted action was clicking on "x": {prev_x}, "y": {prev_y}. This has now been moved to the center of this screenshot.
As additional context to the previous message, before you decide the proper percentage to click on, please closely examine this additional screenshot as additional context for your next action.
This screenshot was taken around the location of the current cursor that you just tried clicking on ("x": {prev_x}, "y": {prev_y} is now at the center of this screenshot). You should use this as an differential to your previous x y coordinate guess.
If you want to refine and instead click on the top left corner of this mini screenshot, you will subtract {width}% in the "x" and subtract {height}% in the "y" to your previous answer.
Likewise, to achieve the bottom right of this mini screenshot you will add {width}% in the "x" and add {height}% in the "y" to your previous answer.
There are four segmenting lines across each dimension, divided evenly. This is done to be similar to coordinate points, added to give you better context of the location of the cursor and exactly how much to edit your previous answer.
Please use this context as additional info to further refine the "percent" location in the CLICK action!
"""
```

```
VISION_PROMPT = """
You are a Self-Operating Computer. You use the same operating system as a human.
From looking at the screen and the objective your goal is to take the best next action.
To operate the computer you have the four options below.
1. CLICK - Move mouse and click
2. TYPE - Type on the keyboard
3. SEARCH - Search for a program on Mac and open it
4. DONE - When you completed the task respond with the exact following phrase content
Here are the response formats below.
1. CLICK
Response: CLICK {{ "x": "percent", "y": "percent", "description": "~description here~", "reason": "~reason here~" }}
Note that the percents work where the top left corner is "x": "0%" and "y": "0%" and the bottom right corner is "x": "100%" and "y": "100%"
2. TYPE
Response: TYPE "value you want to type"
2. SEARCH
Response: SEARCH "app you want to search for on Mac"
3. DONE
Response: DONE
Here are examples of how to respond.
__
Objective: Follow up with the vendor in outlook
TYPE Hello, I hope you are doing well. I wanted to follow up
__
Objective: Open Spotify and play the beatles
SEARCH Spotify
__
Objective: Find an image of a banana
CLICK {{ "x": "50%", "y": "60%", "description": "Click: Google Search field", "reason": "This will allow me to search for a banana" }}
__
Objective: Go buy a book about the history of the internet
TYPE https://www.amazon.com/
__
A few important notes:
- Default to opening Google Chrome with SEARCH to find things that are on the internet.
- Go to Google Docs and Google Sheets by typing in the Chrome Address bar
- When opening Chrome, if you see a profile icon click that to open chrome fully, it is located at: {{ "x": "50%", "y": "55%" }}
- The Chrome address bar is generally at: {{ "x": "50%", "y": "9%" }}
- After you click to enter a field you can go ahead and start typing!
- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.
{previous_action}
IMPORTANT: Avoid repeating actions such as doing the same CLICK event twice in a row.
Objective: {objective}
"""
```

The original post followed this with a line-by-line Chinese rendering of VISION_PROMPT; in English that would simply restate the prompt above, so a short recap will do. The prompt casts the model as a Self-Operating Computer that, given the current screen and the objective, must choose the best next action from four options: CLICK (move the mouse and click), TYPE (type on the keyboard), SEARCH (find and open a program on the Mac), and DONE (the task is finished). It pins down an exact response format for each: CLICK returns percentage coordinates (the top-left corner is "x": "0%", "y": "0%"; the bottom-right is "x": "100%", "y": "100%") plus a description and a reason; TYPE and SEARCH return a quoted string; DONE is the bare word. It then gives one worked example per action, followed by practical notes: default to opening Google Chrome via SEARCH for anything on the internet, reach Google Docs and Google Sheets by typing in the Chrome address bar, click the profile icon (around "x": "50%", "y": "55%") if Chrome opens to a profile picker, expect the address bar near "x": "50%", "y": "9%", start typing right after clicking into a field, and never respond that it is unable to assist. The {previous_action} and {objective} placeholders are filled in by format_vision_prompt, and the prompt closes with a warning not to repeat the same CLICK action twice in a row.


Common constants

```
ACCURATE_PIXEL_COUNT = (
    200  # mini_screenshot is ACCURATE_PIXEL_COUNT x ACCURATE_PIXEL_COUNT big
)

monitor_size = {
    "width": 1920,
    "height": 1080,
}
```
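Working the accurate-mode arithmetic through with these defaults (note that capture_screen_with_cursor overwrites monitor_size with the real resolution on Linux, so the numbers can differ at runtime):

```
ACCURATE_PIXEL_COUNT = 200
monitor_size = {"width": 1920, "height": 1080}

# Same computation as format_accurate_mode_vision_prompt:
width = ((ACCURATE_PIXEL_COUNT / 2) / monitor_size["width"]) * 100
height = ((ACCURATE_PIXEL_COUNT / 2) / monitor_size["height"]) * 100

print(round(width, 2), round(height, 2))  # 5.21 9.26
# i.e. the corners of the 200x200 mini screenshot sit about 5.21% horizontally
# and 9.26% vertically away from the model's previous click guess.
```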

API settings for the large language models

```
DEBUG = False

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

if OPENAI_API_KEY:
    client = OpenAI()
    client.api_key = OPENAI_API_KEY
    client.base_url = os.getenv("OPENAI_API_BASE_URL", client.base_url)
```

Imported packages

```
"""
Self-Operating Computer
"""
import os
import time
import base64
import json
import math
import re
import subprocess
import pyautogui
import argparse
import platform
import Xlib.display
import Xlib.X
import Xlib.Xutil  # not sure if Xutil is necessary
import google.generativeai as genai
from prompt_toolkit import prompt
from prompt_toolkit.shortcuts import message_dialog
from prompt_toolkit.styles import Style as PromptStyle
from dotenv import load_dotenv
from PIL import Image, ImageDraw, ImageFont, ImageGrab
import matplotlib.font_manager as fm
from openai import OpenAI
import sys
```

Further reading: python-dotenv的详细用法 (CSDN blog)

The dotenv package (library) is mainly used to load environment variables.
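A minimal sketch of the usual pattern, mirroring the os.getenv calls shown earlier (the .env contents here are hypothetical):

```
# .env in the project root (hypothetical):
#   OPENAI_API_KEY=sk-...
#   GOOGLE_API_KEY=...

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env and merges the variables into the process environment
print(os.getenv("OPENAI_API_KEY") is not None)
```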

Further reading: Xlib 函数库简介 / X Window 工作原理简介 (CSDN blog)

The Xlib package (library) is mainly used to interact with the X Window System.

In the X Window world, virtually everything is driven by events, for the X client and the X server alike. From the X client's point of view, every X application contains an event loop: the program waits quietly, and once Xlib intercepts an event belonging to that application and delivers it, the event triggers the corresponding action in the event loop; after handling it, the program returns to waiting for the next event. Many kinds of events can occur: messages from other windows, keyboard and mouse activity, the window manager asking a window to change its size or state, and so on.
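A minimal sketch of the one thing this project actually uses Xlib for, reading the screen geometry (Linux only, and it needs a running X server):

```
import Xlib.display

# Connect to the X server named by the DISPLAY environment variable.
screen = Xlib.display.Display().screen()

# The same fields capture_screen_with_cursor reads to size its grab.
print(screen.width_in_pixels, screen.height_in_pixels)
```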

Further reading: Python Module — prompt_toolkit CLI 库 (CSDN blog); python prompt toolkit, 用于构建功能强大的交互式命令行的python库 (360doc.com)

The prompt_toolkit package (library) is used to build powerful interactive command-line applications in Python, with text-terminal UIs.
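A minimal sketch of the two prompt_toolkit pieces this project touches, the full-screen message_dialog and the styled prompt:

```
from prompt_toolkit import prompt
from prompt_toolkit.shortcuts import message_dialog
from prompt_toolkit.styles import Style as PromptStyle

style = PromptStyle.from_dict({"dialog": "bg:#88ff88"})

# Full-screen dialog, like the greeting shown by main().
message_dialog(
    title="Self-Operating Computer",
    text="Ask a computer to do anything.",
    style=style,
).run()

# Styled line editor on the terminal; returns what the user typed.
objective = prompt("Objective: ", style=style)
print(objective)
```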

Further reading: Python自动操作 GUI 神器——PyAutoGUI (baidu.com); 一个神奇的GUI自动化测试库-PyAutoGui (CSDN blog; this one has many clear illustrated examples)

The PyAutoGUI package (library) is a simple, easy-to-use, cross-platform Python library for simulating keyboard and mouse input. It can control the mouse and keyboard, show message boxes, take screenshots, and locate elements on screen; people use it for everything from grinding in-game chests to auto-writing documents.
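A few of the calls this project relies on, as a quick sketch (running it will really move the cursor and type):

```
import pyautogui

print(pyautogui.size())                   # primary monitor resolution, e.g. Size(width=1920, height=1080)
pyautogui.moveTo(100, 200, duration=0.2)  # glide the cursor to (100, 200)
pyautogui.click()                         # click at the current position
pyautogui.write("hello", interval=0.05)   # type text character by character
pyautogui.press("enter")                  # press and release a single key
img = pyautogui.screenshot()              # grab the screen as a PIL image
```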

Further reading: Python笔记:Pydantic库简介 (CSDN blog; includes clear demos)

The Pydantic package (library) is a popular library for defining and validating data-interface schemas. With pydantic, data interfaces can be defined and used in a more disciplined way, which is particularly friendly to large projects.
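Pydantic does not actually appear in the import list above, but since the post introduces it, here is a minimal hedged sketch; the ClickAction model is hypothetical, shaped after this project's CLICK payload (pydantic v2 syntax; on v1 use .dict() instead of .model_dump()):

```
from pydantic import BaseModel, ValidationError

class ClickAction(BaseModel):
    x: str            # e.g. "50%"
    y: str
    description: str

try:
    action = ClickAction(x="50%", y="60%", description="Search field")
    print(action.model_dump())
except ValidationError as e:
    print(e)  # raised if a field is missing or has the wrong type
```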

Further reading: python操作windows窗口,python库pygetwindow使用详解 (CSDN blog)

The Pygetwindow package (library) provides methods and properties that make it easy to perform all kinds of window operations from a Python program.
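A small sketch of typical calls; window control beyond listing titles is most complete on Windows, so treat this as illustrative:

```
import pygetwindow as gw

print(gw.getAllTitles()[:5])             # titles of currently open windows

wins = gw.getWindowsWithTitle("Chrome")  # windows whose title contains "Chrome"
if wins:
    win = wins[0]
    print(win.size, win.topleft)         # geometry of the first match
    win.activate()                       # bring it to the foreground
```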

Business logic

Architecture and modules
