LLM-CodeSlim: Automatic Code Minification and Cleanup for Efficient Use with LLMs

Large language models (LLMs) have a limited context window. When asking a question, it is often impossible to paste the entire source, so the code from different files has to be combined in one place.

For this reason I wrote a script that minifies a project's source code by stripping indentation, tabs, blank lines, comment lines, and test functions. The script collects all of the project's files, or only selected ones, in one place.
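The cleanup rules described above can be sketched on a single input stream. This is a minimal illustration, not the published script: the function name slim_stream is hypothetical, and the comment markers ("#", "//", "/*") are hard-coded here for brevity instead of being read from an array.

```shell
#!/bin/bash
# Hypothetical helper (not part of the published script): reads source text
# on stdin and prints the slimmed version on stdout.
slim_stream() {
    local stop_word="$1"   # e.g. '#[cfg(test)]'
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        # Drop every tab, then trim leading and trailing whitespace
        line="${line//$'\t'/}"
        line="${line#"${line%%[![:space:]]*}"}"
        line="${line%"${line##*[![:space:]]}"}"
        # Skip blank lines
        if [ -z "$line" ]; then
            continue
        fi
        # Everything from the stop word onward is discarded
        if [ "$line" = "$stop_word" ]; then
            break
        fi
        # Skip lines that start with a comment marker (assumed set of markers)
        case "$line" in
            '#'*|'//'*|'/*'*) continue ;;
        esac
        printf '%s\n' "$line"
    done
}
```

For example, `slim_stream '#[cfg(test)]' < src/lib.rs` would print the file without indentation, blank lines, or comment lines, and stop at the test module. Note that with "#" among the comment markers, Rust attribute lines such as #[derive(Debug)] are skipped as well, which may or may not be what you want.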

To use it, run the script in your project directory. It generates a minified file, out.txt, containing the cleaned-up code, ready to be used with large language models.
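Once out.txt exists, it can help to check roughly how much of the context window it will take. The sketch below is only an approximation: the function name estimate_tokens is hypothetical, and the ratio of about 4 bytes per token is an assumed rule of thumb. Real tokenizers differ, especially on source code.

```shell
#!/bin/bash
# Hypothetical helper: rough token estimate for a file.
# ASSUMPTION: about 4 bytes per token. This is a rule of thumb only;
# the actual count depends on the model's tokenizer.
estimate_tokens() {
    local file="$1"
    local bytes
    bytes=$(wc -c < "$file")
    # Integer division, rounded up
    echo $(( (bytes + 3) / 4 ))
}
```

Running `estimate_tokens out.txt` prints a single number that can be compared against the context size of the model you plan to use.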

Before running the script, edit the following arrays to match your project: folders_to_ignore, extensions_to_search, filenames_to_search, comment_chars, and stop_words.

Example configuration for a Rust project (include all *.rs files in out.txt):

folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")   # Folders to ignore
extensions_to_search=("rs")                                          # File extensions to search for
filenames_to_search=("Cargo.toml")                                   # Filenames to search for
comment_chars=("#" "//" "/*")                                        # Characters that denote comments
stop_words=("#[cfg(test)]")                                          # Stop words after which to ignore the remaining lines in the file

Example configuration for a Rust project (include only specific files in out.txt):

folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")   # Folders to ignore
extensions_to_search=()                                              # File extensions to search for
filenames_to_search=("Cargo.toml" "lib.rs" "core.rs")                # Filenames to search for
comment_chars=("#" "//" "/*")                                        # Characters that denote comments
stop_words=("#[cfg(test)]")                                          # Stop words after which to ignore the remaining lines in the file

Bash version of the script:

#!/bin/bash

# Remove existing out.txt if it exists
rm -f out.txt

# Arrays
folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")   # Folders to ignore
extensions_to_search=("rs")                                          # File extensions to search for
filenames_to_search=("Cargo.toml" "core.rs" "text.rs" "json.rs")     # Filenames to search for
comment_chars=("#" "//" "/*")                                        # Characters that denote comments
stop_words=("#[cfg(test)]")                                          # Stop words after which to ignore the remaining lines in the file

# Build the 'find' command

# Start with the basic 'find' command
find_cmd="find ."

# Add folders to ignore
if [ ${#folders_to_ignore[@]} -gt 0 ]; then
    ignore_dir_expr=""
    for dir in "${folders_to_ignore[@]}"; do
        if [ -n "$ignore_dir_expr" ]; then
            ignore_dir_expr+=" -o "
        fi
        ignore_dir_expr+="-path './$dir' -prune"
    done
    find_cmd+=" \\( $ignore_dir_expr \\) -o"
fi

# Add conditions to search for files
find_cmd+=" \\( "

name_patterns=()

# Add file extensions
for ext in "${extensions_to_search[@]}"; do
    name_patterns+=("-name '*.$ext'")
done

# Add filenames
for fname in "${filenames_to_search[@]}"; do
    name_patterns+=("-name '$fname'")
done

# Combine all patterns using -o
for ((i=0; i<${#name_patterns[@]}; i++)); do
    find_cmd+=" ${name_patterns[$i]}"
    if [ $i -lt $((${#name_patterns[@]} - 1)) ]; then
        find_cmd+=" -o"
    fi
done

find_cmd+=" \\) -type f -print"

# Print the final command for debugging (you can comment out this line)
# echo "Running command: $find_cmd"

# Build the regular expression for comments
comment_pattern=""
for ((i=0; i<${#comment_chars[@]}; i++)); do
    # Escape special characters in comment characters
    escaped_char=$(printf '%s\n' "${comment_chars[$i]}" | sed 's/[][(){}.*+?^$\\|/]/\\&/g')
    if [ $i -eq 0 ]; then
        comment_pattern="$escaped_char"
    else
        comment_pattern="$comment_pattern|$escaped_char"
    fi
done

# Execute the 'find' command and process the results
# (IFS= and -r keep leading spaces and backslashes in file paths intact)
while IFS= read -r filepath; do
    echo -e "\n#### $filepath ####" >> out.txt
    stop=false
    # Process the file line by line
    while IFS= read -r line; do
        if [ "$stop" = true ]; then
            break
        fi
        # Remove tabs
        line="${line//$'\t'/}"
        # Remove leading spaces
        line="${line#"${line%%[![:space:]]*}"}"
        # Remove trailing spaces
        line="${line%"${line##*[![:space:]]}"}"
        # Skip lines that are empty or contain only spaces
        if [[ -z "$line" ]]; then
            continue
        fi
        # Check for stop words
        for stop_word in "${stop_words[@]}"; do
            if [[ "$line" == "$stop_word" ]]; then
                stop=true
                break
            fi
        done
        if [ "$stop" = true ]; then
            break
        fi
        # Skip lines that are comments
        if [[ "$line" =~ ^($comment_pattern) ]]; then
            continue
        fi
        # Write the processed line to out.txt
        echo "$line" >> out.txt
    done < "$filepath"
done < <(eval "$find_cmd")
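The script above assembles the find command as a string and runs it through eval. As an alternative sketch, not the author's method, the same search can be expressed with Bash arrays, which avoids eval and the extra layer of quoting. The function name list_project_files is hypothetical; the array names and values are the ones from the article.

```shell
#!/bin/bash
# Same configuration arrays as in the article
folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")
extensions_to_search=("rs")
filenames_to_search=("Cargo.toml" "core.rs" "text.rs" "json.rs")

# Hypothetical variant of the file search: builds find arguments as arrays,
# so no eval is needed. Prints matching paths, one per line.
list_project_files() {
    local -a prune_expr=() name_expr=()
    local dir ext fname

    # One "-o -path ./dir" triple per ignored folder
    for dir in "${folders_to_ignore[@]}"; do
        prune_expr+=(-o -path "./$dir")
    done
    # One "-o -name pattern" triple per extension and per filename
    for ext in "${extensions_to_search[@]}"; do
        name_expr+=(-o -name "*.$ext")
    done
    for fname in "${filenames_to_search[@]}"; do
        name_expr+=(-o -name "$fname")
    done

    # Nothing to search for: print nothing
    if [ ${#name_expr[@]} -eq 0 ]; then
        return 0
    fi
    # Drop the leading -o from the name expression
    name_expr=("${name_expr[@]:1}")

    if [ ${#prune_expr[@]} -gt 0 ]; then
        prune_expr=("${prune_expr[@]:1}")
        find . \( "${prune_expr[@]}" \) -prune -o \
               \( "${name_expr[@]}" \) -type f -print
    else
        find . \( "${name_expr[@]}" \) -type f -print
    fi
}
```

Called from the project root, list_project_files prints the same kind of path list as the eval-based command, so its output could feed the same `while IFS= read -r filepath` loop via `done < <(list_project_files)`.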

PowerShell version of the script:

# Remove existing out.txt if it exists
if (Test-Path -Path "out.txt") {
    Remove-Item -Path "out.txt" -Force
}

# Define arrays

# Folders and files to ignore during the search
$foldersToIgnore = @("target", ".git", ".github", ".gitignore", ".idea")

# File extensions to search for
$extensionsToSearch = @("rs")

# Specific filenames to search for
$filenamesToSearch = @("Cargo.toml", "core.rs", "text.rs", "json.rs")

# Characters that denote comments in the files
$commentChars = @("#", "//", "/*")

# Words that, when encountered, will stop processing the current file
$stopWords = @("#[cfg(test)]")

# Function to build file filtering based on provided criteria
function Get-FilteredFiles {
    param (
        [string[]]$IgnoreFolders,
        [string[]]$Extensions,
        [string[]]$Filenames
    )

    # Build a regex pattern for ignored folders
    if ($IgnoreFolders.Count -gt 0) {
        $ignorePattern = ($IgnoreFolders | ForEach-Object { [regex]::Escape($_) }) -join '|'
    } else {
        $ignorePattern = ""
    }

    # Build a list of filters for extensions and filenames
    $nameFilters = @()
    foreach ($ext in $Extensions) {
        $nameFilters += "*.$ext"
    }
    foreach ($fname in $Filenames) {
        $nameFilters += $fname
    }

    # Get all files with the specified extensions or filenames
    Get-ChildItem -Path . -Recurse -File -Include $nameFilters | Where-Object {
        if ($ignorePattern) {
            # Check if the full path contains any of the ignored folders
            -not ($_.FullName -match "\\($ignorePattern)\\")
        } else {
            $true
        }
    }
}

# Build a regex pattern for comments
$escapedCommentChars = $commentChars | ForEach-Object { [regex]::Escape($_) }
$commentPattern = $escapedCommentChars -join '|'

# Get the list of files to process
$files = Get-FilteredFiles -IgnoreFolders $foldersToIgnore -Extensions $extensionsToSearch -Filenames $filenamesToSearch

# Process each file
foreach ($file in $files) {
    # Add file header to out.txt
    "`n#### $($file.FullName) ####" | Out-File -FilePath "out.txt" -Append -Encoding utf8

    $stop = $false

    # Read the file line by line
    Get-Content -Path $file.FullName | ForEach-Object {
        if ($stop) {
            return
        }

        $line = $_

        # Remove tabs
        $line = $line -replace "`t", ""

        # Trim leading and trailing spaces
        $line = $line.Trim()

        # Skip empty lines
        if ([string]::IsNullOrWhiteSpace($line)) {
            return
        }

        # Check for stop words
        foreach ($stopWord in $stopWords) {
            if ($line -eq $stopWord) {
                $stop = $true
                break
            }
        }
        if ($stop) {
            return
        }

        # Skip lines that are comments
        if ($line -match "^($commentPattern)") {
            return
        }

        # Write the processed line to out.txt
        $line | Out-File -FilePath "out.txt" -Append -Encoding utf8
    }
}

P.S.

Copy the contents of out.txt to the clipboard and paste them as text into the LLM's input box. Do not attach out.txt to your question as a file. As an optimization, LLM services usually process attached files by extracting a summary and then answer the question based on that summary. In other words, if you paste the contents of out.txt into the input box and then ask your question, the model answers based on the entire contents of out.txt.
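Copying the file to the clipboard can also be done from the terminal. The sketch below is an illustration under assumptions: the function names find_clipboard_tool and copy_to_clipboard are hypothetical, and it assumes one of the common clipboard tools is installed (pbcopy on macOS, wl-copy on Wayland, xclip on X11, clip.exe on Windows or WSL).

```shell
#!/bin/bash
# Hypothetical helper: print the name of the first tool from the argument
# list that is available on this machine; return 1 if none is found.
find_clipboard_tool() {
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: copy the contents of a file to the clipboard.
copy_to_clipboard() {
    local file="$1" tool
    if ! tool="$(find_clipboard_tool pbcopy wl-copy xclip clip.exe)"; then
        echo "No clipboard tool found; open $file and copy it manually" >&2
        return 1
    fi
    case "$tool" in
        # xclip writes to the primary selection unless told otherwise
        xclip) xclip -selection clipboard < "$file" ;;
        # The other tools read the text to copy from stdin
        *)     "$tool" < "$file" ;;
    esac
}
```

With these defined, `copy_to_clipboard out.txt` places the whole file on the clipboard, ready to be pasted into the input box.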

The source code of the scripts is on GitHub. If you have improvements, open a pull request.

Link to the original article: https://habr.com/ru/articles/843274/
