Rules and Filters
This guide explains how to configure share rules to control which files are processed, how they are filtered, and what happens after processing.
Overview
Share rules allow you to customize file processing behavior on a per-share basis. Rules are configured in the rules object when creating or updating a share via the API.
Key capabilities:
- File pattern filtering - Include or exclude files based on filename patterns
- Size filtering - Process only files within specified size ranges
- Date filtering - Filter files by creation, modification, or access dates
- Rolling date windows - Dynamic date filters that update automatically
- Content persistence - Control whether file content is retained after upload
- Upload control - Enable or disable Microsoft Graph/Copilot integration
- ACL overrides - Bypass file-level permissions (see ACL Override Guide)
- NER analysis - Enable entity extraction (see NER Feature Guide - COMING SOON)
Quick Start
Basic Share with Default Rules
POST /shares
{
"share_path": "\\\\server\\share",
"username": "domain\\user",
"password": "password",
"realm": "DOMAIN.COM",
"use_kerberos": "required"
}
Default rules are applied automatically:
- All files included (no pattern filtering)
- Max file size: 1 GB
- Min file size: 0 bytes
- Content persisted after upload
- Microsoft Graph upload enabled
Share with Custom Rules
POST /shares
{
"share_path": "\\\\server\\documents",
"username": "domain\\user",
"password": "password",
"realm": "DOMAIN.COM",
"use_kerberos": "required",
"rules": {
"include_patterns": ["*.pdf", "*.docx", "*.xlsx"],
"max_file_size": 104857600,
"modified_within_days": 90,
"persist_file_content": false
}
}
Pattern Filtering
Pattern filtering allows you to control which files are processed based on their filenames and paths.
Include Patterns
Use include_patterns to specify which files should be processed. All other files are ignored.
{
"rules": {
"include_patterns": ["*.pdf", "*.docx", "*.xlsx", "*.pptx"]
}
}
Pattern syntax:
| Pattern | Matches |
|---|---|
| *.pdf | All PDF files |
| *.doc* | Files ending in .doc, .docx, .docm, etc. |
| report_* | Files starting with "report_" |
| **/reports/** | Any file in a "reports" directory at any depth |
| **/2024/*.pdf | PDF files in any "2024" directory |
Examples:
// Only process Office documents
{
"rules": {
"include_patterns": ["*.pdf", "*.docx", "*.xlsx", "*.pptx"]
}
}
// Only process files in specific directories
{
"rules": {
"include_patterns": ["**/contracts/**", "**/invoices/**", "**/reports/**"]
}
}
// Combine file types and paths
{
"rules": {
"include_patterns": ["**/legal/*.pdf", "**/finance/*.xlsx", "*.docx"]
}
}
Exclude Patterns
Use exclude_patterns to specify which files should be ignored. All other files are processed.
{
"rules": {
"exclude_patterns": ["*.tmp", "*.bak", "~$*", ".git/*"]
}
}
Common exclusion patterns:
| Pattern | Purpose |
|---|---|
| *.tmp, *.bak | Temporary and backup files |
| ~$* | Office temporary files |
| .git/*, .svn/* | Version control directories |
| **/node_modules/** | Node.js dependencies |
| **/cache/** | Cache directories |
| Thumbs.db, .DS_Store | System files |
Examples:
// Exclude temporary and system files
{
"rules": {
"exclude_patterns": ["*.tmp", "*.bak", "~$*", "Thumbs.db", ".DS_Store"]
}
}
// Exclude specific directories
{
"rules": {
"exclude_patterns": ["**/archive/**", "**/backup/**", "**/temp/**"]
}
}
Important: Mutual Exclusivity
⚠️ Warning: You cannot use both include_patterns and exclude_patterns in the same share.
Choose one approach:
- Use include_patterns when you want to process only specific file types
- Use exclude_patterns when you want to process most files but skip certain ones
// ❌ INVALID - will return an error
{
"rules": {
"include_patterns": ["*.pdf"],
"exclude_patterns": ["*.tmp"]
}
}
// ✅ VALID - use include patterns only
{
"rules": {
"include_patterns": ["*.pdf"]
}
}
Pattern Matching Behavior
- Case insensitive: *.PDF and *.pdf match the same files
- Path separators: Use ** for recursive matching across directories
- Wildcards: * matches any characters, ? matches a single character
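The matching rules above can be approximated locally to check a pattern before saving it to a share. The following is a minimal sketch built on Python's standard fnmatch module; the function name matches is hypothetical, and the service's real matcher may differ. In particular, fnmatch lets * cross directory separators, so this sketch is looser than a strict glob. It also assumes that patterns without a slash are tested against the filename only.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(pattern: str, relative_path: str) -> bool:
    """Approximate the guide's matching rules (not the service's real matcher)."""
    # Case insensitive: compare lowercased pattern and path.
    pat = pattern.lower()
    path = relative_path.replace("\\", "/").lower()
    if "/" in pat:
        # Path patterns are tested against the whole relative path.
        # The variant with a leading "/" lets "**/x/**" match a top-level "x" directory.
        return fnmatchcase("/" + path, pat) or fnmatchcase(path, pat)
    # Assumption: bare patterns such as "*.pdf" apply to the filename only.
    return fnmatchcase(PurePosixPath(path).name, pat)
```

For example, matches("*.PDF", "docs/Report.pdf") and matches("**/reports/**", "reports/q1.pdf") both evaluate to True under these assumptions.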
Size Filtering
Control which files are processed based on their size.
Maximum File Size
{
"rules": {
"max_file_size": 104857600
}
}
Files larger than this size (in bytes) are skipped.
Common size values:
| Size | Bytes |
|---|---|
| 1 MB | 1048576 |
| 10 MB | 10485760 |
| 50 MB | 52428800 |
| 100 MB | 104857600 |
| 500 MB | 524288000 |
| 1 GB | 1073741824 (default) |
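The byte values in this table use binary units (1 MB = 1024 × 1024 bytes). A small helper avoids hand-computing them; the names below are illustrative and not part of the API.

```python
KIB = 1024          # bytes in 1 KB, as used by the table above
MIB = 1024 * KIB    # bytes in 1 MB
GIB = 1024 * MIB    # bytes in 1 GB (the documented default maximum)


def megabytes(n: int) -> int:
    """Convert a size in MB to the byte count expected by max_file_size."""
    return n * MIB


rules = {"max_file_size": megabytes(100)}  # 104857600
```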
Minimum File Size
{
"rules": {
"min_file_size": 1024
}
}
Files smaller than this size (in bytes) are skipped. Useful for excluding empty or near-empty files.
Combined Size Filtering
{
"rules": {
"min_file_size": 1024,
"max_file_size": 52428800
}
}
This processes only files between 1 KB and 50 MB.
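Because files larger than max_file_size and files smaller than min_file_size are skipped, both bounds are inclusive. A sketch of the equivalent check, using a hypothetical helper name and the documented defaults:

```python
def size_allowed(size_bytes: int,
                 min_file_size: int = 0,
                 max_file_size: int = 1073741824) -> bool:
    """Return True when a file of this size would pass the size filter."""
    # Only sizes strictly outside the range are skipped, so both ends are inclusive.
    return min_file_size <= size_bytes <= max_file_size
```

With the combined rule above, a file of exactly 1024 bytes passes while a file of 1023 bytes is skipped.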
Date Filtering
Filter files based on their timestamps. Two approaches are available:
- Static dates - Fixed date/time boundaries
- Rolling windows - Dynamic windows that update automatically
Static Date Filters
Use ISO 8601 datetime format for precise date boundaries.
{
"rules": {
"created_at_min": "2024-01-01T00:00:00Z",
"created_at_max": "2024-12-31T23:59:59Z",
"modified_time_min": "2024-06-01T00:00:00Z",
"modified_time_max": "2024-12-31T23:59:59Z",
"accessed_at_min": "2024-01-01T00:00:00Z",
"accessed_at_max": "2024-12-31T23:59:59Z"
}
}
Available static filters:
| Filter | Description |
|---|---|
| created_at_min | Only files created at or after this date |
| created_at_max | Only files created before or at this date |
| modified_time_min | Only files modified at or after this date |
| modified_time_max | Only files modified before or at this date |
| accessed_at_min | Only files accessed at or after this date |
| accessed_at_max | Only files accessed before or at this date |
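The table describes inclusive boundaries ("at or after", "before or at"). The sketch below shows how such a check could be reproduced when testing filter values; the helper names are hypothetical and the service's own parsing is not documented here.

```python
from datetime import datetime
from typing import Optional


def parse_iso8601(value: str) -> datetime:
    """Parse a timestamp such as 2024-01-01T00:00:00Z into an aware datetime."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def within_static_bounds(timestamp: datetime,
                         minimum: Optional[str],
                         maximum: Optional[str]) -> bool:
    """Apply an inclusive *_min / *_max pair; None means no filter."""
    if minimum is not None and timestamp < parse_iso8601(minimum):
        return False
    if maximum is not None and timestamp > parse_iso8601(maximum):
        return False
    return True


# A file modified exactly on the lower boundary is still included.
modified = parse_iso8601("2024-06-01T00:00:00Z")
included = within_static_bounds(modified, "2024-06-01T00:00:00Z", "2024-12-31T23:59:59Z")
```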
Rolling Window Filters
Rolling windows automatically update based on the current date, making them ideal for ongoing synchronization.
{
"rules": {
"modified_within_days": 30
}
}
This processes only files modified in the last 30 days. The window moves forward automatically with each crawl.
Available rolling filters:
| Filter | Description |
|---|---|
| created_within_days | Files created within the last N days |
| created_within_months | Files created within the last N months |
| created_within_years | Files created within the last N years |
| modified_within_days | Files modified within the last N days |
| modified_within_months | Files modified within the last N months |
| modified_within_years | Files modified within the last N years |
| accessed_within_days | Files accessed within the last N days |
| accessed_within_months | Files accessed within the last N months |
| accessed_within_years | Files accessed within the last N years |
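A rolling window is a cutoff recomputed from the crawl time. The sketch below illustrates this for the *_within_days filters only. It assumes the boundary is inclusive and that the cutoff is measured from the moment the crawl runs; this guide does not specify either detail, nor how months and years are counted. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rolling_cutoff(days: int, now: datetime) -> datetime:
    """Earliest timestamp still inside an N-day window ending at `now`."""
    return now - timedelta(days=days)


def modified_within_days(modified: datetime, days: int, now: datetime) -> bool:
    # Assumption: a file exactly on the cutoff is still included.
    return modified >= rolling_cutoff(days, now)


# The same rule yields a later cutoff on each subsequent crawl.
crawl_time = datetime(2024, 7, 1, tzinfo=timezone.utc)
cutoff = rolling_cutoff(30, crawl_time)
```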
Examples:
// Process files modified in the last 90 days
{
"rules": {
"modified_within_days": 90
}
}
// Process files created in the last 2 years
{
"rules": {
"created_within_years": 2
}
}
// Process recently accessed files (last 6 months)
{
"rules": {
"accessed_within_months": 6
}
}
Important: Static vs Rolling
⚠️ Warning: You cannot combine static dates and rolling windows for the same timestamp type.
// ❌ INVALID - mixing static and rolling for modified_time
{
"rules": {
"modified_time_min": "2024-01-01T00:00:00Z",
"modified_within_days": 30
}
}
// ✅ VALID - use one approach per timestamp type
{
"rules": {
"modified_within_days": 30,
"created_at_min": "2024-01-01T00:00:00Z"
}
}
Content Persistence
Control whether extracted file content is retained in the database after successful upload to Microsoft Graph.
persist_file_content
{
"rules": {
"persist_file_content": true
}
}
| Value | Behavior |
|---|---|
| true (default) | Keep extracted content in database after Graph upload |
| false | Clear content from database after successful Graph upload |
Use cases for false:
- Reduce database storage requirements
- Comply with data retention policies
- Minimize data exposure risk
📘 Note: The PERSIST_FILE_CONTENT_OVERRIDE environment variable can override this setting at the deployment level for security purposes.
Upload Control
Control whether files are uploaded to Microsoft Graph/Copilot.
enable_copilot_upload
{
"rules": {
"enable_copilot_upload": true
}
}
| Value | Behavior |
|---|---|
| true (default) | Upload files to Microsoft Graph for Copilot integration |
| false | Store files in local database only, no Graph upload |
Use cases for false:
- Local-only deployments without Microsoft 365
- Testing and development environments
- Customers who want direct database access without Copilot
Complete Rules Reference
All Available Rules
The listing below shows every field for reference. It is not a valid request as written, because it combines options that are mutually exclusive (include and exclude patterns, and static and rolling filters for the same timestamp type).
{
"rules": {
// Pattern filtering (mutually exclusive)
"exclude_patterns": ["*.tmp", "*.bak"],
"include_patterns": ["*.pdf", "*.docx"],
// Size filtering
"max_file_size": 1073741824,
"min_file_size": 0,
// Static date filters
"created_at_min": "2024-01-01T00:00:00Z",
"created_at_max": "2024-12-31T23:59:59Z",
"modified_time_min": "2024-01-01T00:00:00Z",
"modified_time_max": "2024-12-31T23:59:59Z",
"accessed_at_min": "2024-01-01T00:00:00Z",
"accessed_at_max": "2024-12-31T23:59:59Z",
// Rolling date filters
"created_within_days": 30,
"created_within_months": 6,
"created_within_years": 2,
"modified_within_days": 30,
"modified_within_months": 6,
"modified_within_years": 2,
"accessed_within_days": 30,
"accessed_within_months": 6,
"accessed_within_years": 2,
// Content and upload control
"persist_file_content": true,
"enable_copilot_upload": true,
// ACL overrides (see acl-override-guide)
"acl_override_mode": "everyone",
"acl_override_principals": [],
// NER analysis (see ner-feature-guide)
"enable_ner_analysis": false,
"ner_schema": "default",
"ner_entity_types": ["person", "organization", "location"],
"ner_classifications": {},
"ner_structured_extraction": {},
"ner_confidence_threshold": 0.7
}
}
Default Values
| Rule | Default Value |
|---|---|
| exclude_patterns | [] (empty) |
| include_patterns | [] (empty) |
| max_file_size | 1073741824 (1 GB) |
| min_file_size | 0 |
| created_at_min/max | null (no filter) |
| modified_time_min/max | null (no filter) |
| accessed_at_min/max | null (no filter) |
| *_within_days/months/years | null (no filter) |
| persist_file_content | true |
| enable_copilot_upload | true |
| acl_override_mode | null (use file-level ACLs) |
| enable_ner_analysis | false |
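To preview which values apply when a request omits some rules, the defaults in this table can be overlaid with the submitted rules. This is a client-side sketch with hypothetical names. It only illustrates the documented defaults and does not describe how the server stores or merges rules.

```python
import copy

# Defaults taken from the table above; date filters default to null (no filter).
DEFAULT_RULES = {
    "exclude_patterns": [],
    "include_patterns": [],
    "max_file_size": 1073741824,
    "min_file_size": 0,
    "persist_file_content": True,
    "enable_copilot_upload": True,
    "acl_override_mode": None,
    "enable_ner_analysis": False,
}


def effective_rules(submitted: dict) -> dict:
    """Overlay submitted rules on the documented defaults."""
    merged = copy.deepcopy(DEFAULT_RULES)  # avoid mutating the shared defaults
    merged.update(submitted)
    return merged


preview = effective_rules({"max_file_size": 104857600, "modified_within_days": 90})
```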
Updating Rules
Use the PATCH endpoint to update rules on existing shares.
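The update and re-crawl calls described in this section can be scripted. The sketch below only builds the requests with Python's standard library and does not send them. The host name and share ID are placeholders, and any authentication your deployment requires is omitted because it is not covered in this guide.

```python
import json
from urllib.request import Request

BASE_URL = "https://crawler.example.com"  # hypothetical host, replace with your own


def build_rules_patch(share_id: str, rules: dict) -> Request:
    """Build PATCH /shares/{share_id} with a JSON body containing the rules object."""
    body = json.dumps({"rules": rules}).encode("utf-8")
    return Request(
        f"{BASE_URL}/shares/{share_id}",
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


def build_recrawl(share_id: str) -> Request:
    """Build POST /shares/{share_id}/crawl to apply changed rules."""
    return Request(f"{BASE_URL}/shares/{share_id}/crawl", method="POST")


patch = build_rules_patch("share-123", {"modified_within_days": 60})
recrawl = build_recrawl("share-123")
```

Passing each request to urllib.request.urlopen would send it.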
Add or Modify Rules
PATCH /shares/{share_id}
{
"rules": {
"include_patterns": ["*.pdf", "*.docx"],
"modified_within_days": 60
}
}
Remove a Rule
Set the rule to its default value or null:
PATCH /shares/{share_id}
{
"rules": {
"include_patterns": [],
"modified_within_days": null
}
}
Trigger Re-crawl After Rule Changes
After changing rules, trigger a re-crawl to apply the new filters:
POST /shares/{share_id}/crawl
Common Configurations
Office Documents Only
{
"rules": {
"include_patterns": [
"*.pdf",
"*.doc",
"*.docx",
"*.xls",
"*.xlsx",
"*.ppt",
"*.pptx"
],
"max_file_size": 104857600
}
}
Recent Files with Storage Optimization
{
"rules": {
"modified_within_months": 6,
"max_file_size": 52428800,
"persist_file_content": false
}
}
Development/Test Environment
{
"rules": {
"include_patterns": ["*.pdf"],
"max_file_size": 10485760,
"enable_copilot_upload": false
}
}
Legal/Compliance Documents
{
"rules": {
"include_patterns": ["**/contracts/**", "**/legal/**", "**/compliance/**"],
"created_within_years": 7,
"persist_file_content": true
}
}
Exclude Temporary and System Files
{
"rules": {
"exclude_patterns": [
"*.tmp",
"*.bak",
"*.swp",
"~$*",
"Thumbs.db",
".DS_Store",
"desktop.ini",
"**/node_modules/**",
"**/.git/**",
"**/cache/**"
]
}
}
Troubleshooting
Files Not Being Processed
- Check pattern matching: Verify your patterns match the expected files
- Check size limits: Ensure files are within min/max size range
- Check date filters: Verify files fall within date boundaries
- View share rules: GET /shares/{share_id} to confirm rules are saved
Too Many Files Being Processed
- Add include patterns: Narrow down to specific file types
- Add date filters: Limit to recent files
- Reduce max size: Skip large files
Rules Not Taking Effect
- Verify rules saved: GET /shares/{share_id} and check the rules field
- Trigger re-crawl: POST /shares/{share_id}/crawl
- Check logs: Look for filtering messages in application logs
Validation Errors
| Error | Cause | Solution |
|---|---|---|
| "Cannot specify both include_patterns and exclude_patterns" | Both pattern types specified | Use only one pattern type |
| "Cannot specify both X and static Y" | Mixed rolling and static dates | Use only one date filter type per timestamp |
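Both validation errors can be caught before a request is sent. The following is a client-side sketch with a hypothetical function name. It assumes an empty pattern list or a null value counts as "not specified", and it reuses the message wording from the table above. The server's exact messages and checks may differ.

```python
def validate_rules(rules: dict) -> list:
    """Return the validation errors a rules object would likely trigger."""
    errors = []
    # Assumption: an empty list is treated the same as an omitted field.
    if rules.get("include_patterns") and rules.get("exclude_patterns"):
        errors.append("Cannot specify both include_patterns and exclude_patterns")

    # Static field prefixes differ per timestamp type (created_at, modified_time, accessed_at).
    static_prefix = {
        "created": "created_at",
        "modified": "modified_time",
        "accessed": "accessed_at",
    }
    for kind, prefix in static_prefix.items():
        rolling = [f"{kind}_within_{unit}" for unit in ("days", "months", "years")
                   if rules.get(f"{kind}_within_{unit}") is not None]
        static = [f"{prefix}_{bound}" for bound in ("min", "max")
                  if rules.get(f"{prefix}_{bound}") is not None]
        if rolling and static:
            errors.append(f"Cannot specify both {rolling[0]} and static {static[0]}")
    return errors
```

An empty result means neither documented conflict was found; mixing a rolling filter for one timestamp type with a static filter for another remains valid.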