Env variable fetch from secret manager #1178

Open
wants to merge 12 commits into base: dev
Changes from 2 commits
19 changes: 9 additions & 10 deletions README.md
@@ -118,24 +118,23 @@ Allow unauthenticated request : Yes
|-------------------------|--------------------|---------------|--------------------------------------------------------------------------------------------------|
| |
| **BACKEND ENV**
| OPENAI_API_KEY | Mandatory | |An OpenAI key is required to use the OpenAI LLM model, to authenticate and track requests |
| OPENAI_API_KEY | Optional | |An OpenAI key to use in case of OpenAI embeddings |
| DIFFBOT_API_KEY | Mandatory | |API key required to use Diffbot's NLP service to extract entities and relationships from unstructured data|
| BUCKET | Mandatory | |bucket name to store uploaded file on GCS |
| BUCKET_UPLOAD_FILE | Optional | |bucket name to store uploaded file on GCS |
| BUCKET_FAILED_FILE | Optional | |bucket name to store failed file on GCS while extraction |
| NEO4J_USER_AGENT | Optional | llm-graph-builder | Name of the user agent to track neo4j database activity |
| ENABLE_USER_AGENT | Optional | true | Boolean value to enable/disable neo4j user agent |
| DUPLICATE_TEXT_DISTANCE | Mandatory | 5 | Value used to find the distance for all node pairs in the graph, calculated based on node properties |
| DUPLICATE_SCORE_VALUE | Mandatory | 0.97 | Node score value used to match duplicate nodes |
| EFFECTIVE_SEARCH_RATIO | Mandatory | 1 | |
| GRAPH_CLEANUP_MODEL | Optional | 0.97 | Model name to clean-up graph in post processing |
| DUPLICATE_TEXT_DISTANCE | Optional | 5 | Value used to find the distance for all node pairs in the graph, calculated based on node properties |
| DUPLICATE_SCORE_VALUE | Optional | 0.97 | Node score value used to match duplicate nodes |
| EFFECTIVE_SEARCH_RATIO | Optional | 5 | |
| GRAPH_CLEANUP_MODEL | Optional | openai_gpt_4o | Model name to clean-up graph in post processing |
| MAX_TOKEN_CHUNK_SIZE | Optional | 10000 | Maximum token size to process file content |
| YOUTUBE_TRANSCRIPT_PROXY| Optional | | Proxy key to process YouTube videos for fetching transcripts |
| YOUTUBE_TRANSCRIPT_PROXY| Mandatory | | Proxy key required to process YouTube videos for fetching transcripts |
| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2 , openai , vertexai) |
| IS_EMBEDDING | Optional | true | Flag to enable text embedding |
| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm |
| KNN_MIN_SCORE | Optional | 0.8 | Minimum score for KNN algorithm |
| GEMINI_ENABLED | Optional | False | Flag to enable Gemini |
| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs |
| NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 5 | Number of chunks to combine when processing embeddings |
| UPDATE_GRAPH_CHUNKS_PROCESSED | Optional | 20 | Number of chunks processed before updating progress |
| NEO4J_URI | Optional | neo4j://database:7687 | URI for Neo4j database |
| NEO4J_USERNAME | Optional | neo4j | Username for Neo4j database |
| NEO4J_PASSWORD | Optional | password | Password for Neo4j database |
4 changes: 0 additions & 4 deletions backend/README.md
@@ -62,10 +62,6 @@ Update the environment variable in `.env` file. Refer example.env in backend fol

`NEO4J_PASSWORD` : Neo4j database user password

`AWS_ACCESS_KEY_ID` : AWS Access key ID

`AWS_SECRET_ACCESS_KEY` : AWS secret access key


## Contact
For questions or support, feel free to contact us at [email protected] or [email protected]
66 changes: 32 additions & 34 deletions backend/example.env
@@ -1,32 +1,35 @@
OPENAI_API_KEY = "" #This is required if you are using openai embedding model
EMBEDDING_MODEL = "all-MiniLM-L6-v2" #this can be openai or vertexai or by default all-MiniLM-L6-v2
RAGAS_EMBEDDING_MODEL = "openai" #Keep blank if you want to use all-MiniLM-L6-v2 for ragas embeddings
IS_EMBEDDING = "TRUE"
KNN_MIN_SCORE = "0.94"
# Enable Gemini (default is False) | Can be False or True
GEMINI_ENABLED = False
# Enable Google Cloud logs (default is False) | Can be False or True
GCP_LOG_METRICS_ENABLED = False
NUMBER_OF_CHUNKS_TO_COMBINE = 6
UPDATE_GRAPH_CHUNKS_PROCESSED = 20
NEO4J_URI = ""
NEO4J_USERNAME = ""
NEO4J_PASSWORD = ""
NEO4J_DATABASE = ""
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
LANGCHAIN_API_KEY = ""
LANGCHAIN_PROJECT = ""
LANGCHAIN_TRACING_V2 = ""
LANGCHAIN_ENDPOINT = ""
GCS_FILE_CACHE = "" #save the file into GCS or local, Should be True or False
NEO4J_USER_AGENT=""
ENABLE_USER_AGENT = ""
GET_VALUE_FROM_SECRET_MANAGER= "" #OPTIONAL- Default_Value = False -- Set True to read variable values from Secret Manager when available, falling back to env otherwise.
OPENAI_API_KEY = "" #OPTIONAL- Default_Value = "openai_api_key" #This is required if you are using openai embedding model
EMBEDDING_MODEL = "" #OPTIONAL- Default_Value ="" #this can be openai or vertexai or by default all-MiniLM-L6-v2
RAGAS_EMBEDDING_MODEL = "" #OPTIONAL- Default_Value ="openai" #Keep blank if you want to use all-MiniLM-L6-v2 for ragas embeddings
IS_EMBEDDING = "" #OPTIONAL- Default_Value ="True" --Flag to enable text embedding
BUCKET_UPLOAD_FILE = "" #OPTIONAL- Default_Value ="gcs bucket name" -- GCS bucket used to upload local files to GCS
BUCKET_FAILED_FILE = "" #OPTIONAL- Default_Value ="gcs bucket name" -- GCS bucket used for files that failed during extraction
KNN_MIN_SCORE = "" #OPTIONAL- Default_Value ="0.8" --Minimum score for KNN algorithm
GEMINI_ENABLED = "" #OPTIONAL- Default_Value ="False"-- Enable Gemini can be False or True
GCP_LOG_METRICS_ENABLED = "" #OPTIONAL- Default_Value = "False" -- Enable logging metrics to GCP Cloud Logging
NEO4J_URI = "" #OPTIONAL- Default_Value ="Neo4j URL"
NEO4J_USERNAME = "" #OPTIONAL- Default_Value = "Neo4J database username"
NEO4J_PASSWORD = "" #OPTIONAL- Default_Value = "Neo4j database user password"
NEO4J_DATABASE = "" #OPTIONAL- Default_Value = "Neo4j database name"
LANGCHAIN_API_KEY ="" #OPTIONAL- Default_Value = "API key for Langchain"
LANGCHAIN_PROJECT ="" #OPTIONAL- Default_Value = "Project for Langchain "
LANGCHAIN_TRACING_V2 = "" #OPTIONAL- Default_Value = "Flag to enable Langchain tracing "
LANGCHAIN_ENDPOINT = "" #OPTIONAL- Default_Value = "https://api.smith.langchain.com" -- Endpoint for Langchain API
GCS_FILE_CACHE = "" #OPTIONAL- Default_Value = "False" #save the file into GCS or local, Should be True or False
NEO4J_USER_AGENT="" #OPTIONAL- Default_Value = "LLM-Graph-Builder"
ENABLE_USER_AGENT = "" #OPTIONAL- Default_Value = "False"
MAX_TOKEN_CHUNK_SIZE="" #OPTIONAL- Default_Value = "10000" #Max token used to process/extract the file content.
ENTITY_EMBEDDING="" #OPTIONAL- Default_Value = "False"-- Value based on whether to create embeddings for entities suitable for entity vector mode
DUPLICATE_SCORE_VALUE = "" #OPTIONAL- Default_Value = "0.97" -- Node score value to match duplicate node
DUPLICATE_TEXT_DISTANCE = "" #OPTIONAL- Default_Value = "3" --Value used to find the distance for all node pairs in the graph, calculated based on node properties
DEFAULT_DIFFBOT_CHAT_MODEL="" #OPTIONAL- Default_Value = "openai_gpt_4o" #whichever model is specified here needs a config entry in the format below
GRAPH_CLEANUP_MODEL="" #OPTIONAL- Default_Value = "openai_gpt_4o" -- Model name used to clean up the graph in post-processing
BEDROCK_EMBEDDING_MODEL="" #Mandatory - Default_Value = "model_name,aws_access_key,aws_secret_key,region_name" -- Required if you want to use Bedrock embeddings #model_name="amazon.titan-embed-text-v1"
YOUTUBE_TRANSCRIPT_PROXY="" #Mandatory --Proxy key required to process YouTube videos for fetching transcripts --Sample Value ="https://user:pass@domain:port"
EFFECTIVE_SEARCH_RATIO="" #OPTIONAL- Default_Value = "2"

LLM_MODEL_CONFIG_model_version=""
ENTITY_EMBEDDING="TRUE" # TRUE or FALSE based on whether to create embeddings for entities suitable for entity vector mode
DUPLICATE_SCORE_VALUE =0.97
DUPLICATE_TEXT_DISTANCE =3
DEFAULT_DIFFBOT_CHAT_MODEL="openai_gpt_4o" #whichever model is specified here needs a config entry in the format below
#examples
LLM_MODEL_CONFIG_openai_gpt_3.5="gpt-3.5-turbo-0125,openai_api_key"
LLM_MODEL_CONFIG_openai_gpt_4o_mini="gpt-4o-mini-2024-07-18,openai_api_key"
@@ -43,13 +46,8 @@ LLM_MODEL_CONFIG_anthropic_claude_3_5_sonnet="model_name,anthropic_api_key"
LLM_MODEL_CONFIG_fireworks_llama_v3_70b="model_name,fireworks_api_key"
LLM_MODEL_CONFIG_bedrock_claude_3_5_sonnet="model_name,aws_access_key_id,aws_secret_access_key,region_name"
LLM_MODEL_CONFIG_ollama_llama3="model_name,model_local_url"
YOUTUBE_TRANSCRIPT_PROXY="https://user:pass@domain:port"
EFFECTIVE_SEARCH_RATIO=5
GRAPH_CLEANUP_MODEL="openai_gpt_4o"
BEDROCK_EMBEDDING_MODEL="model_name,aws_access_key,aws_secret_key,region_name" #model_name="amazon.titan-embed-text-v1"
LLM_MODEL_CONFIG_bedrock_nova_micro_v1="model_name,aws_access_key,aws_secret_key,region_name" #model_name="amazon.nova-micro-v1:0"
LLM_MODEL_CONFIG_bedrock_nova_lite_v1="model_name,aws_access_key,aws_secret_key,region_name" #model_name="amazon.nova-lite-v1:0"
LLM_MODEL_CONFIG_bedrock_nova_pro_v1="model_name,aws_access_key,aws_secret_key,region_name" #model_name="amazon.nova-pro-v1:0"
LLM_MODEL_CONFIG_fireworks_deepseek_r1="model_name,fireworks_api_key" #model_name="accounts/fireworks/models/deepseek-r1"
LLM_MODEL_CONFIG_fireworks_deepseek_v3="model_name,fireworks_api_key" #model_name="accounts/fireworks/models/deepseek-v3"
MAX_TOKEN_CHUNK_SIZE=2000 #Max token used to process/extract the file content.
LLM_MODEL_CONFIG_fireworks_deepseek_v3="model_name,fireworks_api_key" #model_name="accounts/fireworks/models/deepseek-v3"
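The comments above describe the PR's central mechanism: every variable is read through a helper that consults Secret Manager when `GET_VALUE_FROM_SECRET_MANAGER` is enabled, then the environment, then a typed default. Call sites in the diff such as `get_value_from_env_or_secret_manager("GEMINI_ENABLED", False, "bool")` imply the signature. A minimal sketch of that fallback order, with the Secret Manager lookup stubbed out (names and behavior assumed from the diff, not confirmed against the actual implementation):

```python
import os

def _coerce(raw, value_type):
    # Coerce the raw string to the type hinted at the call site ("bool", "int", or plain string).
    if raw is None or raw == "":
        return None
    if value_type == "bool":
        return str(raw).lower() in ("true", "1", "yes")
    if value_type == "int":
        return int(raw) if str(raw).lstrip("-").isdigit() else None
    return raw

def fetch_secret(key):
    # Placeholder: the real PR would call google-cloud-secret-manager here.
    # Returns None when no secret of that name exists.
    return None

def get_value_from_env_or_secret_manager(key, default=None, value_type="str"):
    # Assumed precedence: Secret Manager (when enabled) > environment > typed default.
    use_secret_manager = os.environ.get("GET_VALUE_FROM_SECRET_MANAGER", "False").lower() == "true"
    raw = fetch_secret(key) if use_secret_manager else None
    if raw is None:
        raw = os.environ.get(key)
    value = _coerce(raw, value_type)
    return default if value is None else value
```

With this shape, `get_value_from_env_or_secret_manager("EFFECTIVE_SEARCH_RATIO", 5, "int")` returns 5 when the variable is unset or non-numeric, matching the defensive `isdigit()` check the diff replaces.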
1 change: 1 addition & 0 deletions backend/requirements.txt
@@ -61,3 +61,4 @@ rouge_score==0.1.2
langchain-neo4j==0.3.0
pypandoc-binary==1.15
chardet==5.2.0
google-cloud-secret-manager==2.23.1
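The new `google-cloud-secret-manager` dependency is what the lookup ultimately rests on. Its documented access pattern is sketched below; the project and secret IDs are hypothetical, and the import is kept inside the function so modules still load where the dependency or GCP credentials are absent:

```python
def secret_version_path(project_id, secret_id, version="latest"):
    # Resource name format defined by the Secret Manager API.
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"

def access_secret(project_id, secret_id, version="latest"):
    # Lazy import: only needed when Secret Manager is actually consulted.
    from google.cloud import secretmanager
    client = secretmanager.SecretManagerServiceClient()
    name = secret_version_path(project_id, secret_id, version)
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")
```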
16 changes: 8 additions & 8 deletions backend/score.py
@@ -112,7 +112,7 @@ async def __call__(self, scope: Scope, receive: Receive, send: Send):
)
app.add_middleware(SessionMiddleware, secret_key=os.urandom(24))

is_gemini_enabled = os.environ.get("GEMINI_ENABLED", "False").lower() in ("true", "1", "yes")
is_gemini_enabled = get_value_from_env_or_secret_manager("GEMINI_ENABLED", False, "bool")
if is_gemini_enabled:
add_routes(app,ChatVertexAI(), path="/vertexai")

@@ -381,7 +381,7 @@ async def post_processing(uri=Form(None), userName=Form(None), password=Form(Non
api_name = 'post_processing/enable_hybrid_search_and_fulltext_search_in_bloom'
logging.info(f'Full Text index created')

if os.environ.get('ENTITY_EMBEDDING','False').upper()=="TRUE" and "materialize_entity_similarities" in tasks:
if get_value_from_env_or_secret_manager("ENTITY_EMBEDDING",False,"bool") and "materialize_entity_similarities" in tasks:
await asyncio.to_thread(create_entity_embedding, graph)
api_name = 'post_processing/create_entity_embedding'
logging.info(f'Entity Embeddings created')
@@ -551,7 +551,7 @@ async def connect(uri=Form(None), userName=Form(None), password=Form(None), data
start = time.time()
graph = create_graph_database_connection(uri, userName, password, database)
result = await asyncio.to_thread(connection_check_and_get_vector_dimensions, graph, database)
gcs_file_cache = os.environ.get('GCS_FILE_CACHE')
gcs_file_cache = get_value_from_env_or_secret_manager("GCS_FILE_CACHE",False, "bool")
end = time.time()
elapsed_time = end - start
json_obj = {'api_name':'connect','db_url':uri, 'userName':userName, 'database':database, 'count':1, 'logging_time': formatted_time(datetime.now(timezone.utc)), 'elapsed_api_time':f'{elapsed_time:.2f}','email':email}
@@ -1034,11 +1034,11 @@ async def fetch_chunktext(
async def backend_connection_configuration():
try:
start = time.time()
uri = os.getenv('NEO4J_URI')
username= os.getenv('NEO4J_USERNAME')
database= os.getenv('NEO4J_DATABASE')
password= os.getenv('NEO4J_PASSWORD')
gcs_file_cache = os.environ.get('GCS_FILE_CACHE')
uri = get_value_from_env_or_secret_manager("NEO4J_URI")
username= get_value_from_env_or_secret_manager("NEO4J_USERNAME")
database= get_value_from_env_or_secret_manager("NEO4J_DATABASE")
password= get_value_from_env_or_secret_manager("NEO4J_PASSWORD")
gcs_file_cache = get_value_from_env_or_secret_manager("GCS_FILE_CACHE",False, "bool")
if all([uri, username, database, password]):
graph = Neo4jGraph()
logging.info(f'login connection status of object: {graph}')
8 changes: 4 additions & 4 deletions backend/src/QA_integration.py
@@ -33,11 +33,11 @@

# Local imports
from src.llm import get_llm
from src.shared.common_fn import load_embedding_model
from src.shared.common_fn import get_value_from_env_or_secret_manager, load_embedding_model
from src.shared.constants import *
load_dotenv()

EMBEDDING_MODEL = os.getenv('EMBEDDING_MODEL')
EMBEDDING_MODEL = get_value_from_env_or_secret_manager("EMBEDDING_MODEL")
EMBEDDING_FUNCTION , _ = load_embedding_model(EMBEDDING_MODEL)

class SessionChatHistory:
@@ -397,7 +397,7 @@
neo_db = initialize_neo4j_vector(graph, chat_mode_settings)
# document_names= list(map(str.strip, json.loads(document_names)))
search_k = chat_mode_settings["top_k"]
ef_ratio = int(os.getenv("EFFECTIVE_SEARCH_RATIO", "2")) if os.getenv("EFFECTIVE_SEARCH_RATIO", "2").isdigit() else 2
ef_ratio = get_value_from_env_or_secret_manager("EFFECTIVE_SEARCH_RATIO", 5, "int")
retriever = create_retriever(neo_db, document_names,chat_mode_settings, search_k, score_threshold,ef_ratio)
return retriever
except Exception as e:
@@ -410,10 +410,10 @@
start_time = time.time()
try:
if model == "diffbot":
model = os.getenv('DEFAULT_DIFFBOT_CHAT_MODEL')
model = get_value_from_env_or_secret_manager("DEFAULT_DIFFBOT_CHAT_MODEL","openai_gpt_4o")

llm, model_name = get_llm(model=model)
logging.info(f"Model called in chat: {model} (version: {model_name})")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

retriever = get_neo4j_retriever(graph=graph, chat_mode_settings=chat_mode_settings, document_names=document_names)
doc_retriever = create_document_retriever_chain(llm, retriever)
4 changes: 2 additions & 2 deletions backend/src/communities.py
@@ -5,7 +5,7 @@
from langchain_core.output_parsers import StrOutputParser
from concurrent.futures import ThreadPoolExecutor, as_completed
import os
from src.shared.common_fn import load_embedding_model
from src.shared.common_fn import get_value_from_env_or_secret_manager, load_embedding_model


COMMUNITY_PROJECTION_NAME = "communities"
@@ -351,9 +351,9 @@

def create_community_embeddings(gds):
try:
embedding_model = os.getenv('EMBEDDING_MODEL')
embedding_model = get_value_from_env_or_secret_manager("EMBEDDING_MODEL")
embeddings, dimension = load_embedding_model(embedding_model)
logging.info(f"Embedding model '{embedding_model}' loaded successfully.")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

logging.info("Fetching community details.")
rows = gds.run_cypher(GET_COMMUNITY_DETAILS)
3 changes: 2 additions & 1 deletion backend/src/create_chunks.py
@@ -1,5 +1,6 @@
from langchain_text_splitters import TokenTextSplitter
from langchain.docstore.document import Document
from src.shared.common_fn import get_value_from_env_or_secret_manager
from langchain_neo4j import Neo4jGraph
import logging
from src.document_sources.youtube import get_chunks_with_timestamps, get_calculated_timestamps
@@ -26,7 +27,7 @@ def split_file_into_chunks(self,token_chunk_size, chunk_overlap):
"""
logging.info("Split file into smaller chunks")
text_splitter = TokenTextSplitter(chunk_size=token_chunk_size, chunk_overlap=chunk_overlap)
MAX_TOKEN_CHUNK_SIZE = int(os.getenv('MAX_TOKEN_CHUNK_SIZE', 10000))
MAX_TOKEN_CHUNK_SIZE = get_value_from_env_or_secret_manager("MAX_TOKEN_CHUNK_SIZE",10000, "int")
chunk_to_be_created = int(MAX_TOKEN_CHUNK_SIZE / token_chunk_size)

if 'page' in self.pages[0].metadata:
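The `split_file_into_chunks` change keeps the existing cap logic: the number of chunks to create is the token budget divided by the per-chunk token size. A quick illustration with assumed values (the 10000 default comes from the diff; the per-chunk size of 512 is chosen only for this example):

```python
# MAX_TOKEN_CHUNK_SIZE: default 10000, now read via get_value_from_env_or_secret_manager.
MAX_TOKEN_CHUNK_SIZE = 10000
token_chunk_size = 512  # hypothetical per-chunk token size passed by the caller

# Cap on how many chunks are created from the file content.
chunk_to_be_created = int(MAX_TOKEN_CHUNK_SIZE / token_chunk_size)
```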
3 changes: 2 additions & 1 deletion backend/src/document_sources/youtube.py
@@ -1,4 +1,5 @@
from langchain.docstore.document import Document
from src.shared.common_fn import *
from src.shared.llm_graph_builder_exception import LLMGraphBuilderException
from youtube_transcript_api import YouTubeTranscriptApi
import logging
@@ -11,7 +12,7 @@

def get_youtube_transcript(youtube_id):
try:
proxy = os.environ.get("YOUTUBE_TRANSCRIPT_PROXY")
proxy = get_value_from_env_or_secret_manager("YOUTUBE_TRANSCRIPT_PROXY")
proxies = { 'https': proxy }
transcript_pieces = YouTubeTranscriptApi.get_transcript(youtube_id, proxies = proxies)
return transcript_pieces