Sunday, November 19, 2023

Architect perspectives - quick notes

 


Systems engineering techniques 

1) Architecture modeling

2) Alternative analysis

    Analysis of Alternatives (AoA), also called alternative analysis

    a) How can we increase our benefits?

    b) How can we realize the benefits sooner?

    c) How can we lower our costs?

    d) How can we push our costs to the future?

    Magnitude effect & timing effect analysis (NPV, ROI & payback)


Always state the assumptions behind the analysis as well.
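To make the magnitude and timing effects concrete, here is a minimal sketch of NPV and payback. The cash flows, the 10% discount rate, and the names `baseline`, `sooner` and `deferred` are all made up for illustration; real numbers and assumptions would come from the alternatives being compared.

```python
from typing import Optional, Sequence


def npv(rate: float, cash_flows: Sequence[float]) -> float:
    """Net present value; cash_flows[0] occurs today (t = 0)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


def payback_period(cash_flows: Sequence[float]) -> Optional[int]:
    """First period in which the cumulative (undiscounted) cash flow is >= 0."""
    cumulative = 0.0
    for t, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return t
    return None  # never pays back within the given horizon


# Hypothetical alternatives with the same undiscounted total (+200).
baseline = [-1000.0, 300.0, 400.0, 500.0]
sooner = [-1000.0, 500.0, 400.0, 300.0]    # b) benefits realized sooner
deferred = [-500.0, -200.0, 400.0, 500.0]  # d) half of the cost pushed to year 1

RATE = 0.10  # assumed discount rate
for name, flows in [("baseline", baseline), ("sooner", sooner), ("deferred", deferred)]:
    print(name, round(npv(RATE, flows), 2), payback_period(flows))
```

With a positive discount rate, both moving benefits earlier and pushing costs later raise the NPV even though the undiscounted totals are identical, which is the timing effect from questions b) and d).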


3) Tradeoff analysis

    Performance, Scalability, Extensibility, Agility, Maintainability, Feasibility

    ATAM - Architectural Tradeoff Analysis Method

        (Scales of justice: weigh the options against each other)
        Proposed architecture, business drivers, quality attributes
               result in --> validated and approved architecture



    CBAM - Cost-Benefit Analysis Method

    

other references:

    The budget and timeframe are tight... then!
    If tries to satisfy everyone, every need... !
  

4) Portfolio analysis




Six generic lifecycle stages through which a system evolves:
Concept
Development
Production
Utilization
Support
Retirement

Risk assessment


Other concepts

ROI - return on investment
NPV - Net present value
SWOT analysis - Strengths, Weaknesses, Opportunities, Threats
Low-code / no-code architecture
Live document.

Factors to consider: (being updated... )

Organization's Vision
Organization's Technology roadmap
Budget
Dependencies, Planning




Monday, August 22, 2022

Yarn/Spark Log aggregation - log4j, RootLogger, Appenders, yarn.log-aggregation-enable

 

Reference: https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01

In Cloudera, yarn.log-aggregation-enable is enabled by default.

yarn-site.xml

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
<value>3600</value>
</property>


Friday, August 19, 2022

KafkaProducer and Spark Streaming - things to remember - auth keystore load & file ulimit - "too many open files" error

 

When a Spark Streaming or batch process writes to a Kafka topic:

Be aware of the authentication done in the send/doSend methods of org.apache.kafka.clients.producer.KafkaProducer, which loads the keystore/keytab for authentication for each message it publishes.

This means the keytab is read from the file system many times. As the message count increases, the number of open file handles grows sharply and can exceed the ulimit (max number of open file handles), at which point a "too many open files" error is thrown and the job fails.

Make sure to publish the RDD in a distributed way.
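One common way to read "distributed way" is to publish from the executors, one producer per partition, instead of setting up a client for every message. The sketch below only simulates that pattern. `FakeProducer`, `publish_partition` and the topic name `events` are hypothetical stand-ins, not Kafka APIs, and it assumes the credential read is tied to producer setup, which the note above does not confirm.

```python
from typing import Iterable, List


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer; counts credential loads."""

    credential_loads = 0

    def __init__(self) -> None:
        # Assumption: a real client would open the keystore/keytab file here.
        FakeProducer.credential_loads += 1
        self.sent: List[str] = []

    def send(self, topic: str, value: str) -> None:
        self.sent.append(value)

    def close(self) -> None:
        pass


def publish_partition(records: Iterable[str], topic: str = "events") -> int:
    """Send every record of one partition through a single producer."""
    producer = FakeProducer()
    try:
        count = 0
        for record in records:
            producer.send(topic, record)
            count += 1
        return count
    finally:
        producer.close()


# Simulated RDD with 3 partitions. In PySpark the equivalent call would be
# roughly rdd.foreachPartition(publish_partition), which runs on the executors.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
total_sent = sum(publish_partition(p) for p in partitions)
print(total_sent, FakeProducer.credential_loads)
```

In this simulation the number of credential loads grows with the number of partitions rather than with the number of messages, which keeps the count of open file handles bounded.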

Tuesday, December 21, 2021

MessageDigest & threadSafety: duplicate hash values on large concurrent processing (Spark-UDF)

Issue: MessageDigest was used as a singleton inside the function exposed as a UDF. Because MessageDigest is not thread-safe, it produced duplicate hash values when a large volume of data was processed concurrently.

Solution: Changed the logic to create a new MessageDigest instance for each call.

See the old and new approaches highlighted below (the obsolete code can be removed).
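The original code listing is not preserved in this post. As a rough stand-in for the "new" approach, here is a minimal Python sketch that uses `hashlib` in place of Java's MessageDigest. The function name `hash_value` and the choice of SHA-256 are assumptions; the note does not name the algorithm. In Java the same idea would be calling `MessageDigest.getInstance(...)` inside the UDF body instead of holding one shared instance.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_value(value: str) -> str:
    """Create a new digest object on every call, so no state is shared."""
    digest = hashlib.sha256()  # fresh instance per call (SHA-256 is assumed)
    digest.update(value.encode("utf-8"))
    return digest.hexdigest()


# Hash many distinct inputs from several threads at once.
inputs = [f"record-{i}" for i in range(1000)]
with ThreadPoolExecutor(max_workers=8) as pool:
    hashes = list(pool.map(hash_value, inputs))

# Distinct inputs should give distinct hashes when nothing is shared.
print(len(set(hashes)) == len(inputs))
```

The trade-off is a small allocation cost per call in exchange for having no shared mutable state between concurrent calls.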



Further reference:

how-to-solve-non-serializable-errors-when-instantiating-objects-in-spark-udfs/

need-thread-safe-messagedigest-in-java

Sunday, May 2, 2021

MySQL - implementing window-function-like logic in MySQL 5.6 (window functions are available from 8.0)

MySQL - window function logic without window functions

# This is needed in MySQL versions below 8.0

Scenario:

There are records with a column that acts like a primary key (it cannot be declared as a PK, so duplicates can come in), a foreign key column, and a createTimestamp. Processing must happen only once per primary key value. If there are duplicate entries, the only way to tell them apart is the order of createTimestamp; if an entry has already been processed, its foreign key column will have a value.

So the goal is to identify whether a prior entry already has a foreign key assigned, which means it is already processed. And if there are 2 entries that are not processed yet, process just the first one and mark the other one as cancelled.



SELECT id,
     crew_id,
     amount,
     CASE type 
         WHEN @curType THEN @curRow := @curRow + 1 
         ELSE @curRow := 1
     END AS rank,
     @curType := type AS type
FROM Table1 p
JOIN (SELECT @curRow := 0, @curType := '') r
ORDER BY crew_id, type


### Getting the rank based on execId & createTimestamp

SELECT foreignId, createTimestamp,
		CASE executionId 
			WHEN @curType THEN @curRow := @curRow + 1 
			ELSE @curRow := 1 
		END AS rank,
        @curType := executionId AS executionId
FROM  job_queue_prd p
JOIN (SELECT @curRow := 0, @curType := '') r
WHERE p.executionID in (204626, 204851) 
ORDER BY  executionId, createTimestamp asc;

### Getting current & prior instanceId & the rank based on execId & createTimestamp

SET @priorForeignId = NULL;

SELECT createTimestamp,
       -- foreignId of the previous row within the same executionId (NULL on the first row)
       CASE executionId WHEN @curExecId THEN @priorForeignId ELSE NULL END AS prior_foreignId,
       @priorForeignId := foreignId AS foreignId,
       CASE executionId
           WHEN @curExecId THEN @curRow := @curRow + 1
           ELSE @curRow := 1
       END AS rank,
       @curExecId := executionId AS executionId
FROM job_queue_prd p
JOIN (SELECT @curRow := 0, @curExecId := '') r
WHERE p.executionID IN (204626, 204851)
ORDER BY executionId, createTimestamp ASC;
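To check what the ranking queries are meant to produce, the same logic can be sketched in Python. The sample rows, the function name `rank_and_decide` and the `action` labels are hypothetical. The decision rule is my reading of the scenario: a row with a foreignId is already processed, the first unprocessed row for a key is processed, and any later one is cancelled.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List

# Hypothetical sample rows; foreignId is None while a row is unprocessed.
rows: List[Dict[str, Any]] = [
    {"executionId": 204851, "createTimestamp": "2021-05-01 11:02:00", "foreignId": None},
    {"executionId": 204626, "createTimestamp": "2021-05-01 10:00:00", "foreignId": 11},
    {"executionId": 204851, "createTimestamp": "2021-05-01 11:00:00", "foreignId": None},
    {"executionId": 204626, "createTimestamp": "2021-05-01 10:05:00", "foreignId": None},
]


def rank_and_decide(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rank rows per executionId by createTimestamp and pick an action."""
    ordered = sorted(rows, key=itemgetter("executionId", "createTimestamp"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("executionId")):
        prior_foreign_id = None  # foreignId of the previous row in this group
        handled = False          # True once a row is processed or selected
        for rank, row in enumerate(group, start=1):
            if row["foreignId"] is not None:
                action, handled = "already processed", True
            elif handled:
                action = "cancel"
            else:
                action, handled = "process", True
            result.append({**row, "rank": rank,
                           "prior_foreignId": prior_foreign_id,
                           "action": action})
            prior_foreign_id = row["foreignId"]
    return result


result = rank_and_decide(rows)
for r in result:
    print(r["executionId"], r["rank"], r["prior_foreignId"], r["action"])
```

Here `rank` plays the role of the `@curRow` counter and `prior_foreignId` the role of the carried-over user variable. On MySQL 8.0 and later, `ROW_NUMBER()` and `LAG()` with `PARTITION BY executionId ORDER BY createTimestamp` express the same thing directly.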

Saturday, April 24, 2021

Git repo, repo management systems - Bitbucket vs GitHub

 


Git is a distributed version control system.


Stash is Atlassian's repository management system; in 2015 it was renamed Bitbucket Server. So it is a management tool for Git.


There are many repository management systems you can use with Git. 


One of the most popular in the enterprise world is Stash.


But in the open source world, GitHub is the most popular one as far as I know.


Ref: https://stackoverflow.com/questions/32294534/what-is-the-relationship-between-git-and-stash